maxoodf / word2vec

word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch
Apache License 2.0

Adding an option to process a folder containing many files for training rather than just one big train text corpus #11

Closed · nantonop closed this issue 4 years ago

nantonop commented 4 years ago

Hi Max,

Your word2vec implementation compiles and works fine, and it's fast too. Nice work!

Would it be possible, instead of pointing to a single training corpus file (-f [file] or --train-file [file]), to also point to a directory containing many text files and use all the files in that directory as the training corpus? (i.e. something like --train-directory ~/Desktop/trainingData)

Thanks!

maxoodf commented 4 years ago

Hello Nick, thank you. I don't think that's a good idea: I use memory mapping of the file for fast parsing/tokenization, and with many separate files it would be much slower. You can write a simple shell script to merge the files from your subdirectories into one large file.
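Something along these lines would do it. This is just a minimal illustration, not part of word2vec++ — an equivalent `find ... | xargs cat` shell one-liner works just as well; it uses C++17's `std::filesystem` for brevity, and the `merge_corpus` name and paths are placeholders:

```cpp
// merge_corpus.cpp — concatenate every regular file under a directory
// (including subdirectories) into one training corpus file.
#include <filesystem>
#include <fstream>
#include <iostream>

int main(int argc, char **argv) {
    if (argc != 3) {
        std::cerr << "usage: merge_corpus <input_dir> <output_file>\n";
        return 1;
    }
    std::ofstream out(argv[2], std::ios::binary);
    for (const auto &entry :
         std::filesystem::recursive_directory_iterator(argv[1])) {
        // skip directories and empty files (an empty rdbuf would set failbit)
        if (!entry.is_regular_file() || entry.file_size() == 0)
            continue;
        std::ifstream in(entry.path(), std::ios::binary);
        out << in.rdbuf() << '\n'; // append the whole file, newline-separated
    }
    return 0;
}
```

Compile with `g++ -std=c++17 merge_corpus.cpp -o merge_corpus` and run it as, e.g., `./merge_corpus ~/Desktop/trainingData corpus.txt`, then train on the merged file with `-f corpus.txt`.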

nantonop commented 4 years ago

Hi Max,

Hmm, good point.

I'm asking because my training data is about 400GB. I can merge all the text files, but will there be any issues, e.g. memory issues, when processing such a huge text file?

Thanks,

Nick

maxoodf commented 4 years ago

Hi Nick, I've never tried to mmap such large files, but it is possible to mmap files larger than both physical memory and swap space, since pages are loaded on demand rather than all at once. In my experience, I trained a w2v model on a 60GB data set with 32GB RAM. By the way, the whole of Wikipedia (English texts) is only about 20GB...
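For reference, this is roughly what a read-only mapping looks like on POSIX systems — a minimal sketch of the general technique, not word2vec++'s actual code; the `map_file` name is a placeholder:

```cpp
// map_file.cpp — map a (possibly huge) corpus file read-only and scan it
// sequentially; pages are faulted in on demand and evicted under pressure.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <iostream>

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "usage: map_file <corpus>\n";
        return 1;
    }
    int fd = ::open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }
    struct stat st {};
    if (::fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }
    const auto size = static_cast<std::size_t>(st.st_size);
    // PROT_READ + MAP_PRIVATE: a read-only view, so no swap space is reserved
    void *data = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { std::perror("mmap"); return 1; }
    // hint sequential access so the kernel reads ahead and frees used pages
    ::madvise(data, size, MADV_SEQUENTIAL);
    // ... parse/tokenize the mapped bytes here ...
    std::cout << "mapped " << size << " bytes\n";
    ::munmap(data, size);
    ::close(fd);
    return 0;
}
```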