Closed binhvq closed 5 years ago
@binhvq that's interesting. If the corpus is 28 GB uncompressed, then after encoding to npy it should be around 3x smaller (depending on the language). At what stage does it fail? I'd guess at creating the sentencepiece model? If so, there are extra options you can pass to sentencepiece to make it use less memory, or you could build the sentencepiece model on a sample of the corpus.
Also note that right now I'm working on the pytorch branch, which uses PyTorch instead of TensorFlow. It already supports single-GPU training, and I plan to add multi-GPU this week, so if you are fine with PyTorch you can try that one instead (but it will use the same amount of memory).
@lopuhin My corpus is uncompressed. It fails when converting the corpus to npy. Loading all 3 parts of the corpus (train, test, valid) takes too much memory.
Oh I see, thanks for the clarification @binhvq. I didn't expect the code to fail at this stage; I thought that 80 GB of RAM would be more than enough to hold an uncompressed text corpus of 28 GB. I'll try to check if there are any unexpected leaks here. Do you know how much it managed to process (ideally what percentage) before failing?
@lopuhin f.readlines() -> reading the file this way takes at least 28 GB of RAM just to get to the end of the file. And the encoded variable is a dictionary with 3 keys (train, valid, test), which also takes a lot of RAM.
Oh right, my bad - I didn't think about this, as I had a lot of small files instead. It's quite silly that readlines works this way; we should switch to reading the file without loading it all into memory, and hold only the encoded ids (this could be optimized as well, but hopefully won't be required here).
I'm trying to use a generator for the large dataset. It works, but some features don't: I can't use the tqdm progress bar for train and validation.
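The tqdm issue mentioned above is a known consequence of generators: they have no `len()`, so tqdm cannot show a percentage or ETA unless you pass `total=` explicitly. A minimal sketch (the `batches` generator and the file contents here are hypothetical, not from the repository):

```python
import os
import tempfile
from tqdm import tqdm

def batches(path):
    """Hypothetical generator yielding whitespace-split lines from disk."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()

# Write a tiny sample file so the example is self-contained.
with tempfile.NamedTemporaryFile(
        "w", delete=False, suffix=".txt") as f:
    f.write("a b\nc d e\n")

# Without total=, tqdm only shows an iteration count for a generator;
# with total= (e.g. a line count taken once up front) it shows progress.
n_lines = 2  # in practice: counted once with a cheap pass over the file
seen = [batch for batch in tqdm(batches(f.name), total=n_lines)]
os.remove(f.name)
```

Counting lines once up front is usually cheap compared to training, so this restores the progress bar even with a streaming data pipeline.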
@binhvq just to be clear - I think it will still be possible to have numpy-encoded train/valid/test that fit in memory for you; we only need to fix the code that produces these files so it reads the input another way, as you suggested.
@lopuhin That's a good idea. I am testing whether 80 GB of RAM can handle the 28 GB of text with my suggestion (not using f.readlines()). Encoding 9M sentences takes about 8 GB of memory, and my 28 GB corpus has 162M sentences, so ~150 GB of RAM would be required to encode the whole corpus.
@binhvq I did some memory optimizations in eb101ba and e1f938b in master. This should allow reading the corpus using around 2x the memory of the encoded corpus, so I hope you'd use around 10 - 20 GB of RAM, depending on whether your sentencepiece vocabulary is smaller or larger than 65k symbols.
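The 65k threshold matters because of the numpy dtype used for token ids. A sketch of the idea (the helper name `ids_to_array` is illustrative, not from the repository):

```python
import numpy as np

def ids_to_array(ids, vocab_size):
    """Store token ids in the smallest unsigned dtype that fits the vocab.

    With fewer than 2**16 = 65536 symbols, each id fits in 2 bytes
    (uint16); a larger vocabulary needs 4 bytes per id (uint32),
    doubling the size of the encoded corpus both in RAM and on disk.
    """
    dtype = np.uint16 if vocab_size < 2 ** 16 else np.uint32
    return np.array(ids, dtype=dtype)
```

This is why the estimate above spans 10 - 20 GB: the same 28 GB corpus encodes to roughly half the size when the vocabulary stays under 65k symbols.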
Should be fixed now, closing.
I'm trying to train on 28 GB of text, but the dataset is too large: it can't be encoded to npy and loaded for training because there is not enough RAM. My server has 80 GB of RAM. Thanks Konstantin Lopuhin.