lopuhin / transformer-lm

Transformer language model (GPT-2) with sentencepiece tokenizer

Train on large dataset #3

Closed: binhvq closed this 5 years ago

binhvq commented 5 years ago

I'm trying to train on 28 GB of text, but the dataset is too large: it can't be encoded to npy and loaded for training because there isn't enough RAM. My server has 80 GB of RAM. Thanks, Konstantin Lopuhin.

lopuhin commented 5 years ago

@binhvq that's interesting - if the corpus is 28 GB uncompressed, then after encoding to npy it should be around 3x smaller (depending on the language). At what stage does it fail? I guess at creating the sentencepiece model? If yes, there are some extra options you could pass to sentencepiece to make it use less memory, or you could build the sentencepiece model on a sample of the corpus.
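For example, the sentencepiece trainer can be told to sample and shuffle only part of the corpus; a rough sketch (file names, vocab size and sample size here are placeholders, not what this repo uses):

```python
import sentencepiece as spm

# Train the model on a 10M-sentence sample of the corpus instead of all of it,
# which bounds memory usage during sentencepiece training.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=sp-model --vocab_size=50000 '
    '--input_sentence_size=10000000 --shuffle_input_sentence=true'
)
```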

Also note that right now I'm working on the pytorch branch, which uses PyTorch instead of tensorflow. It already supports single-GPU training and I plan to add multi-GPU this week, so if you are fine with pytorch you can try that branch instead (but it will use the same amount of memory).

binhvq commented 5 years ago

@lopuhin My corpus is uncompressed. It fails when converting the corpus to npy. Loading the 3 parts of the corpus (train, test, valid) takes a lot of memory. (screenshot attached: 2019-04-09 15-19-37)

lopuhin commented 5 years ago

Oh I see, thanks for the clarification @binhvq, I didn't expect the code to fail at this stage - I thought that 80 GB of RAM would be more than enough to hold an uncompressed text corpus of 28 GB... I'll check whether there are any unexpected leaks here. Do you know how much it managed to process (ideally what percentage) before failing?

binhvq commented 5 years ago

@lopuhin f.readlines() -> reading the file this way takes at least 28 GB of RAM just to reach the end of the file. And the encoded variable is a dictionary with 3 keys (train, valid, test), which also takes a lot of RAM.
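To illustrate the difference (just a sketch; `train.txt` is a placeholder file name):

```python
# readlines() materializes every line of the file as Python string objects at once:
with open('train.txt') as f:
    lines = f.readlines()   # needs RAM for the whole corpus, plus per-string overhead

# Iterating over the file object reads one line at a time instead:
n_lines = 0
with open('train.txt') as f:
    for line in f:          # only the current line is held in memory
        n_lines += 1
print(n_lines)
```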

lopuhin commented 5 years ago

Oh right, my bad - I didn't think about this as I had a lot of small files instead. It's quite unfortunate that readlines works this way; we should switch to reading the file without loading it all into memory and hold only the encoded ids (this could be optimized as well, but hopefully that won't be required here).
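Something along these lines (a rough sketch, not the actual code in this repo; the model path and file names are placeholders):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('sp-model.model')  # placeholder path to a trained sentencepiece model

def encode_corpus(path):
    """Encode a text file line by line, keeping only token ids in memory."""
    ids = []
    with open(path) as f:
        for line in f:
            ids.extend(sp.encode_as_ids(line))
    return np.array(ids, dtype=np.uint32)

for split in ('train', 'valid', 'test'):
    np.save(f'{split}.npy', encode_corpus(f'{split}.txt'))
```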

binhvq commented 5 years ago

I'm trying to use a generator for the large dataset. It works, but some features don't: I can't use the tqdm progress bar for training and validation.
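(One likely reason: tqdm can't show a total for a plain generator because it has no length; passing the expected number of items via `total=` brings the bar back. A small sketch with placeholder names:)

```python
from tqdm import tqdm

def batches():                      # placeholder generator over training batches
    for i in range(1000):
        yield i

n_batches = 1000                    # known or estimated batch count
for batch in tqdm(batches(), total=n_batches):
    pass                            # training step would go here
```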

lopuhin commented 5 years ago

@binhvq just to be clear - I think it will still be possible to have numpy-encoded train/valid/test files that fit in memory for you; we only need to fix the code that produces these files so that it reads the corpus a different way, as you suggested.

binhvq commented 5 years ago

@lopuhin That's a good idea. I am checking whether 80 GB of RAM can handle 28 GB of text with my suggestion (not using f.readlines()). It takes about 8 GB of memory for 9M sentences, and my 28 GB corpus has 162M sentences, so roughly 150 GB of RAM would be required to encode my corpus.

lopuhin commented 5 years ago

@binhvq I did some memory optimizations in master (eb101ba and e1f938b); this should allow reading the corpus using around 2x the memory of the encoded corpus, so I hope you'd use around 10-20 GB of RAM, depending on whether your sentencepiece vocabulary is smaller or larger than 65k symbols.
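If I read the 65k threshold right, it's about whether token ids can be stored as uint16 instead of uint32, which roughly halves the size of the encoded corpus; a tiny illustration (numbers are placeholders):

```python
import numpy as np

vocab_size = 50_000                   # placeholder vocabulary size
ids = [3, 17, 42, 49_999]             # placeholder token ids
# Ids fit in uint16 when the vocabulary has fewer than 2**16 (65536) symbols.
dtype = np.uint16 if vocab_size < 2 ** 16 else np.uint32
encoded = np.array(ids, dtype=dtype)
print(encoded.dtype, encoded.nbytes)  # uint16 -> 2 bytes per token instead of 4
```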

lopuhin commented 5 years ago

should be fixed now, closing