huggingface / transfer-learning-conv-ai

🦄 State-of-the-Art Conversational AI with Transfer Learning
MIT License

Running out of memory (125gb) when building my dataset #62

Closed ricsinaruto closed 4 years ago

ricsinaruto commented 4 years ago

First of all thank you for sharing this code!

I tried running the train script on my own dataset, and it successfully generates a tokenized cached version, which is about 6 GB on disk.

What I don't understand is that it runs out of 125 GB of RAM during the 'Building inputs and labels' phase.

I have not modified the code; I don't use personas, my number of candidates is 2, and my history size is 4.

Let me know if you have any ideas about what the problem could be, or whether this is perfectly normal.

ricsinaruto commented 4 years ago

I guess there's no easy way to solve this; here's what I tried that worked in lowering my RAM usage:

In the end I had to lower the number of candidates and the history size, but the biggest improvement came from implementing the Dataset class and doing the padding inside the `__getitem__` function. I think this is the only way for this code to work on larger datasets: the Dataset class has to be implemented from the beginning, instead of building the whole padded dataset in memory.
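For what it's worth, a minimal sketch of the pattern described above: instead of padding every example up front and holding the fully padded dataset in RAM, keep only the raw tokenized sequences and pad lazily, one item at a time, inside `__getitem__`. The class and field names here are illustrative, not the repo's actual code, and plain Python lists stand in for torch tensors so the sketch stays self-contained; the class follows the same `__len__`/`__getitem__` protocol that `torch.utils.data.Dataset` expects.

```python
PAD_ID = 0  # assumed padding token id, not taken from the repo


class LazyPaddedDataset:
    """Pads each example on access instead of materializing the padded dataset.

    Compatible with the __len__/__getitem__ protocol used by
    torch.utils.data.Dataset and DataLoader.
    """

    def __init__(self, tokenized_dialogs, max_len):
        # tokenized_dialogs: list of variable-length token-id lists,
        # e.g. the 6 GB tokenized cache loaded from disk.
        self.dialogs = tokenized_dialogs
        self.max_len = max_len

    def __len__(self):
        return len(self.dialogs)

    def __getitem__(self, idx):
        ids = self.dialogs[idx][: self.max_len]
        pad = self.max_len - len(ids)
        # Pad only this one example, at access time, so the whole padded
        # dataset never exists in memory at once.
        input_ids = ids + [PAD_ID] * pad
        # -100 is the label value ignored by PyTorch's cross-entropy loss,
        # so padded positions don't contribute to the LM loss.
        lm_labels = ids + [-100] * pad
        return input_ids, lm_labels


dataset = LazyPaddedDataset([[5, 6, 7], [8, 9]], max_len=4)
input_ids, lm_labels = dataset[1]
```

In a real setup you would wrap this in a `DataLoader`, which only ever holds one batch of padded items at a time.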