Closed ricsinaruto closed 4 years ago
I guess there's no easy way to solve this; here's what I tried that worked in lowering my RAM usage:
In the end I had to lower the number of candidates and the history size, but the biggest improvement came from implementing the Dataset class and doing the padding inside the getitem function. I think this is the only way for this code to work with larger datasets: implement the Dataset class from the beginning instead of building the whole dataset in memory.
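For anyone hitting the same issue, here is a minimal sketch of what I mean by padding inside getitem. It mirrors the `torch.utils.data.Dataset` protocol (`__len__`/`__getitem__`) but is written in plain Python so it runs standalone; the class name, `PAD_ID`, and `max_len` are illustrative, not taken from the repo.

```python
PAD_ID = 0  # assumed pad token id; in practice use the tokenizer's pad id

class LazyPaddedDataset:
    """Stores only raw token-id lists; pads one example at a time."""

    def __init__(self, tokenized_dialogs, max_len):
        # Nothing is padded up front, so peak memory stays close to
        # the size of the tokenized data rather than the padded tensor
        # for the whole corpus.
        self.dialogs = tokenized_dialogs
        self.max_len = max_len

    def __len__(self):
        return len(self.dialogs)

    def __getitem__(self, idx):
        # Truncate and pad a single example at access time.
        ids = self.dialogs[idx][: self.max_len]
        return ids + [PAD_ID] * (self.max_len - len(ids))

dataset = LazyPaddedDataset([[5, 6], [7, 8, 9, 10]], max_len=4)
print(dataset[0])  # [5, 6, 0, 0]
print(len(dataset))  # 2
```

With a real `torch.utils.data.Dataset` subclass you would return tensors here and hand the dataset to a `DataLoader`, which builds each padded batch on the fly instead of materializing everything in RAM.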
First of all thank you for sharing this code!
I tried running the train script on my own dataset, and it successfully generates a tokenized cached version, which is about 6 GB on disk.
What I don't understand is why it then runs out of 125 GB of RAM during the 'Building inputs and labels' phase.
I have not modified the code, I don't use personas, my candidate number is 2, and my history size is 4.
Let me know if you have any ideas about what the problem could be, or maybe this is perfectly normal?