mandubian / pytorch_math_dataset

Pytorch Playground for Mathematical Reasoning Dataset

Bottleneck in reading datasets: opening the pandas dataset for every element? #2

Open ggaemo opened 4 years ago

ggaemo commented 4 years ago

Hi, thank you for releasing the code base!

I have a question to ask:

In lazy mode, isn't it too slow to open a text file and turn it into a pandas DataFrame just to get one row?

If I want to get another row from the same text file, the dataset has to open the file again just to fetch that one row.
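To make the concern concrete, this is the kind of fully lazy pattern I mean. It's only a minimal sketch, not the actual code in this repo; the file path, the TSV question/answer layout, and the class name are all made up for illustration:

```python
import pandas as pd
from torch.utils.data import Dataset


class FullyLazyMathDataset(Dataset):
    """The fully lazy pattern in question: the file is re-parsed on every access."""

    def __init__(self, path, n_rows):
        self.path = path      # placeholder path to one question/answer file
        self.n_rows = n_rows  # number of rows in that file

    def __len__(self):
        return self.n_rows

    def __getitem__(self, idx):
        # Every call parses the whole file again just to return one row,
        # which is the I/O cost being asked about.
        df = pd.read_csv(self.path, sep="\t", header=None,
                         names=["question", "answer"])
        row = df.iloc[idx]
        return row["question"], row["answer"]
```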

Also, what do you mean by:

"""Stream loads math dataset file in a lazy way (optional) pandas is used for naive streaming as Python doesn't provide any better tool for that critical feature"""

??

Thank you very much for providing such a code base as a kick-start for beginners!

ggaemo commented 4 years ago

Sorry, I read your code again.

So, once every question category has been used in a batch at least once, doesn't that mean all the pandas DataFrames are loaded into memory?

After a number of steps, every category will have been brought into a batch at least once, so there is no advantage in terms of memory (RAM). Am I right?
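In other words, the behaviour I'm describing looks roughly like the caching sketch below. Again this is my own illustration, not the repo's actual API; the category names, paths, and helper method are placeholders:

```python
import pandas as pd
from torch.utils.data import Dataset


class PerCategoryCachedDataset(Dataset):
    """Loads each category file lazily, but keeps its DataFrame in RAM afterwards."""

    def __init__(self, category_paths, rows_per_category):
        # category_paths: e.g. {"algebra__linear_1d": "algebra__linear_1d.tsv", ...}
        self.categories = list(category_paths.items())
        self.rows_per_category = rows_per_category
        self._cache = {}  # category name -> DataFrame, filled on first access

    def __len__(self):
        return len(self.categories) * self.rows_per_category

    def _frame(self, category, path):
        if category not in self._cache:
            # First time this category is hit: parse its file and keep the result.
            self._cache[category] = pd.read_csv(
                path, sep="\t", header=None, names=["question", "answer"])
        return self._cache[category]

    def __getitem__(self, idx):
        cat_idx, row_idx = divmod(idx, self.rows_per_category)
        category, path = self.categories[cat_idx]
        row = self._frame(category, path).iloc[row_idx]
        return row["question"], row["answer"]
```

Once sampling has touched every category at least once, the cache holds every DataFrame, so the laziness only delays the memory cost rather than bounding it.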

mandubian commented 4 years ago

Actually yes, everything ends up loaded in memory to be efficient; as you supposed, it would take too long to read one row at a time. I rely on pandas mainly because it works a lot with "views" of the data, transforming it without copying when a copy isn't needed.

I've recently been working on AI applied to programming languages in this repo https://github.com/mandubian/codenets (the code there is much more evolved than the one in the current repo), starting from code provided by Microsoft. That code managed memory poorly and couldn't fit in my 32 GB of RAM. I had to rewrite it all, and I used pandas again to limit copies of the data. I finally got it to fit in 32 GB, but it wasn't easy.

Pandas is just the least-bad method I've found that is still fast. Ideally I would rewrite it all using other methods, or even bindings to other languages like Rust, but I haven't had time for that ;)

For very big datasets this method won't work, and you'll have to manage chunked data loading and batch construction yourself. Python is a poor language for that (I have a big-data background, so I know a bit about manipulating data that is too big).
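A rough sketch of what chunked loading could look like, using pandas' chunksize together with a PyTorch IterableDataset. This is not code from this repo; the path, chunk size, and column names are placeholders:

```python
import pandas as pd
from torch.utils.data import IterableDataset, DataLoader


class ChunkedMathDataset(IterableDataset):
    """Streams rows chunk by chunk so only one chunk lives in RAM at a time."""

    def __init__(self, path, chunksize=10_000):
        self.path = path          # placeholder TSV file with question/answer columns
        self.chunksize = chunksize

    def __iter__(self):
        # read_csv with chunksize returns an iterator of DataFrames instead of
        # loading the whole file at once.
        for chunk in pd.read_csv(self.path, sep="\t", header=None,
                                 names=["question", "answer"],
                                 chunksize=self.chunksize):
            for row in chunk.itertuples(index=False):
                yield row.question, row.answer


# Usage: an IterableDataset yields rows in file order, so shuffling has to be
# handled separately (e.g. with a shuffle buffer).
loader = DataLoader(ChunkedMathDataset("math.tsv"), batch_size=64)
```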