SammyAgrawal opened this issue 2 months ago
Questions top of mind:
Couple of comments:
This also means you are not using any parallelism (try recording the CPU usage while loading; I bet it never exceeds 100%).
Finally, there might be some caching going on here, which could explain the fluctuations in load time, though these might also just be random. Bottom line: you should use bigger batches! Insert jaws meme
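For example, here is a minimal sketch of recording CPU usage during a batch load with dask's built-in diagnostics (the store path and variable name are placeholders, not from this thread):

```python
import xarray as xr
from dask.diagnostics import ResourceProfiler

ds = xr.open_zarr("path/to/store.zarr")  # hypothetical store

with ResourceProfiler(dt=0.25) as rprof:
    # Load one "batch" eagerly; a bigger slice spans more chunks and gives
    # dask more work to do in parallel.
    batch = ds["temperature"].isel(time=slice(0, 256)).load()

# Each sample has (time, mem, cpu); cpu is a percentage summed over cores,
# so values pinned at or below 100 suggest essentially serial loading.
print(max(r.cpu for r in rprof.results))
```

If the CPU never climbs past 100%, the load is effectively single-threaded, and bigger batches (more chunks per load) are the first thing to try.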
It seems that if you iterated over the dataset and did this, eventually you would have loaded everything and the kernel would crash. Can you "unload" data so that once a batch is processed it gets garbage collected? If you overwrite the "batch" variable, will it be automatically garbage collected and the memory freed?
I think as long as you overwrite the object you are good and the old data will be garbage collected.
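As a rough sketch of that pattern (names and batch size are just placeholders), something like this should keep memory flat across iterations:

```python
import gc
import xarray as xr

ds = xr.open_zarr("path/to/store.zarr")  # lazy, dask-backed
batch_size = 256

for start in range(0, ds.sizes["time"], batch_size):
    batch = ds.isel(time=slice(start, start + batch_size)).load()  # eager numpy in memory
    # ... do work on `batch` ...
    # Rebinding `batch` on the next iteration drops the old reference, so the
    # previous arrays become garbage as long as nothing else points at them.
    del batch        # optional: drop the reference right away
    gc.collect()     # usually unnecessary, but makes the freeing explicit
```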
Does loading in line with the existing chunk dimensions matter? I.e., does the "start" affect load times if you try to load across chunk boundaries?
What matters most (I think) is how many chunks you have to load initially. If you cross chunk boundaries, you will load into memory every chunk that you touch.
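A quick way to see this (chunk sizes and names here are assumptions): compare the dask chunk structure of an aligned read with the same-sized read shifted off the boundary:

```python
import xarray as xr

ds = xr.open_zarr("path/to/store.zarr")  # assume "time" is chunked in blocks of 100

aligned   = ds["temperature"].isel(time=slice(0, 100))   # exactly one chunk along time
straddled = ds["temperature"].isel(time=slice(50, 150))  # same length, but touches two chunks

print(aligned.chunks)    # e.g. ((100,), ...)
print(straddled.chunks)  # e.g. ((50, 50), ...)
```

Both reads return 100 elements, but the straddled one has to pull two full chunks off disk.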
If you use multiprocessing and spawn multiple processes, how does Dask handle loading across processes? How is data balanced across N processes?
This might be a good read.
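One option, as a sketch (worker counts and names are placeholders): start a local dask.distributed cluster, so the chunk reads themselves are spread over N worker processes rather than the main process:

```python
import xarray as xr
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2)  # four local worker processes

ds = xr.open_zarr("path/to/store.zarr")
batch = ds["temperature"].isel(time=slice(0, 1024))

# Once a Client exists it becomes the default scheduler, so this compute is
# split across the worker processes; each worker reads the chunks assigned to
# it and the results are gathered back into the main process.
arr = batch.compute()
```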
Wanted to open a thread to inquire about best practices regarding dask chunking.
OK, imagine you have ingested some dataset that is over 100 GB, so it definitely does not fit into memory. You want to train an ML model on this dataset.
Are there any dask optimizations for this process?
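One pattern I've been considering (just a sketch; it assumes a zarr store, PyTorch, and placeholder variable names): keep the dataset lazy and only materialize the slices a batch actually needs.

```python
import numpy as np
import torch
import xarray as xr
from torch.utils.data import DataLoader, Dataset

class LazyZarrDataset(Dataset):  # hypothetical helper, not from any library
    def __init__(self, path, var="temperature"):
        self.da = xr.open_zarr(path)[var]  # lazy: only metadata is read here

    def __len__(self):
        return self.da.sizes["time"]

    def __getitem__(self, i):
        # Only this one time slice is pulled into memory.
        return torch.from_numpy(self.da.isel(time=i).values.astype(np.float32))

loader = DataLoader(LazyZarrDataset("path/to/store.zarr"), batch_size=64, num_workers=0)
for batch in loader:
    ...  # forward/backward pass on `batch`
```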
Ran a simple test:
Was surprised by the fact that batch size seemingly had no effect on load time.
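The shape of test I mean is something like the following (a hypothetical sketch, not the actual code; the store path, variable, and sizes are placeholders):

```python
import time
import xarray as xr

ds = xr.open_zarr("path/to/store.zarr")

for batch_size in (8, 32, 128, 512):
    t0 = time.perf_counter()
    _ = ds["temperature"].isel(time=slice(0, batch_size)).load()
    print(f"batch_size={batch_size:4d}  load time={time.perf_counter() - t0:.2f}s")
```

If every one of those batch sizes falls inside the same single chunk, each load ends up reading the whole chunk anyway, which would line up with the comment above about chunk counts dominating load time.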