Open jmrichardson opened 3 months ago
Hi, you are very welcome. Here are my answers:
Feel free to comment if you have any more questions.
May I ask one more question: I see that you support polars dataframes. Does this include LazyFrames?
Not yet. Usually, GBMs load the whole dataset into memory and create a binned version of it. Do you mean something like streaming a lazy frame into binned data in chunks, to avoid loading the original data into memory?
I am thinking of a use case where the data is so large that it does not fit into RAM by a long shot (imagine 1 TB of data and 16 GB of RAM).
Something along the lines you outlined. Is there perhaps a way to create this binned version without loading the whole dataset into memory, loading it little by little instead, ideally row by row? And each bin at the end could be a LazyFrame too.
I think that could be an interesting feature to have! As far as I know, no gradient boosting model offers such an out-of-core computing feature.
The original data can stay out of core, and the binned data can be generated from a lazy frame, but the binned data itself should be in memory because the algorithm re-shuffles rows after each split. Roughly 1/8 of the original data size is enough for memory, since the original data is probably f64 while the binned data can be u8: about 125 GB of memory for 1 TB of data. If you have that amount of memory, this kind of feature is relatively easy to implement. If you insist on 16 GB, we also need column and/or row subsampling of the binned data: either write the binned data to disk and read a subsample at each boosting round, or read the original data and bin it on the fly. That needs more effort, and yes, as far as I know, the other GBMs support neither of these cases.
Hi,
Thank you for this exciting package! I have a few questions that I could not find answered in the documentation or figure out on my own:
Thank you again