perpetual-ml / perpetual

A self-generalizing gradient boosting machine which doesn't need hyperparameter optimization
https://perpetual-ml.com/
GNU Affero General Public License v3.0

Few Questions #3

Open jmrichardson opened 3 months ago

jmrichardson commented 3 months ago

Hi,

Thank you for this exciting package! I have a few questions that I couldn't find answers to in the documentation or figure out on my own:

  1. Is there a way to set a custom objective?
  2. Does perpetual work with temporal data? I.e., does it maintain the order of the rows during training?
  3. I see you can manually set the budget. Is there a way to set a large budget and auto stop learning?
  4. Can you apply sample weights?

Thank you again

deadsoul44 commented 3 months ago

Hi, you are very welcome. Here are my answers:

  1. Currently, it is not possible to set a custom objective. We can make this possible using pyo3. Until then, submit an issue and I will add the necessary objective function.
  2. It doesn't maintain the order of rows. You can use the algorithm with temporal data the same way you would with other GBM algorithms.
  3. Currently, you can set the budget as high as you like. The algorithm takes more boosting rounds as the budget increases, but the number of boosting rounds is internally capped at 10k, and it will print a warning suggesting you decrease the budget if it hits this cap. The algorithm always stops itself automatically, as described in the blog post.
  4. Yes, it is possible to apply sample weights. This is documented in the fit method; see the sketch after this list.
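
A minimal sketch of points 3 and 4, assuming the current Python API (check the fit docstring for the exact parameter names, e.g. `sample_weight` and `budget`):

```python
# Illustrative only: random data, hypothetical shapes.
import numpy as np
from perpetual import PerpetualBooster

rng = np.random.default_rng(0)
X = rng.random((1_000, 5))
y = rng.random(1_000)
w = rng.random(1_000)               # per-row sample weights

model = PerpetualBooster(objective="SquaredLoss")
# A larger budget lets the algorithm take more boosting rounds,
# but it still stops itself automatically.
model.fit(X, y, sample_weight=w, budget=2.0)
```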

Feel free to comment if you have any more questions.

francesco086 commented 6 days ago

May I ask one more question: I see that you support Polars DataFrames. Does this include LazyFrames?

deadsoul44 commented 6 days ago

Not yet. Usually, GBMs load the whole dataset into memory and create a binned version of it. Do you mean something like streaming a LazyFrame into binned data in chunks, so as to avoid loading the original data into memory?

francesco086 commented 6 days ago

I am taking the perspective of a use case where the data is so large that it does not fit into RAM by a long shot (imagine 1 TB of data and 16 GB of RAM).

Something along the lines you outlined. Is there perhaps a way to create this binned version without loading the whole dataset into memory, loading it little by little instead, ideally row by row? Each bin at the end could be a LazyFrame too.
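
To make this concrete, here is a rough two-pass sketch with Polars and NumPy (the file name, column names, chunk size, and bin count are made up): one pass to estimate quantile bin edges with only one column materialized at a time, then chunked slices mapped to u8 bin indices.

```python
import numpy as np
import polars as pl

N_BINS = 256                          # bin indices fit in u8
CHUNK_ROWS = 1_000_000                # hypothetical chunk size
FEATURES = ["f0", "f1", "f2"]         # hypothetical column names

lf = pl.scan_parquet("data.parquet")  # lazy: nothing loaded yet

# Pass 1: estimate bin edges from quantiles, one column at a time,
# so projection pushdown keeps only that column in memory.
qs = [i / N_BINS for i in range(1, N_BINS)]
edges = {}
for col in FEATURES:
    row = (
        lf.select([pl.col(col).quantile(q).alias(str(i)) for i, q in enumerate(qs)])
        .collect()
        .row(0)
    )
    edges[col] = np.asarray(row)

# Pass 2: stream the rows in chunks and map each value to a u8 bin index.
n_rows = lf.select(pl.len()).collect().item()
binned_chunks = []
for offset in range(0, n_rows, CHUNK_ROWS):
    chunk = lf.slice(offset, CHUNK_ROWS).collect()  # only this chunk in memory
    binned_chunks.append({
        col: np.searchsorted(edges[col], chunk[col].to_numpy()).astype(np.uint8)
        for col in FEATURES
    })
    # the raw f64 chunk is dropped here; only the u8 bins are kept
```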

I think it could be an interesting feature to have! As far as I know, no gradient boosting library offers such an out-of-core computing feature.

deadsoul44 commented 6 days ago

The original data can be out of core, and the binned data can be generated from a LazyFrame, but the binned data has to be in memory because the algorithm reshuffles rows after each split. This means roughly 1/8 of the raw data size is enough memory, since the original data is probably f64 and the binned data can be u8: roughly 125 GB of memory for 1 TB of data (see the back-of-envelope check below). If you have that much memory, this kind of feature is relatively easy to implement.

If you insist on 16 GB, we need column and/or row subsampling of the binned data: we would either write the binned data to disk and read a subsample at each boosting round, or read the original data and bin it on the fly. This needs more effort, and yes, as far as I know, the other GBMs support neither of these cases.
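
A back-of-envelope check of the 1/8 figure (sizes as discussed above, not measured):

```python
# 1 TB of raw f64 data vs. its u8 binned version.
RAW_BYTES = 10**12                   # 1 TB of raw data
F64, U8 = 8, 1                       # bytes per value, raw vs. binned
n_values = RAW_BYTES // F64          # number of f64 values in 1 TB
print(n_values * U8 / 10**9, "GB")   # -> 125.0, roughly 1/8 of 1 TB
```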