oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0

Memory and SQLDataSource #341

Open Kastanek opened 1 year ago

Kastanek commented 1 year ago

Ask the question
Is training a model using SQLDataSource suitable for large datasets that do not fit in RAM? I expect my dataset to grow to hundreds of thousands of records. I see that batching is performed, but I'm not sure whether a model can be trained this way. I'm particularly interested in training with XGBoostRegressionTrainer.

Is your question about a specific Tribuo class?
SQLDataSource

Craigacp commented 1 year ago

We have trained XGBoost models in Tribuo with hundreds of thousands of records, though we used a fairly large machine to do so. Batch loading from the SQL DB doesn't change the memory requirement, as Tribuo requires all the data to be in memory before it can train a model.
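
Since the whole dataset has to sit in memory, a quick back-of-the-envelope check can tell you whether a given record count is likely to fit in the JVM heap. The sketch below is plain Java and does not call any Tribuo API. The class name `MemoryEstimate`, the row and feature counts, the 8 bytes per feature value, and the headroom factor are all assumptions for illustration; real usage will be higher because of per-example object overhead, feature names, and any copy made for the native XGBoost library during training.

```java
public class MemoryEstimate {

    /**
     * Rough lower bound on the raw feature data size:
     * rows * featuresPerRow * bytesPerValue.
     * Ignores object headers, feature names and outputs.
     */
    static long estimateBytes(long rows, long featuresPerRow, long bytesPerValue) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(Math.multiplyExact(rows, featuresPerRow), bytesPerValue);
    }

    /**
     * Returns true if the estimate, scaled by a safety factor, fits in the given heap.
     * headroomFactor is a hypothetical multiplier covering overhead and extra copies.
     */
    static boolean fitsInHeap(long estimatedBytes, long maxHeapBytes, double headroomFactor) {
        return estimatedBytes * headroomFactor <= maxHeapBytes;
    }

    public static void main(String[] args) {
        long rows = 500_000L;      // hypothetical record count
        long features = 100L;      // hypothetical number of features per record
        double headroom = 3.0;     // hypothetical safety factor, tune for your data

        // Assumes each feature value is stored as an 8-byte double
        long estimate = estimateBytes(rows, features, Double.BYTES);

        // Maximum heap the JVM will attempt to use (controlled by -Xmx)
        long maxHeap = Runtime.getRuntime().maxMemory();

        System.out.printf("Estimated raw feature data: %d MiB%n", estimate / (1024 * 1024));
        System.out.printf("JVM max heap: %d MiB%n", maxHeap / (1024 * 1024));
        System.out.println("Likely to fit: " + fitsInHeap(estimate, maxHeap, headroom));
    }
}
```

If the check fails, the options are to raise the heap limit with `-Xmx` on a machine with more RAM, or to reduce the number of rows or features loaded from the database.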