mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License
2.99k stars 400 forks

How to use large datasets? (Dask?) #477

Open fernaper opened 2 years ago

fernaper commented 2 years ago

Hi, I'm trying to train models on really large datasets, up to 100 GB.

Is MLJAR integrated with Dask? If so, do you know how? If not, how can I parallelize training or otherwise handle datasets of this size?

Best regards

pplonski commented 2 years ago

Hi @fernaper,

MLJAR is not integrated with Dask. I've never used Dask, so I don't know how to do the integration; I will need to check.

What I do in such cases is use a machine with a lot of RAM (in the cloud). But first I would downsample the dataset to get a proof of concept that ML works on the data.
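
One way to do the downsampling step without loading the whole file into memory is reservoir sampling. The sketch below is only an illustration, not part of MLJAR: it assumes the data is a single CSV file with a header row, and the function name `reservoir_sample_csv` and its parameters are hypothetical. It uses only the Python standard library and reads the file once, row by row.

```python
import csv
import random


def reservoir_sample_csv(src_path, dst_path, k, seed=0):
    """Write a uniform random sample of up to k data rows to dst_path.

    Hypothetical helper: reads src_path once, row by row, so memory use
    is bounded by k rows regardless of the size of the source file.
    Returns the number of data rows seen in the source file.
    """
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            raise ValueError("source file is empty")
        for row in reader:
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k rows.
                reservoir.append(row)
            else:
                # Keep the new row with probability k / seen,
                # replacing a uniformly chosen row in the reservoir.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return seen
```

The resulting sample file should be small enough to load with pandas and pass to `AutoML().fit(X, y)` from `supervised.automl` for a quick proof of concept. If the classes are heavily imbalanced, a uniform sample like this may under-represent the rare class, so a stratified sample could be a better choice.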