adfea9c0 opened this issue 1 year ago
Tangentially, I'm wondering if there is some way to affect load balancing? I think either Dask or LightGBM is currently causing a very unbalanced distribution of data for my use case.
I'm loading 500 parquet files of about 1 GB each onto 160 workers, each with 50 GB of memory. I imagine that would be plenty, but some workers get very little data while others get so much that it spills to disk. How does this happen? How can I fix it?
Hey! I've been experimenting a bit with Dask LightGBM and I have a couple questions. Apologies if this is not the right forum for questions like this.
1) First, what is the recommended way to pass data into LightGBM? Initially I followed the advice here [1] and persisted my data before fitting the DaskLGBMRegressor. Specifically, I would roughly use something like:
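(A minimal sketch; the parquet path, target column, scheduler address, and hyperparameters below are just placeholders, not my actual values.)

```python
import dask.dataframe as dd
from dask.distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Lazily read the parquet files, then persist them onto the workers.
df = dd.read_parquet("s3://my-bucket/training-data/*.parquet")  # placeholder path
df = df.persist()

# Convert to Dask arrays before fitting.
X = df.drop(columns=["target"]).to_dask_array(lengths=True)
y = df["target"].to_dask_array(lengths=True)

model = DaskLGBMRegressor(client=client, n_estimators=500)  # placeholder params
model.fit(X, y)
```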
But based on early observations I'm finding this to be slower than just dropping the `persist` and `to_dask_array` calls and giving LightGBM the lazy `read_parquet` collection directly. I don't really understand why persisting would be helpful anyway, since from what I understand LightGBM will first do some work to rearrange the data to ensure that `X[i]` and `y[i]` are on the same worker for all `i` [2], so doing any work to load data onto the workers beforehand seems wasted? I.e. I now run something like this:
2) I'm having trouble doing some kind of logging with Dask LightGBM. I'd be happy just to know how many iterations into training it is, but passing in something like `callbacks=[lambda env: print(env.iteration)]` doesn't show anything in either the scheduler or worker logs. Can I do some sort of logging in the Dask regressor?

[1] https://github.com/jameslamb/lightgbm-dask-testing/blob/main/notebooks/demo-aws.ipynb
[2] https://www.youtube.com/watch?v=XFVcoBu1rNw