microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[dask] Dask LightGBM best practices #5840

Open adfea9c0 opened 1 year ago

adfea9c0 commented 1 year ago

Hey! I've been experimenting a bit with Dask LightGBM and I have a couple questions. Apologies if this is not the right forum for questions like this.

1) First, what is the recommended way to pass data into LightGBM? Initially I followed the advice here [1] and persisted my data before fitting the DaskLGBMRegressor. Specifically, I would do roughly the following:

import dask
import dask.dataframe
import dask.distributed
import lightgbm as lgb

# lazily load the training data, keeping only the needed columns
df_train = dask.dataframe.read_parquet("/my/glob/*/*.parquet", columns=features + [resp], engine="pyarrow")
X_train = df_train[features].to_dask_array(lengths=True)
y_train = df_train[resp].to_dask_array(lengths=True)

# materialize the arrays on the workers and block until they are loaded
X_train, y_train = X_train.persist(), y_train.persist()
_ = dask.distributed.wait([X_train, y_train])

model = lgb.DaskLGBMRegressor(**blah)
model.fit(X_train, y_train)

But based on early observations, this seems to be slower than simply dropping the persist and to_dask_array calls and handing LightGBM the lazy read_parquet result. I don't really understand why persisting would help anyway: from what I understand, LightGBM first rearranges the data so that X[i] and y[i] end up on the same worker for all i [2], so any work spent loading the data onto the workers beforehand seems wasted. I.e., I now run something like this:

# same lazy load, but pass the dataframe columns to fit() directly,
# with no persist() or to_dask_array() in between
df_train = dask.dataframe.read_parquet("/my/glob/*/*.parquet", columns=features + [resp], engine="pyarrow")
model = lgb.DaskLGBMRegressor(**blah)
model.fit(df_train[features], df_train[resp])

2) I'm having trouble getting any kind of logging out of Dask LightGBM. I'd be happy just to know how far along training is (e.g. how many iterations have completed), but passing something like callbacks=[lambda env: print(env.iteration)] doesn't show anything in either the scheduler or the worker logs. Can I do some sort of logging with the Dask regressor? A rough sketch of what I'm trying is below.
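This is roughly the call I'm making (a sketch; the n_estimators value and the X_train / y_train variables are just placeholders reusing the snippet above):

import lightgbm as lgb

model = lgb.DaskLGBMRegressor(n_estimators=100)
# the callback just prints the current iteration number; nothing from it
# shows up in either the scheduler logs or the worker logs
model.fit(X_train, y_train, callbacks=[lambda env: print(env.iteration)])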

[1] https://github.com/jameslamb/lightgbm-dask-testing/blob/main/notebooks/demo-aws.ipynb
[2] https://www.youtube.com/watch?v=XFVcoBu1rNw

adfea9c0 commented 1 year ago

Tangentially, I'm wondering if there is some way to influence load balancing. Either Dask or LightGBM is currently producing a very unbalanced distribution of data for my use case.

I'm loading 500 parquet files of about 1 GB each onto 160 workers, each with 50 GB of memory. That should be plenty, yet some workers receive very little data while others get so much that they spill to disk. How does this happen, and how can I fix it? A rough sketch of how I've been inspecting the per-worker distribution is below.
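For reference, here's roughly how I've been checking what each worker holds (a sketch; the scheduler address is a placeholder for my actual cluster, and it assumes the data is in memory on the workers at the time I check):

import dask.distributed

client = dask.distributed.Client("tcp://scheduler-address:8786")  # placeholder address
# has_what() maps each worker address to the keys it currently holds in memory,
# so per-worker key counts give a rough picture of how the data is distributed
for worker, keys in client.has_what().items():
    print(worker, len(keys))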
