juzzmac opened this issue 4 months ago
Hi @juzzmac,
I had the same problem in Databricks. Is MLflow autologging on by any chance?
It seems that MLflow tries to load the entire dataset into memory for logging purposes, which is impossible for the endless stream that Petastorm generates when `num_epochs` is not specified in `make_tf_dataset`. Even when `num_epochs` is defined, this logging step can be very slow and prone to OOM errors.
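For context, here is a minimal sketch (the SparkSession `spark`, the DataFrame `df`, and the cache directory are all assumed) showing how `num_epochs` bounds the stream that `make_tf_dataset` produces:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Assumed setup: `spark` is the active SparkSession, `df` a Spark DataFrame,
# and the cache directory below is a placeholder.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm_cache")
converter = make_spark_converter(df)

# With num_epochs=1 the tf.data.Dataset is finite and can be fully iterated.
# Omitting num_epochs makes it repeat indefinitely, so anything that tries
# to materialize the whole dataset (e.g. MLflow's dataset logging) blocks forever.
with converter.make_tf_dataset(batch_size=32, num_epochs=1) as dataset:
    for batch in dataset:
        pass  # consume batches
```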
Adding the following flag to the autologging call fixed it for me:

```python
mlflow.tensorflow.autolog(log_datasets=False)
```
Hope this solves it!
I've tried several different versions of the following code, all of which work when running locally but hang forever on Databricks (single node, 13.3 LTS ML runtime):
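(The code itself isn't quoted above; as a point of reference, the following is a hypothetical reconstruction of the common Petastorm training pattern that hangs under these conditions, with `df`, the model, and the column names all assumed rather than taken from the original report.)

```python
import mlflow
import tensorflow as tf
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Autologging as Databricks enables it by default; passing
# log_datasets=False here is the workaround described above.
mlflow.tensorflow.autolog()

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm_cache")
converter = make_spark_converter(df)  # `df` is an assumed Spark DataFrame

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Without num_epochs, the dataset repeats indefinitely, so each epoch must
# be bounded via steps_per_epoch; MLflow's dataset logging can still hang
# here unless it is disabled as shown above.
with converter.make_tf_dataset(batch_size=32) as dataset:
    ds = dataset.map(lambda x: (x.features, x.label))  # assumed column names
    model.fit(ds, epochs=2, steps_per_epoch=100)
```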