ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

[Hyperopt] Eval results not being surfaced in tune #2143

Open ShreyaR opened 2 years ago

ShreyaR commented 2 years ago

Context: In the evaluation logs printed by one specific trial I had seen a higher AUROC (> 0.8) on a dataset (screenshot 1), but that higher AUROC did not show up in the overall summary for the hyperopt experiment (screenshot 2). Note that both screenshots are from a separate experiment; I added them to give examples.

Findings

Screenshot 1: per-trial evaluation logs

Screenshot 2: overall hyperopt experiment summary
arnavgarg1 commented 2 years ago

@ShreyaR I started investigating this issue. To test, I'm using the equivalent 100MB dataset and running hyperopt with 8 trials. So far I haven't seen the issue crop up, and I've been monitoring the evaluation logging for each trial as well as the overall best trial. Is there anything I can do to help reproduce it?
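For reference, a minimal sketch of the kind of run used for this investigation, assuming the Python API `ludwig.hyperopt.run.hyperopt`; the features, metric, search space, and dataset path are placeholders rather than the actual setup, and the exact `hyperopt` config schema may differ between Ludwig versions:

```python
# Hedged sketch: placeholder config for monitoring per-trial eval metrics
# during a hyperopt run. The feature names, metric, search space, and dataset
# path are assumptions, not the actual setup used in this investigation.
from ludwig.hyperopt.run import hyperopt

config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "label", "type": "binary"}],
    "trainer": {"epochs": 3},
    "hyperopt": {
        "goal": "maximize",
        "metric": "roc_auc",
        "output_feature": "label",
        "parameters": {
            "trainer.learning_rate": {
                "space": "loguniform",
                "lower": 1e-5,
                "upper": 1e-2,
            }
        },
        "executor": {"type": "ray", "num_samples": 8},
    },
}

# Each trial prints its own evaluation metrics; the returned results hold the
# per-trial metric scores that the experiment summary is built from.
results = hyperopt(config, dataset="dataset.csv", output_directory="results")
```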

Here's something I noticed while playing around:

When the goal is set to minimize and the metric value becomes very small, the smallest value does not seem to be picked. To reproduce, run test_hyperopt_run_hyperopt within test_hyperopt.py with 3 epochs until the loss value is very small (almost 0). Have you seen this before?

Screenshot (2022-06-22): trial output where the smallest loss value is not selected as the best.

I initially thought this might be because of an overflow, but the type for metric_score is float64, so I doubt that's the cause. If I understand correctly, this output is produced by tune.run() automatically based on the logging verbosity, so I'm not sure whether there's something within Ray Tune that's buggy or causing this behavior, since tune.run() is what returns both the ExperimentAnalysis and the printed status messages.
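To make the moving parts concrete, here is a rough, self-contained sketch of the Ray Tune surface being referred to: tune.run() prints the trial status table (depending on verbosity) and returns an ExperimentAnalysis, which selects the best trial from the metrics each trial reported. The trainable and metric name below are illustrative assumptions, not Ludwig's actual trainable.

```python
# Illustrative sketch (not Ludwig's actual trainable): tune.run() both prints
# the trial status table and returns the ExperimentAnalysis used to pick the
# best trial from the reported metric values.
from ray import tune


def trainable(config):
    # Report a dummy metric; in Ludwig, the per-trial eval metric is reported
    # back to Tune in a similar way.
    tune.report(metric_score=config["lr"])


analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-6, 1e-2)},
    num_samples=8,
    verbose=2,  # controls the printed status messages/table
)

# Best-trial selection happens here; `scope` controls whether the last
# reported value or the best value over the whole trial is compared.
best = analysis.get_best_trial(metric="metric_score", mode="min", scope="last")
print(best.last_result["metric_score"])
```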

ShreyaR commented 2 years ago

@arnavgarg1 this is an interesting find, and it seems like it might be an issue with Tune. It is Ray Tune (via tune.run()) that maintains the state of all trials and selects the best trial based on the reported metrics.

Re: not being able to reproduce the issue -- that's not unexpected. I'd observed this issue on a very large-scale dataset after training for 4-5 hours.

It would make sense to hold off on running an experiment to repro this until we have access to more cost-effective GPUs.

ShreyaR commented 2 years ago

The issue @arnavgarg1 uncovered was fixed in https://github.com/ray-project/ray/pull/26943.

Next steps: Run a long-running hyperopt experiment on a large dataset and try reproducing the issue.
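One way to check for the mismatch on such a run, sketched under the assumption that Ludwig writes its usual hyperopt_statistics.json (file and key names may differ by version), would be to compare the metric scores recorded there against the best values seen in each trial's evaluation logs:

```python
# Hedged sketch: read the hyperopt summary and print each trial's recorded
# metric_score so it can be compared against the per-trial evaluation logs.
# The file name and keys ("hyperopt_results", "parameters", "metric_score")
# are assumptions based on Ludwig's typical output and may differ by version.
import json

with open("results/hyperopt_statistics.json") as f:
    stats = json.load(f)

for trial in stats["hyperopt_results"]:
    print(trial["parameters"], trial["metric_score"])
```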