ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

[Hyperopt] Eval results not being surfaced in tune #2143

Open ShreyaR opened 2 years ago

ShreyaR commented 2 years ago

Context: In the evaluation logs printed by one specific trial I had seen a higher AUROC (> 0.8) on a dataset (screenshot 1), but that higher AUROC did not show up in the overall summary for the hyperopt experiment (screenshot 2). Note that both screenshots are from a separate experiment; I added them to give examples.

Findings

Screenshot 1: per-trial evaluation logs

Screenshot 2: overall hyperopt experiment summary
arnavgarg1 commented 2 years ago

@ShreyaR I started investigating this issue. To test, I'm using the equivalent 100MB dataset and running hyperopt with 8 trials. So far I haven't seen the issue crop up, and I've been monitoring the evaluation logging for each trial as well as the overall best trial. Is there anything I can do to help reproduce it?
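For reference, a minimal sketch of the kind of run used for this investigation, assuming the Python API `ludwig.hyperopt.run.hyperopt`; the features, metric, search space, and dataset path are placeholders rather than the actual setup, and the exact `hyperopt` config schema may differ between Ludwig versions:

```python
# Hedged sketch: placeholder config for monitoring per-trial eval metrics
# during a hyperopt run. The feature names, metric, search space, and dataset
# path are assumptions, not the actual setup used in this investigation.
from ludwig.hyperopt.run import hyperopt

config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "label", "type": "binary"}],
    "trainer": {"epochs": 3},
    "hyperopt": {
        "goal": "maximize",
        "metric": "roc_auc",
        "output_feature": "label",
        "parameters": {
            "trainer.learning_rate": {
                "space": "loguniform",
                "lower": 1e-5,
                "upper": 1e-2,
            }
        },
        "executor": {"type": "ray", "num_samples": 8},
    },
}

# Each trial prints its own evaluation metrics; the returned results hold the
# per-trial metric scores that the experiment summary is built from.
results = hyperopt(config, dataset="dataset.csv", output_directory="results")
```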

Here's something I noticed while playing around:

When the goal is set to minimize and the metric value becomes very small, the smallest value does not seem to be picked. To reproduce, run test_hyperopt_run_hyperopt within test_hyperopt.py with 3 epochs until the loss value is very small (almost 0). Have you seen this before?

Screenshot (2022-06-22): trial output where the smallest loss value is not selected as the best.

I initially thought this might be because of an overflow, but the type for metric_score is float64, so I doubt that's the cause. If I understand correctly, this output is produced by tune.run() automatically based on the logging verbosity, so I'm not sure whether there's something within Ray Tune that's buggy or causing this behavior, since tune.run() is what returns both the ExperimentAnalysis and the printed status messages.
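To make the moving parts concrete, here is a rough, self-contained sketch of the Ray Tune surface being referred to: tune.run() prints the trial status table (depending on verbosity) and returns an ExperimentAnalysis, which selects the best trial from the metrics each trial reported. The trainable and metric name below are illustrative assumptions, not Ludwig's actual trainable.

```python
# Illustrative sketch (not Ludwig's actual trainable): tune.run() both prints
# the trial status table and returns the ExperimentAnalysis used to pick the
# best trial from the reported metric values.
from ray import tune


def trainable(config):
    # Report a dummy metric; in Ludwig, the per-trial eval metric is reported
    # back to Tune in a similar way.
    tune.report(metric_score=config["lr"])


analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-6, 1e-2)},
    num_samples=8,
    verbose=2,  # controls the printed status messages/table
)

# Best-trial selection happens here; `scope` controls whether the last
# reported value or the best value over the whole trial is compared.
best = analysis.get_best_trial(metric="metric_score", mode="min", scope="last")
print(best.last_result["metric_score"])
```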

ShreyaR commented 2 years ago

@arnavgarg1 this is an interesting find, and it seems like it might be an issue with Tune. It is Ray Tune (via tune.run()) that maintains the state of all trials and selects the best trial based on the reported metrics.

Re: not being able to reproduce the issue -- that's not unexpected. I'd observed this issue on a very large-scale dataset after training for 4-5 hours.

It would make sense to hold off on running an experiment to repro this until we have access to more cost-effective GPUs.

ShreyaR commented 2 years ago

The issue @arnavgarg1 uncovered was fixed in https://github.com/ray-project/ray/pull/26943.

Next steps: Run a long-running hyperopt experiment on a large dataset and try reproducing the issue.
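One way to check for the mismatch on such a run, sketched under the assumption that Ludwig writes its usual hyperopt_statistics.json (file and key names may differ by version), would be to compare the metric scores recorded there against the best values seen in each trial's evaluation logs:

```python
# Hedged sketch: read the hyperopt summary and print each trial's recorded
# metric_score so it can be compared against the per-trial evaluation logs.
# The file name and keys ("hyperopt_results", "parameters", "metric_score")
# are assumptions based on Ludwig's typical output and may differ by version.
import json

with open("results/hyperopt_statistics.json") as f:
    stats = json.load(f)

for trial in stats["hyperopt_results"]:
    print(trial["parameters"], trial["metric_score"])
```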