iterative / studio-support

❓ DVC Studio Issues, Question, and Discussions
https://studio.iterative.ai

`dvclive` stops sending data to studio during training #93

Closed jordaneliastam closed 1 year ago

jordaneliastam commented 1 year ago

Every time I run `dvc exp run`, I eventually start to see the following warning:

WARNING:dvc_studio_client.post_live_metrics:Failed to post to Studio: Out of range float values are not JSON compliant

`dvclive`/Studio never seems to recover, and the data on Studio stops updating. I am still able to view the data in the DVC VS Code extension, however.

I am using a slurm cluster.

shcheklein commented 1 year ago

It's related to this https://stackoverflow.com/questions/38821132/bokeh-valueerror-out-of-range-float-values-are-not-json-compliant

A quick workaround is to clean up the data so that it contains no non-finite values (NaN, Infinity).
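To illustrate the workaround, here is a minimal sketch using only the Python standard library. It assumes the metrics are held in a plain dict before logging; the helper names `is_json_safe` and `drop_non_finite` are hypothetical and not part of DVCLive or the Studio client. With `allow_nan=False`, the standard `json` module raises a `ValueError` for NaN and Infinity, which is the kind of failure reported in the warning above.

```python
import json
import math


def is_json_safe(value):
    """Return True unless `value` is a float that strict JSON cannot represent."""
    return not isinstance(value, float) or math.isfinite(value)


def drop_non_finite(metrics):
    """Return a copy of `metrics` without NaN / Infinity entries (hypothetical helper)."""
    return {name: value for name, value in metrics.items() if is_json_safe(value)}


metrics = {"loss": 0.25, "grad_norm": float("inf"), "lr": float("nan")}

# Strict JSON serialization rejects non-finite floats.
try:
    json.dumps(metrics, allow_nan=False)
except ValueError as err:
    print(f"strict JSON rejected the payload: {err}")

# After cleaning, the same call succeeds.
clean = drop_non_finite(metrics)
print(json.dumps(clean, allow_nan=False))
```

Dropping the values is the simplest option; whether to drop, clamp, or replace them depends on what the metric is used for.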

In the VS Code extension, we use the JSON5 library to parse JS-compatible JSON with some additional values such as NaN.

I wonder if we could change the workload type to text and use something like JSON5 on the backend to parse it, to make it more lax. It might have implications down the road (e.g. the way we serialize things into the DB: does it support NaNs, etc.? And the way we return the results to the FE). @amritghimire, do you know if we already support this more lax format? E.g. what happens if there is a repo with a NaN in the plots data?

Even if we decide to keep it strict, we should probably detect this early and show a proper message for it.

shcheklein commented 1 year ago

Also, @daavoo for visibility.

daavoo commented 1 year ago

> Also, @daavoo for visibility.

As a quick solution, I am going to patch it on the DVCLive side by casting the invalid values to strings.
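For readers wondering what such a cast could look like, here is a rough sketch. It is not the actual DVCLive patch: the function name `stringify_non_finite` is hypothetical, and it assumes the payload is made of nested dicts, lists, and scalars.

```python
import json
import math


def stringify_non_finite(obj):
    """Recursively replace NaN / Infinity floats with their string form (hypothetical helper)."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return str(obj)  # "nan", "inf", or "-inf"
    if isinstance(obj, dict):
        return {key: stringify_non_finite(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [stringify_non_finite(value) for value in obj]
    return obj


payload = {"step": 3, "metrics": {"grad_norm": float("inf"), "loss": [0.5, float("nan")]}}
safe_payload = stringify_non_finite(payload)

# The converted payload now passes strict JSON serialization.
encoded = json.dumps(safe_payload, allow_nan=False)
print(encoded)
```

The trade-off is that consumers receive a string where they expect a number, so anything plotting or aggregating the values has to handle that case.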

jordaneliastam commented 1 year ago

I was logging my gradient norms and hadn't seen any issues in the dvclive plots, but after adding a `torch.isfinite()` guard I saw that they were occasionally returning inf. So, just confirming your assessment of the issue!
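A guard of this kind can be sketched at the logging call site as follows. This version uses `math.isfinite` on plain Python floats so that it runs without PyTorch; with tensors, `torch.isfinite` plays the same role. The `log_finite` helper and the `log_fn` callback are hypothetical stand-ins for whatever logging call the training loop uses.

```python
import math


def log_finite(log_fn, name, value, step, skipped):
    """Log `value` only if it is finite; otherwise record it in `skipped`."""
    value = float(value)
    if math.isfinite(value):
        log_fn(name, value)
        return True
    skipped.append((step, name, value))
    return False


logged = {}   # stand-in for the real metrics logger
skipped = []  # (step, name, value) for every non-finite value seen


def record(name, value):
    logged.setdefault(name, []).append(value)


grad_norms = [0.8, 1.3, float("inf"), 0.9]  # illustrative values only
for step, norm in enumerate(grad_norms):
    log_finite(record, "grad_norm", norm, step, skipped)

print(f"logged: {logged}")
print(f"skipped steps: {[step for step, _, _ in skipped]}")
```

Keeping a record of the skipped steps makes it easier to see how often the gradients blow up, rather than silently hiding the problem.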

thanks!