Occasional crash in distributed training and prediction of DLT models due to pystan

uber / orbit

A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

https://orbit-ml.readthedocs.io/en/stable/

Occasional crash in distributed training and prediction of DLT models due to pystan #787

Closed: ggerogiokas closed this issue 10 months ago

ggerogiokas commented 2 years ago

When I run roughly 10,000 different time series, some of the runs crash. The crashes almost always come down to one of two errors:

"pickle data was truncated" or "Ran out of input".

Both seem to be related to Stan model compilation.
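For context, both messages are what Python's pickle module reports when it is given incomplete data. The minimal sketch below reproduces them with the standard library only. The payload is a placeholder, not an actual compiled Stan model. A link to a cache file being read while another process is still writing it is an assumption, not something confirmed in this thread.

```python
import pickle

# Placeholder payload standing in for a pickled object (not a real Stan model).
payload = pickle.dumps(b"x" * 1000)


def try_load(data: bytes) -> str:
    """Attempt to unpickle `data` and report which error, if any, is raised."""
    try:
        pickle.loads(data)
        return "ok"
    except EOFError as exc:
        # Raised when the stream ends where the next opcode should start,
        # for example when the input is empty.
        return f"EOFError: {exc}"
    except pickle.UnpicklingError as exc:
        # Raised when the stream ends in the middle of an object's data.
        return f"UnpicklingError: {exc}"


print(try_load(payload))                      # complete data
print(try_load(b""))                          # empty input
print(try_load(payload[: len(payload) // 2]))  # cut off halfway
```

On CPython the empty input reports "Ran out of input" and the half-written payload reports "pickle data was truncated", which are the two messages quoted above.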

Any tips on how to avoid these issues?

I am running on Ubuntu with Python 3.8, so I don't think it's an OS issue.

edwinnglabs commented 2 years ago

Can you provide some data or an object snapshot from when the issue happens? @ggerogiokas

ggerogiokas commented 2 years ago

Hi @edwinnglabs

I managed to find a workaround. Every time the cluster starts up, I run every flavour (DLT, ETS, LGT) of Orbit model once. That seems to cache all the Stan models I need, and there are no longer any Stan compilation errors when I run multiple Orbit models in parallel.
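The warm-up step described above could look roughly like the sketch below. It is not the poster's actual code: `warm_up`, `factories`, and `sample_df` are hypothetical names, and the Orbit usage in the trailing comments assumes illustrative column names `ds` and `y`. The idea is just to fit each model flavour once, serially, before any parallel workers start.

```python
from typing import Any, Callable, Dict, List


def warm_up(factories: Dict[str, Callable[[], Any]], sample_df: Any) -> List[str]:
    """Fit each model flavour once, one after another.

    `factories` maps a label to a zero-argument callable that builds a model
    exposing a `fit(df=...)` method. Running this before launching parallel
    jobs means any one-off compile or cache step has finished by the time the
    workers start. Returns the labels that were warmed up, in order.
    """
    warmed = []
    for name, make_model in factories.items():
        model = make_model()
        model.fit(df=sample_df)  # result is discarded; only the side effect matters
        warmed.append(name)
    return warmed


# Hypothetical usage with Orbit (column names are assumptions):
#
# from orbit.models import DLT, ETS, LGT
#
# factories = {
#     "DLT": lambda: DLT(response_col="y", date_col="ds"),
#     "ETS": lambda: ETS(response_col="y", date_col="ds"),
#     "LGT": lambda: LGT(response_col="y", date_col="ds"),
# }
# warm_up(factories, sample_df)  # sample_df: a small DataFrame with ds and y
```

Running this once per node at cluster start-up, rather than inside each worker, keeps the warm-up itself from racing with other processes.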

Now I have issues getting good CPU utilisation, but I guess I can raise that in a new issue.