mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Add Retries to run_query #1302

Closed. KuuCi closed this 1 week ago

KuuCi commented 1 week ago

Fine-tuning runs have failed a couple of times because Spark timed out, and each timeout failed the entire run. This PR adds retries to run_query to absorb transient timeouts and hopefully reduce how often that happens.
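For readers without the diff at hand, here is a minimal sketch of the general retry-with-backoff pattern. It is not the code from this PR: the helper name `with_retries`, its parameters, the choice of `TimeoutError` as the retryable exception, and the exponential backoff schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the listed exceptions with exponential backoff.

    Hypothetical helper: the names and defaults here are illustrative,
    not taken from llm-foundry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Wait base_delay_s, then 2x, 4x, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A caller would wrap the query in a zero-argument callable, for example `with_retries(lambda: run_query(...))`, where the arguments to `run_query` are left elided because its signature is not shown here. Passing `sleep` as a parameter keeps the helper testable without real delays.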

https://databricks.atlassian.net/browse/MCLOUD-4793

Here is a passing run: test-uc-mlflow-66pcq4

KuuCi commented 1 week ago

Passing Run: test-uc-mlflow-WyKSz4