mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0
3.99k stars, 525 forks

Add CLI for train.py #1337

Closed KuuCi closed 2 months ago

KuuCi commented 3 months ago

This PR allows users to call composer llm-foundry train {YAML_PATH} {ARGS} while behaving the same as composer llm-foundry/train.py {PATH} {ARGS}. The motivation is DLE, where we want the CLI in the Docker images to be much more intuitive.
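For illustration only, here is a minimal sketch of how a `train` subcommand could forward a YAML path and extra arguments to a training entry point. It assumes plain `argparse`; the names `train_from_yaml`, `build_parser`, and `main` are hypothetical and are not taken from this PR, whose actual implementation may differ.

```python
import argparse
from typing import List, Optional, Sequence


def train_from_yaml(yaml_path: str, overrides: Sequence[str]) -> dict:
    """Hypothetical stand-in for the real training entry point.

    A real implementation would load the YAML, apply the overrides,
    and start training. Here we only echo the inputs back.
    """
    return {"yaml_path": yaml_path, "overrides": list(overrides)}


def build_parser() -> argparse.ArgumentParser:
    # Top-level parser with a single "train" subcommand (assumed layout).
    parser = argparse.ArgumentParser(prog="llmfoundry")
    subparsers = parser.add_subparsers(dest="command", required=True)
    train = subparsers.add_parser("train", help="Train from a YAML config")
    train.add_argument("yaml_path", help="Path to the training YAML")
    train.add_argument("args", nargs="*", help="Optional key=value overrides")
    return parser


def main(argv: Optional[List[str]] = None) -> dict:
    # argv=None makes argparse read sys.argv[1:], as a console script would.
    ns = build_parser().parse_args(argv)
    if ns.command == "train":
        return train_from_yaml(ns.yaml_path, ns.args)
    raise SystemExit(f"unknown command: {ns.command}")
```

Calling `main(["train", "/mnt/config/parameters.yaml"])` exercises the same path a console-script entry point would take. Keeping the training logic in one function means the subcommand and any older script can share it.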

Testing:

test-cli-cSn2Rb runs: composer -c -n 8 llmfoundry train /mnt/config/parameters.yaml || (echo "Command failed - killing python" && pkill python && exit 1)

test-cli-qsRHEI runs: composer -c llmfoundry train /mnt/config/parameters.yaml || (echo "Command failed - killing python" && pkill python && exit 1)

test-cli-vGpXcw runs: composer train/train.py /mnt/config/parameters.yaml || (echo "Command failed - killing python" && pkill python && exit 1)

Here is the MLflow experiment folder showing that all three runs behave the same: https://dbc-04ac0685-8857.staging.cloud.databricks.com/ml/experiments/3707544126254710?o=3360802220363900&searchFilter=&orderByKey=attributes.start_time&orderByAsc=false&startTime=ALL&lifecycleFilter=Active&modelVersionFilter=All+Runs&datasetsFilter=W10%3D

b-chu commented 2 months ago

This seems like a breaking change. Do we have a deprecation plan for existing mcli yamls? I think a lot of people call composer scripts/train/train.py right now.

KuuCi commented 2 months ago

We aren't deleting scripts/train/train.py; it just calls train/train.py now. Here is a run showing that the existing workflow still works: test-cli-ZzkqPt runs: composer train/train.py /mnt/config/parameters.yaml || (echo "Command failed - killing python" && pkill python && exit 1)
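As a rough sketch of the backward-compatibility idea described here, a legacy script can stay in place as a thin wrapper that parses its old-style arguments and delegates to the shared entry point. The function names below (`train_from_yaml`, `split_legacy_argv`, `legacy_main`) are hypothetical and only illustrate the pattern; they are not the code in this PR.

```python
from typing import List, Sequence, Tuple


def train_from_yaml(yaml_path: str, overrides: Sequence[str]) -> dict:
    """Hypothetical shared entry point; a real one would start training."""
    return {"yaml_path": yaml_path, "overrides": list(overrides)}


def split_legacy_argv(argv: Sequence[str]) -> Tuple[str, List[str]]:
    """Split `[script, yaml_path, *overrides]` into its parts."""
    if len(argv) < 2:
        raise SystemExit("usage: train.py YAML_PATH [ARGS...]")
    return argv[1], list(argv[2:])


def legacy_main(argv: Sequence[str]) -> dict:
    # The old script keeps its command-line shape but owns no training logic;
    # it forwards everything to the shared function.
    yaml_path, overrides = split_legacy_argv(argv)
    return train_from_yaml(yaml_path, overrides)
```

In a real wrapper, `legacy_main(sys.argv)` would be called under an `if __name__ == "__main__":` guard, so existing YAMLs that invoke the old path keep working unchanged.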

b-chu commented 2 months ago

Ah, thanks for pointing that out. I'll give a more detailed review later

KuuCi commented 2 months ago

Will update to match the scripts/train/train.py merges after the first pass.

KuuCi commented 2 months ago

manual test runs updated