mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Change TE docker image to enable te_shard_weight #1251

Closed j316chuck closed 1 month ago

j316chuck commented 1 month ago

Description

Change the foundry Docker images to use a fork of TE (Transformer Engine) that provides prepare_te_modules_for_fsdp

Issues Fixed:

chuck-7b-starcoder2x-fp8-run1-ENdcdf fails with:

[rank0]: ImportError: cannot import name 'prepare_te_modules_for_fsdp' from 'transformer_engine.pytorch.distributed' (/usr/lib/python3/dist-packages/transformer_engine/pytorch/distributed.py)

when te_shard_weight: true

To fix this, we need to pin a branch of TE that provides this function in order to get the 700 TFLOPS numbers
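
A quick way to confirm whether a given image has a TE build with this function is to probe for it before launching a run. The sketch below is illustrative only and is not part of this PR: the helper names module_has_attr and te_supports_shard_weight are hypothetical, and the only names taken from the traceback above are the module path and the function name.

```python
import importlib


def module_has_attr(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parents) is not installed in this environment.
        return False
    return hasattr(module, attr)


def te_supports_shard_weight() -> bool:
    """Hypothetical helper: check for the function named in the ImportError above."""
    return module_has_attr(
        "transformer_engine.pytorch.distributed",
        "prepare_te_modules_for_fsdp",
    )


if __name__ == "__main__":
    if te_supports_shard_weight():
        print("TE build exposes prepare_te_modules_for_fsdp")
    else:
        print("TE build lacks prepare_te_modules_for_fsdp; te_shard_weight: true would fail")
```

Running this inside the container should report whether the installed TE matches what te_shard_weight: true expects, without waiting for a training job to hit the ImportError.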