ImportError in LLaMA Training Script

viai957 commented 1 week ago

Running the LLaMA training script with the following command raises an ImportError:

CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh

The specific error message is:

ImportError: cannot import name 'Partial' from 'torch.distributed._tensor' (/apps/torchtitan/torchtitan/lib/python3.10/site-packages/torch/distributed/_tensor/__init__.py)

The training script should start without any import errors and utilize the specified configuration file to train the model across 8 GPUs.

The script fails to run due to an ImportError indicating that Partial cannot be imported from torch.distributed._tensor. The error traceback is as follows:

Traceback (most recent call last):
  File "/apps/torchtitan/train.py", line 34, in <module>
    from torchtitan.models import model_name_to_cls, model_name_to_tokenizer, models_config
  File "/apps/torchtitan/torchtitan/models/__init__.py", line 7, in <module>
    from torchtitan.models.llama import llama2_configs, llama3_configs, Transformer
  File "/apps/torchtitan/torchtitan/models/llama/__init__.py", line 10, in <module>
    from torchtitan.models.llama.model import ModelArgs, Transformer
  File "/apps/torchtitan/torchtitan/models/llama/model.py", line 17, in <module>
    from torchtitan.models.norms import create_norm
  File "/apps/torchtitan/torchtitan/models/norms.py", line 17, in <module>
    from torch.distributed._tensor import Partial, Replicate, Shard
ImportError: cannot import name 'Partial' from 'torch.distributed._tensor' (/apps/torchtitan/torchtitan/lib/python3.10/site-packages/torch/distributed/_tensor/__init__.py)

kwen2501 commented 1 week ago

Partial used to be named _Partial; it was only recently made public. Upgrading your PyTorch version should resolve this import error. Sorry about the break.
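For anyone who cannot upgrade right away, the rename described above could in principle be bridged with a fallback import. The following is only a sketch: the helper `import_first_available` is hypothetical, and the older location `torch.distributed._tensor.placement_types._Partial` is an assumption about earlier releases, not something confirmed in this thread. Aliasing the name may also not be enough if torchtitan depends on other recent PyTorch changes.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that can be imported from a list of
    (module_name, attribute_name) pairs, trying them in order."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    tried = ", ".join(f"{m}.{a}" for m, a in candidates)
    raise ImportError(f"none of the candidates could be imported: {tried}")


# Hypothetical usage for this issue (second location is an assumption):
# Partial = import_first_available([
#     ("torch.distributed._tensor", "Partial"),
#     ("torch.distributed._tensor.placement_types", "_Partial"),
# ])
```

The candidates are tried in order, so the public name wins whenever it exists and the private one is only a fallback.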

viai957 commented 1 week ago

I tried upgrading PyTorch to torch >= 2.3.1, but the problem persists:

export USE_LIBUV=1
USE_LIBUV=1
TRAINER_DIR=/home/vignesh/local/torchtitan
NGPU=8
LOG_RANK=0
CONFIG_FILE=./train_configs/llama3_8b.toml
overrides=
'[' 0 -ne 0 ']'
torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_8b.toml
W0621 07:06:22.450000 140470868342592 torch/distributed/run.py:757]
W0621 07:06:22.450000 140470868342592 torch/distributed/run.py:757]
W0621 07:06:22.450000 140470868342592 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0621 07:06:22.450000 140470868342592 torch/distributed/run.py:757]
[rank0]:Traceback (most recent call last):
[rank0]:  File "/apps/torchtitan/train.py", line 34, in <module>
[rank0]:    from torchtitan.models import model_name_to_cls, model_name_to_tokenizer, models_config
[rank0]:  File "/apps/torchtitan/torchtitan/models/__init__.py", line 7, in <module>
[rank0]:    from torchtitan.models.llama import llama2_configs, llama3_configs, Transformer
[rank0]:  File "/apps/torchtitan/torchtitan/models/llama/__init__.py", line 10, in <module>
[rank0]:    from torchtitan.models.llama.model import ModelArgs, Transformer
[rank0]:  File "/apps/torchtitan/torchtitan/models/llama/model.py", line 17, in <module>
[rank0]:    from torchtitan.models.norms import create_norm
[rank0]:  File "/apps/torchtitan/torchtitan/models/norms.py", line 17, in <module>
[rank0]:    from torch.distributed._tensor import Partial, Replicate, Shard
[rank0]:ImportError: cannot import name 'Partial' from 'torch.distributed._tensor' (/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/_tensor/__init__.py)
E0621 07:06:28.036000 140470868342592 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 68097) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/home/vignesh/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vignesh/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

  exitcode : 1 (pid: 68102)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time : 2024-06-21_07:06:27
  host : llama3.trustt.com
  rank : 6 (local_rank: 6)
  exitcode : 1 (pid: 68103)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time : 2024-06-21_07:06:27
  host : llama3.trustt.com
  rank : 7 (local_rank: 7)
  exitcode : 1 (pid: 68104)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2024-06-21_07:06:27
  host : llama3.trustt.com
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 68097)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
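Note that the two tracebacks in this thread load torch from different site-packages directories (one under /apps/torchtitan/torchtitan/lib, one under /home/vignesh/.local/lib), so it can help to confirm which installation the interpreter actually resolves. A minimal diagnostic sketch, where the helper name `locate_package` is made up for illustration:

```python
import importlib.util
import sys


def locate_package(name):
    """Return the file a top-level package would be imported from,
    or None if it cannot be found. The package itself is not imported."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


# Show which interpreter is running and which torch it would pick up.
print("interpreter:", sys.executable)
print("torch resolves to:", locate_package("torch"))
```

Running this with the same Python that torchrun launches shows whether an upgrade landed in the environment that is actually being used.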

awgu commented 1 week ago

@viai957 Unfortunately, I think you need to use a nightly release, not simply torch >= 2.3.1. The challenge is that much of the code in torchtitan relies on code that is still changing in the PyTorch repo, so torchtitan generally requires a nightly version.

See the Preview (Nightly) option in https://pytorch.org/get-started/locally/.
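To check whether the installed build is a nightly, one rough heuristic is to look at the version string. This sketch assumes nightly wheels carry a `.devYYYYMMDD` segment in their version (for example `2.5.0.dev20240621+cu121`, an illustrative value rather than one from this thread); the function name `is_nightly_build` is hypothetical.

```python
import re


def is_nightly_build(version):
    """Return True if a version string contains a .devYYYYMMDD segment,
    which is assumed here to mark a nightly (preview) build."""
    return re.search(r"\.dev\d{8}", version) is not None


# Hypothetical usage once torch is importable:
# import torch
# print(torch.__version__, is_nightly_build(torch.__version__))
```

A stable release such as 2.3.1 has no such segment, so the check returns False for it.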

pytorch / torchtitan

ImportError in LLaMA Training Script #412