lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

[Fine-Tuning Fail]: Problem Running FastChat T5 fine-tuning #2465

Open pcchen-ntunlp opened 1 year ago

pcchen-ntunlp commented 1 year ago

I'm attempting to fine-tune FastChat T5 locally using the command:

torchrun --nproc_per_node=1 --master_port=9778 fastchat/train/train_flant5.py \
    --model_name_or_path {my_path}/test_fastchat/fastchat-t5-3b-v1.0 \
    --data_path ./data/dummy_conversation.json \
    --bf16 True \
    --output_dir ./checkpoints_flant5_3b \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap T5Block \
    --tf32 True \
    --model_max_length 2048 \
    --preprocessed_path ./preprocessed_data/processed.json \
    --gradient_checkpointing True

However, during the execution, I encounter the following traceback:

WARNING:root:Loading data...
WARNING:root:Formatting inputs...
Traceback (most recent call last):
  File "{my_path}/test_fastchat/FastChat/fastchat/train/train_flant5.py", line 436, in <module>
    train()
  File "{my_path}/test_fastchat/FastChat/fastchat/train/train_flant5.py", line 422, in train
    data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{my_path}/test_fastchat/FastChat/fastchat/train/train_flant5.py", line 383, in make_supervised_data_module
    train_dataset = dataset_cls(
                    ^^^^^^^^^^^^
  File "{my_path}/test_fastchat/FastChat/fastchat/train/train_flant5.py", line 301, in __init__
    data_dict = preprocess(sources, tokenizer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{my_path}/test_fastchat/FastChat/fastchat/train/train_flant5.py", line 234, in preprocess
    header = f"{default_conversation.system}\n\n"
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Conversation' object has no attribute 'system'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2294534 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2294533) of binary: {my_home}/anaconda3/envs/fast-chat/bin/python
Traceback (most recent call last):
  File "{my_home}/anaconda3/envs/fast-chat/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "{my_home}/anaconda3/envs/fast-chat/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "{my_home}/anaconda3/envs/fast-chat/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "{my_home}/anaconda3/envs/fast-chat/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "{my_home}/anaconda3/envs/fast-chat/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "{my_home}/anaconda3/envs/fast-chat/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Environment:

OS: Ubuntu 22.04.2 LTS
Device: NVIDIA RTX A6000
PyTorch version: 2.0.1
CUDA version: 11.7

Expected Behavior:

The training process should execute without raising any AttributeError.

Actual Behavior:

The training halts due to the AttributeError related to the missing 'system' attribute in the Conversation object.
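
To make the failure concrete, here is a minimal sketch that reproduces the same AttributeError on a stand-in class and shows one defensive way to read the system prompt. It rests on an assumption that is not confirmed anywhere in this thread: that the installed Conversation class stores the system prompt under a different attribute name, such as system_message. The Conversation class below is a hypothetical stand-in, not the FastChat class, and get_system_prompt is an illustrative helper that does not exist in FastChat.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """Hypothetical stand-in for fastchat.conversation.Conversation."""

    name: str
    # Assumed attribute name; the real class may differ.
    system_message: str = ""


def get_system_prompt(conv) -> str:
    """Return the system prompt under either attribute name, or '' if absent."""
    for attr in ("system", "system_message"):
        value = getattr(conv, attr, None)
        if isinstance(value, str):
            return value
    return ""


default_conversation = Conversation(
    name="demo", system_message="You are a helpful assistant."
)

# Direct access fails on the stand-in the same way the traceback reports.
error_text = ""
try:
    header = f"{default_conversation.system}\n\n"
except AttributeError as exc:
    error_text = str(exc)

# The defensive lookup builds the header without raising.
header = f"{get_system_prompt(default_conversation)}\n\n"

print(error_text)
print(repr(header))
```

Before changing anything in train_flant5.py, it is worth checking which attributes the installed class actually has, for example by printing dir(default_conversation) in the same environment.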

Additional Context:

This is my first attempt to train FastChat T5 on my local machine, and I followed the setup instructions as provided in the documentation. It's important to note that I have not made any modifications to any files and am just attempting to run the code to see if it can execute successfully.

How should I go about resolving this issue?

Memelank commented 11 months ago

Same here. Can anyone help?