huggingface / autotrain-advanced · 🤗 AutoTrain Advanced
https://huggingface.co/autotrain

[BUG] No such file or directory: adapter_model.bin when training with a validation dataset #695

Closed BRM10213 closed 1 week ago

BRM10213 commented 1 week ago

Prerequisites

Backend

Local

Interface Used

CLI

CLI Command

```bash
nohup autotrain llm \
  --train \
  --model 'meta-llama/Meta-Llama-3-70B-Instruct' \
  --project-name 'Llama-3-70B-Instruct-001' \
  --data-path '/opt/dataset' \
  --train-split 'train' \
  --valid-split 'validation' \
  --epochs 8 \
  --lr 2e-4 \
  --text-column text \
  --peft \
  --eval-strategy epoch \
  --train-batch-size 1 \
  --mixed-precision fp16 \
  --quantization int4 \
  --trainer sft \
  --merge-adapter \
  --use-flash-attention-2 &
```

UI Screenshots & Parameters

No response

Error Logs

```
100%|██████████| 184/184 [23:20<00:00,  5.70s/it]
Loading best peft model from Llama-3-70B-Instruct-001/checkpoint-69 (score: 0.43450450897216797).
Loading best peft model from Llama-3-70B-Instruct-001/checkpoint-69 (score: 0.43450450897216797).
ERROR | 2024-07-01 16:05:55 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/autotrain/trainers/clm/__main__.py", line 28, in train
    train_sft(config)
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/autotrain/trainers/clm/train_clm_sft.py", line 56, in train
    trainer.train()
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/transformers/trainer.py", line 2427, in _inner_training_loop
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/transformers/trainer_callback.py", line 464, in on_train_end
    return self.call_event("on_train_end", args, state, control)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/transformers/trainer_callback.py", line 508, in call_event
    result = getattr(callback, event)(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/autotrain/trainers/clm/callbacks.py", line 36, in on_train_end
    adapters_weights = torch.load(best_model_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/torch/serialization.py", line 997, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/torch/serialization.py", line 444, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/train-env/lib/python3.11/site-packages/torch/serialization.py", line 425, in __init__
    super().__init__(open(name, mode))
    ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'Llama-3-70B-Instruct-001/checkpoint-69/adapter_model.bin'

ERROR | 2024-07-01 16:05:55 | autotrain.trainers.common:wrapper:121 - [Errno 2] No such file or directory: 'Llama-3-70B-Instruct-001/checkpoint-69/adapter_model.bin'
100%|██████████| 184/184 [23:23<00:00,  7.63s/it]
INFO  | 2024-07-01 16:05:59 | autotrain.cli.run_llm:run:350 - Job ID: 6159
```
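For context, the failing callback step is essentially a `torch.load()` on `<checkpoint>/adapter_model.bin`, while the checkpoint (see the listing below) only contains `adapter_model.safetensors`. The snippet below is just a minimal sketch of that mismatch plus a safetensors-aware fallback I would expect to work; the paths come from the log above, everything else is my assumption rather than AutoTrain's actual code.

```python
# Minimal sketch of the failing step and a safetensors-aware fallback.
# Paths are taken from the log above; the fallback is an assumption, not AutoTrain code.
import os

import torch
from safetensors.torch import load_file

best_checkpoint = "Llama-3-70B-Instruct-001/checkpoint-69"
bin_path = os.path.join(best_checkpoint, "adapter_model.bin")
safetensors_path = os.path.join(best_checkpoint, "adapter_model.safetensors")

if os.path.exists(bin_path):
    # What the on_train_end callback effectively does today: it assumes a .bin adapter file.
    adapters_weights = torch.load(bin_path)
elif os.path.exists(safetensors_path):
    # Recent PEFT versions save the adapter as safetensors instead, which is what this
    # checkpoint actually contains, hence the FileNotFoundError on the .bin path.
    adapters_weights = load_file(safetensors_path)
else:
    raise FileNotFoundError(f"No adapter weights found in {best_checkpoint}")
```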

Additional Information

When I list the contents of the checkpoint directory, I obtain the following files:

```
(train-env) azureuser@llama-3-70b-dev:/opt/huggingface/hub$ ls -altrh Llama-3-70B-Instruct-001/checkpoint-69/
total 2.4G
-rw-rw-r-- 1 azureuser azureuser  15K Jul  1 15:50 rng_state_1.pth
-rw-rw-r-- 1 azureuser azureuser  50K Jul  1 15:50 tokenizer_config.json
-rw-rw-r-- 1 azureuser azureuser  325 Jul  1 15:50 special_tokens_map.json
-rw-rw-r-- 1 azureuser azureuser 8.7M Jul  1 15:50 tokenizer.json
-rw-rw-r-- 1 azureuser azureuser 5.4K Jul  1 15:50 training_args.bin
-rw-rw-r-- 1 azureuser azureuser 2.1K Jul  1 15:50 trainer_state.json
-rw-rw-r-- 1 azureuser azureuser 1.1K Jul  1 15:50 scheduler.pt
-rw-rw-r-- 1 azureuser azureuser  15K Jul  1 15:50 rng_state_0.pth
-rw-rw-r-- 1 azureuser azureuser 1.6G Jul  1 15:50 optimizer.pt
drwxrwxr-x 2 azureuser azureuser 4.0K Jul  1 15:50 .
-rw-rw-r-- 1 azureuser azureuser 5.0K Jul  1 15:50 README.md
-rw-rw-r-- 1 azureuser azureuser 791M Jul  1 15:50 adapter_model.safetensors
-rw-rw-r-- 1 azureuser azureuser  888 Jul  1 15:50 pytorch_model.bin
-rw-rw-r-- 1 azureuser azureuser  739 Jul  1 15:50 adapt
```
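Since the adapter weights do exist as `adapter_model.safetensors`, I can work around the failed `--merge-adapter` step by loading and merging the adapter manually with PEFT. A rough sketch, assuming the truncated last entry above is a valid `adapter_config.json` and that loading the base model in fp16 is acceptable for the merge (both assumptions on my part):

```python
# Rough manual-merge workaround, assuming checkpoint-69 holds a valid adapter_config.json.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3-70B-Instruct"
checkpoint = "Llama-3-70B-Instruct-001/checkpoint-69"

# Load the base model, then attach the LoRA adapter saved in the checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, checkpoint)

# Fold the adapter weights into the base model, i.e. what --merge-adapter would have produced.
merged = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
merged.save_pretrained("Llama-3-70B-Instruct-001-merged")
tokenizer.save_pretrained("Llama-3-70B-Instruct-001-merged")
```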

abhishekkrthakur commented 1 week ago

llm doesn't support a validation set. This has been discussed many times :)