Open · mroberto166 opened 1 month ago
I have met a similar issue when trying to use `load_best_model_at_end` while training with FSDP in a multi-node, multi-GPU setup. The worker process tries to locate `pytorch_model_fsdp`, but it is only saved on the master process.
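For reference, a minimal sketch of the kind of configuration that triggers this on my side (argument values are illustrative, not my exact setup):

```python
# Illustrative sketch: Trainer arguments under which the best-model reload
# fails for me on multi-node FSDP. At the end of training,
# load_best_model_at_end makes every process look for the
# pytorch_model_fsdp checkpoint, but it was only written on the node
# hosting the main process.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",   # FSDP driven by the Trainer
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",         # must match eval strategy for best-model tracking
    save_steps=500,
    load_best_model_at_end=True,   # triggers the failing reload
)
```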
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

Information

Tasks

- An officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Hi everyone, I have recently been getting the following error when I try to load a model that was previously trained with FSDP and `SHARDED_STATE_DICT`:
```
0: Traceback (most recent call last):
0:   File "/model/backbone/train.py", line 781, in <module>
0:     main()
0:   File "/model/backbone/train.py", line 751, in main
0:     trainer = Trainer(
0:               ^^^^^^^^
0:   File "/model/backbone/train.py", line 291, in __init__
0:     self.checkpointer.load_checkpoint(
0:   File "/model/checkpointing/checkpointer.py", line 170, in load_checkpoint
0:     self.accelerator.load_state(input_dir=node_path)
0:   File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 3084, in load_state
0:     load_fsdp_model(self.state.fsdp_plugin, self, model, input_dir, i)
0:   File "/usr/local/lib/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 146, in load_fsdp_model
0:     dist_cp.load_state_dict(
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 31, in load_state_dict
0:     return _load_state_dict(state_dict, storage_reader, process_group, coordinator_rank, no_dist, planner)
0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 164, in _load_state_dict
0:     central_plan = distW.reduce_scatter("plan", local_step, global_step)
0:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 200, in reduce_scatter
0:     raise result
0: torch.distributed.checkpoint.api.CheckpointException: CheckpointException ranks:dict_keys([8, 9, 10, 11, 12, 13, 14, 15])
0: Traceback (most recent call last): (RANK 8)
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/utils.py", line 173, in reduce_scatter
0:     local_data = map_fun()
0:                  ^^^^^^^^^
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/state_dict_loader.py", line 150, in local_step
0:     metadata = storage_reader.read_metadata()
0:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/usr/local/lib/python3.11/site-packages/torch/distributed/checkpoint/filesystem.py", line 497, in read_metadata
0:     with (self.path / ".metadata").open("rb") as metadata_file:
0:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/usr/local/lib/python3.11/pathlib.py", line 1044, in open
0:     return io.open(self, mode, buffering, encoding, errors, newline)
0:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/fast/slurmwork/runs/2024-08-13-causal-husky-1/loading-checkpoint/node-1/pytorch_model_fsdp_0/.metadata'
0: Traceback (most recent call last): (RANK 9)
```
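For context, here is a minimal sketch of the save/load round trip (toy model and illustrative paths, not my actual training code), assuming an accelerate config with FSDP and `fsdp_state_dict_type: SHARDED_STATE_DICT`, launched on two nodes whose checkpoint directories are node-local rather than on a shared mount:

```python
# Minimal sketch (illustrative): save/load with FSDP + SHARDED_STATE_DICT,
# run via `accelerate launch` on two nodes writing to node-local paths.
import torch.nn as nn
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP settings come from the accelerate config
model = accelerator.prepare(nn.Linear(1024, 1024))  # stand-in for the real backbone

# Every rank writes its own shards, but torch.distributed.checkpoint
# writes .metadata only from the coordinator rank (global rank 0),
# so it only lands in node 0's directory.
accelerator.save_state("loading-checkpoint/node-local")

# On restart, the ranks on node 1 look for
# loading-checkpoint/node-local/pytorch_model_fsdp_0/.metadata
# and fail with FileNotFoundError.
accelerator.load_state("loading-checkpoint/node-local")
```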
This is the content of the folder `/mnt/fast/slurmwork/runs/2024-08-13-causal-husky-1/loading-checkpoint/node-1/pytorch_model_fsdp_0/` on the second node:

[screenshot of the directory listing, not recovered]

and this is the content of the same folder on the first node:

[screenshot of the directory listing, not recovered]
As you can see, the `.metadata` file is missing from the second node's folder, while it is present in the first. The model is saved with `accelerate.save_state(dir)` and loaded with `accelerate.load_state()`.
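My working theory (an assumption on my side, not something I have confirmed in the docs): `torch.distributed.checkpoint` writes shard files from every rank but writes the `.metadata` plan only from the coordinator rank, so with node-local checkpoint directories every node except the first is left without it. A hypothetical pre-flight check that at least makes the failure explicit:

```python
# Hypothetical pre-flight check (assumption: .metadata is only written by
# the coordinator rank). Note that copying .metadata over from node 0 is
# probably NOT a full fix, since the plan inside it can reference shard
# files written by ranks on other nodes; saving to a filesystem shared by
# all nodes sidesteps the problem entirely.
from pathlib import Path

ckpt = Path("/mnt/fast/slurmwork/runs/2024-08-13-causal-husky-1/"
            "loading-checkpoint/node-1/pytorch_model_fsdp_0")
if not (ckpt / ".metadata").exists():
    raise RuntimeError(
        f"{ckpt} has no .metadata; this node cannot plan the sharded load. "
        "Point input_dir at a path shared by all nodes instead."
    )
```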
Expected behavior
I would expect the code either to find the correct files or not to look for non-existent ones.