Open hidoba opened 2 months ago
I've also observed this in the log:
04/01/2024 00:35:47 - WARNING - accelerate.utils.other - Removed shared tensor {'proj_out.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
Any updates on this? I'm facing the same problem.
Here's a temporary fix, following https://huggingface.co/docs/safetensors/torch_shared_tensors

Modify `load_accelerator_state()` in https://github.com/huggingface/accelerate/blob/main/src/accelerate/checkpointing.py#L153:
```diff
-from safetensors.torch import load_file
+from safetensors.torch import load_model
 ...
         if input_model_file.exists():
-            state_dict = load_file(input_model_file, device=str(map_location))
+            load_model(models[i], input_model_file, device=str(map_location), **load_model_func_kwargs)
         else:
             # Load with torch
             input_model_file = input_dir.joinpath(f"{MODEL_NAME}{ending}.bin")
             state_dict = torch.load(input_model_file, map_location=map_location)
-        models[i].load_state_dict(state_dict, **load_model_func_kwargs)
+            models[i].load_state_dict(state_dict, **load_model_func_kwargs)
```
So I removed the `--overwrite_output_dir` flag to be able to resume training, and I'm getting the following error:

At the same time, the evaluation script works just fine with the same checkpoint.
I'm using Ubuntu 22 with an RTX 3090 Ti.