Error when run torch.load(ckpt_path, map_location="cpu") - Githubissues

modular-ml / wrapyfi-examples_llama

Inference code for facebook LLaMA models with Wrapyfi support

GNU General Public License v3.0


Error when run torch.load(ckpt_path, map_location="cpu") #6

Closed elricwan closed 1 year ago

elricwan commented 1 year ago

Hi there,

I have downloaded the LLaMA models, but when I try to load the model, I get this error: RuntimeError: PytorchStreamReader failed reading file data/2: invalid header or archive is corrupted

My PyTorch version is 1.13.1. Has the model version been updated? My downloaded files look like this:

[Image: Screenshot 2023-05-03 at 11 26 08 AM]

fabawi commented 1 year ago

I have the same PyTorch version (1.13.1+cu117) running with Python 3.8.12 on Ubuntu 20.04. You could check the model sizes on disk (consolidated.00.pth for 7B should be about 13.5 GB) to make sure your checkpoints are not corrupted. I doubt they updated their weights recently; otherwise, the changes would be reflected in the llama repository. Which GPU do you have?
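The size check suggested above can be scripted. The following is a minimal sketch, not part of this repository: it assumes the checkpoint was saved in PyTorch's zip-based format (which the PytorchStreamReader error suggests), and the function name check_checkpoint and the min_bytes threshold are hypothetical names introduced only for illustration.

```python
import os
import zipfile


def check_checkpoint(path, min_bytes=0):
    """Return (ok, message) for a checkpoint saved in PyTorch's zip-based format."""
    if not os.path.isfile(path):
        return False, f"{path}: file not found"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"{path}: only {size} bytes, expected at least {min_bytes}"
    # A truncated download usually loses the zip central directory at the end.
    if not zipfile.is_zipfile(path):
        return False, f"{path}: not a readable zip archive (truncated or corrupted?)"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and verifies its CRC; it returns
            # the name of the first bad member, or None if all are fine.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        return False, f"{path}: {exc}"
    if bad_member is not None:
        return False, f"{path}: CRC mismatch in member {bad_member}"
    return True, f"{path}: {size / 1e9:.2f} GB, archive looks intact"


if __name__ == "__main__":
    # Path taken from the command in this thread; the 13 GB floor is a loose
    # assumption based on the "about 13.5 GB" figure mentioned above.
    print(check_checkpoint("/checkpoints/7B/consolidated.00.pth", min_bytes=13 * 10**9))
```

Reading the whole archive takes a while for a file of this size. If the check fails, re-downloading the file and comparing it against the checksums shipped with the weights (if you have them) is probably the safest next step.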

elricwan commented 1 year ago

I use one NVIDIA GeForce RTX 3090, the memory is 24 G.

fabawi commented 1 year ago

My guess is that either your checkpoint files are corrupted, or your PyTorch build is incompatible with your card. Check that you can run other scripts with your installed PyTorch in the same env.
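A quick way to test the second possibility is a small script run in the same environment. This is only a sketch: the function name pytorch_sanity_check and the report keys are made up for illustration, and a passing result does not rule out every compatibility problem.

```python
def pytorch_sanity_check():
    """Collect basic facts about the installed PyTorch and try one small GPU op."""
    report = {"torch_installed": False, "cuda_available": False, "gpu_matmul_ok": False}
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        try:
            # A small matrix multiply forces a real kernel launch on the GPU.
            a = torch.randn(256, 256, device="cuda")
            b = a @ a
            torch.cuda.synchronize()
            report["gpu_matmul_ok"] = bool(torch.isfinite(b).all())
        except RuntimeError as exc:
            report["gpu_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in pytorch_sanity_check().items():
        print(f"{key}: {value}")
```

If gpu_matmul_ok comes back True, the PyTorch install can at least run a kernel on the card, which makes a corrupted checkpoint the more likely culprit.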

fabawi commented 1 year ago

> I use one NVIDIA GeForce RTX 3090, the memory is 24 G.

I'm assuming you are still running two instances of the 7B model. If you run a single instance (same as the original llama implementation), you'd need to set the number of wrapyfi devices to 0. If you are running multiple instances, does this occur on device_idx 0 or 1?

elricwan commented 1 year ago

I use this command to run on my first GPU:

CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir /checkpoints/7B --tokenizer_path /checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2

If the command is fine, then I guess my checkpoint files might be corrupted.

fabawi commented 1 year ago

Did you change to the directory where the checkpoint is?

fabawi commented 1 year ago

Also, note that the command expects the files in a directory called checkpoints, as indicated by /checkpoints/7B. You can change --ckpt_dir and --tokenizer_path to wherever those files are on your system.
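Before launching torchrun, it may help to confirm that both paths actually point at the files. A minimal sketch follows; the helper check_paths is hypothetical, and it assumes the checkpoint shards are the *.pth files directly inside the directory passed as --ckpt_dir.

```python
from pathlib import Path


def check_paths(ckpt_dir, tokenizer_path):
    """Return a list of problems with the two paths; an empty list means both look usable."""
    problems = []
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        problems.append(f"checkpoint directory not found: {ckpt_dir}")
    elif not sorted(ckpt_dir.glob("*.pth")):
        problems.append(f"no .pth files in {ckpt_dir}")
    if not Path(tokenizer_path).is_file():
        problems.append(f"tokenizer file not found: {tokenizer_path}")
    return problems


if __name__ == "__main__":
    # Paths copied from the command in this thread; adjust them to your system.
    for problem in check_paths("/checkpoints/7B", "/checkpoints/tokenizer.model"):
        print(problem)
```

Note that /checkpoints/7B is an absolute path from the filesystem root, so changing the working directory has no effect on it; only a relative path such as checkpoints/7B would depend on where the command is run.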