meta-llama / codellama

Inference code for CodeLlama models

torchrun --nproc_per_node 2 example_instructions.py --ckpt_dir CodeLlama-13b-Instruct/ --tokenizer_path CodeLlama-13b-Instruct/tokenizer.model --max_seq_len 8192 --max_batch_size 4 #60

Open alvynabranches opened 1 year ago

alvynabranches commented 1 year ago

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "/home/azureuser/codellama/example_instructions.py", line 68, in <module>
    fire.Fire(main)
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/azureuser/codellama/example_instructions.py", line 20, in main
    generator = Llama.build(
  File "/home/azureuser/codellama/llama/generation.py", line 90, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
Traceback (most recent call last):
  File "/home/azureuser/codellama/example_instructions.py", line 68, in <module>
    fire.Fire(main)
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/azureuser/.local/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/azureuser/codellama/example_instructions.py", line 20, in main
    generator = Llama.build(
  File "/home/azureuser/codellama/llama/generation.py", line 75, in build
    torch.cuda.set_device(local_rank)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14881) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/azureuser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_instructions.py FAILED

Failures:
[1]:
  time       : 2023-08-29_13:34:23
  host       : llm.internal.cloudapp.net
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 14882)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-08-29_13:34:23
  host       : llm.internal.cloudapp.net
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 14881)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

mhamra commented 1 year ago

I have the same problem. I reported it in issue #55.

Something goes wrong while loading the checkpoint file:

File "/home/azureuser/.local/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
GaganHonor commented 1 year ago

It seems you are hitting two distinct errors: one while loading the downloaded checkpoint and one while initializing CUDA devices. Here are a few suggestions to help you troubleshoot:

OMP_NUM_THREADS environment variable: the warning only says that torchrun sets OMP_NUM_THREADS to 1 per process by default, to avoid oversubscribing your CPUs. This is a performance hint, not the cause of the crash; you can tune the value for throughput, but changing it will not fix the failures below.
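As an illustration, the thread count can be overridden per run without exporting it globally (the value 4 here is an assumption; tune it to your core count and processes per node, the rest of the command matches the one in the issue title):

```shell
# Sketch: override torchrun's default of OMP_NUM_THREADS=1 for this run only.
OMP_NUM_THREADS=4 torchrun --nproc_per_node 2 example_instructions.py \
    --ckpt_dir CodeLlama-13b-Instruct/ \
    --tokenizer_path CodeLlama-13b-Instruct/tokenizer.model \
    --max_seq_len 8192 --max_batch_size 4
```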

Invalid load key error: the first traceback fails while loading a checkpoint file with _pickle.UnpicklingError: invalid load key, '<'. A leading '<' usually means the file begins with an HTML tag, i.e. an error or login page was saved in place of the weights because the download did not complete correctly. Re-download the checkpoint and verify its size (and checksum, if provided) before retrying.
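A minimal sketch of such a sanity check (the helper name is hypothetical): a real checkpoint written by torch.save starts with binary magic bytes, while a botched download that triggers this error usually starts with an HTML tag:

```python
# Sketch: detect an HTML error page masquerading as a .pth checkpoint.
# A file whose first non-whitespace byte is "<" is almost certainly HTML,
# which is exactly what produces: _pickle.UnpicklingError: invalid load key, '<'.

def looks_like_html(path: str) -> bool:
    """Return True if the file starts with '<' (likely an HTML error page)."""
    with open(path, "rb") as f:
        head = f.read(64)
    return head.lstrip().startswith(b"<")
```

If this returns True for a downloaded consolidated.*.pth file, re-run the download script; the file is not a usable checkpoint.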

CUDA error: invalid device ordinal: the second traceback fails in torch.cuda.set_device(local_rank). This error occurs when a process tries to select a CUDA device index that does not exist; with --nproc_per_node 2, rank 1 calls set_device(1), so the machine must expose at least two GPUs (and CUDA_VISIBLE_DEVICES must not hide them).
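A pre-flight check along these lines can make the mismatch explicit before workers crash (the function name is made up; in practice visible_gpus would come from torch.cuda.device_count()):

```python
# Sketch: "invalid device ordinal" means a worker called
# torch.cuda.set_device(local_rank) with local_rank >= the number of
# visible GPUs. Validate the launch configuration up front.

def check_world_size(nproc_per_node: int, visible_gpus: int) -> None:
    """Raise if torchrun would assign a local_rank with no matching GPU."""
    if nproc_per_node > visible_gpus:
        raise RuntimeError(
            f"--nproc_per_node {nproc_per_node} needs {nproc_per_node} GPUs, "
            f"but only {visible_gpus} are visible; ranks >= {visible_gpus} "
            "would hit 'CUDA error: invalid device ordinal'."
        )
```

Note that the 13b checkpoints ship as two shards, so the example commands expect two GPUs; on a single-GPU machine you would need a model whose shard count matches, or a loader adapted accordingly.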

Compile with TORCH_USE_CUDA_DSA: this last line is not a separate failure. It is a diagnostic hint that PyTorch appends to CUDA errors, suggesting a rebuild with device-side assertions enabled to pinpoint errors more precisely. Unless you are debugging CUDA kernels, you can ignore it.

In summary: re-download and validate the checkpoint file, make sure the number of processes matches the number of visible GPUs, treat the OMP_NUM_THREADS warning as a tuning hint only, and ignore the TORCH_USE_CUDA_DSA note unless you are debugging CUDA kernels. If the issue persists, share more details (GPU count, how the weights were downloaded) or check the repository's documentation and issue tracker.