I have the same problem. I reported it in issue #55.
There's something wrong with loading a file ...
File "/home/azureuser/.local/lib/python3.10/site-packages/torch/serialization.py", line 1033, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '<'.
You appear to be hitting two separate problems: a checkpoint file that cannot be unpickled (most likely a broken download) and a CUDA device-configuration error. Here are a few suggestions to help you troubleshoot:
OMP_NUM_THREADS environment variable: The warning message is informational, not an error. torchrun sets OMP_NUM_THREADS to 1 for each process by default to avoid overloading your system; you can raise it for better CPU throughput, but it is not the cause of the crash. A sketch of how to tune it is shown below.
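A minimal sketch, assuming a CPU-bound workload (the value 4 is an arbitrary example, not a recommendation from this repo):

```python
# OMP_NUM_THREADS is read when the OpenMP runtime initializes,
# so set it before importing torch.
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")  # tune to your core count

import torch

# Intra-op CPU parallelism can also be adjusted at runtime:
torch.set_num_threads(4)
print(torch.get_num_threads())
```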
Invalid load key error: The traceback shows torch.load failing with invalid load key, '<'. The character '<' is not a valid pickle or zip header; it is usually the first byte of an HTML page (an error or redirect page that a failed download saved to disk in place of the real weights). Re-download the checkpoint and verify it before loading; a quick check is sketched below.
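A minimal sketch of such a check (the checkpoint filename is a placeholder, not taken from your setup):

```python
# A genuine torch.save checkpoint starts with a zip header (b"PK") or a
# legacy pickle header; an HTML error page saved by a broken download
# starts with "<".
def looks_like_html(path: str) -> bool:
    with open(path, "rb") as f:
        head = f.read(64)
    return head.lstrip().startswith(b"<")

ckpt = "consolidated.00.pth"  # placeholder; substitute your checkpoint path
if looks_like_html(ckpt):
    print(f"{ckpt} begins with '<' (likely HTML) -- re-download the weights.")
```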
CUDA error: invalid device ordinal: This error occurs when the code tries to select a GPU index that does not exist on the machine, for example launching with --nproc_per_node=2 on a VM that exposes only one GPU (note that rank 1 is among the failures in your log). Make sure the number of worker processes does not exceed torch.cuda.device_count(); see the check after this paragraph.
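A minimal pre-flight check along these lines (torchrun exports LOCAL_RANK to each worker; everything else here is illustrative):

```python
import os
import torch

# torchrun sets LOCAL_RANK for every worker process it spawns.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
available = torch.cuda.device_count()
if local_rank >= available:
    raise RuntimeError(
        f"local_rank {local_rank} requested but only {available} CUDA device(s) "
        "are visible; lower --nproc_per_node or adjust CUDA_VISIBLE_DEVICES."
    )
torch.cuda.set_device(local_rank)
```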
Compile with TORCH_USE_CUDA_DSA: This line is generic boilerplate that PyTorch appends to CUDA errors; it suggests rebuilding PyTorch with device-side assertions enabled for debugging. It is a symptom of the invalid-device-ordinal error above, not a separate problem, and you can ignore it unless you need to debug CUDA kernels.
In summary: the OMP_NUM_THREADS warning is harmless, the checkpoint file is most likely a corrupted or incomplete download (re-fetch and verify it), and the crash itself comes from requesting a GPU index that does not exist, so match the number of launched processes to your GPU count. If the issue persists, please share the exact launch command and your torch.cuda.device_count() output, or seek assistance from the code's documentation or support channels.
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14881) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/azureuser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/azureuser/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
example_instructions.py FAILED
Failures:
  [1]:
    time       : 2023-08-29_13:34:23
    host       : llm.internal.cloudapp.net
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 14882)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]:
    time       : 2023-08-29_13:34:23
    host       : llm.internal.cloudapp.net
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 14881)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html