Hi @qmin2 - the error doesn't appear to come from DeepSpeed. The underlying error is an nvcc fatal error:
nvcc fatal : Unsupported NVHPC compiler found. nvc++ is the only NVHPC compiler that is supported.
I believe this means that either your nvcc or your nvc++ is outdated, so try updating them and then run again. If the problem persists, please share the versions of both.
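To make it easier to share those versions, here is a minimal sketch of a script that collects them in one place. The function name `collect_toolchain_report` and the choice of environment variables (`CC`, `CXX`, `CUDA_HOME`) are illustrative assumptions, not part of DeepSpeed; the variables are included only because they commonly influence which host compiler a CUDA extension build picks up.

```python
import os
import shutil
import subprocess


def collect_toolchain_report(tools=("nvcc", "nvc++", "gcc"),
                             env_vars=("CC", "CXX", "CUDA_HOME")):
    """Gather the path and `--version` output of each tool, plus selected env vars.

    Hypothetical helper for bug reports; it only reads information and
    changes nothing on the system.
    """
    report = {"tools": {}, "env": {}}
    for name in tools:
        path = shutil.which(name)  # None if the tool is not on PATH
        version = None
        if path is not None:
            try:
                result = subprocess.run(
                    [path, "--version"],
                    capture_output=True, text=True, check=False, timeout=30,
                )
                # Some tools print version info to stderr, so keep both streams.
                version = (result.stdout + result.stderr).strip()
            except (OSError, subprocess.TimeoutExpired) as exc:
                version = f"failed to run: {exc}"
        report["tools"][name] = {"path": path, "version": version}
    for var in env_vars:
        report["env"][var] = os.environ.get(var)  # None if unset
    return report


if __name__ == "__main__":
    info = collect_toolchain_report()
    for name, entry in info["tools"].items():
        print(f"== {name} ({entry['path']}) ==")
        print(entry["version"])
    for var, value in info["env"].items():
        print(f"{var}={value}")
```

Running it inside the same Slurm session used for training should show which `nvcc` and `nvc++` are actually on `PATH` there, which may differ from the login node.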
@qmin2 - closing for now since this doesn't appear to be a DeepSpeed issue to me. If you are still having problems after resolving the nvhpc compiler issue, please comment here and we can re-open this issue as well. Thanks!
I encountered multiple issues while trying to perform full fine-tuning of the LLaMA 3 8B model using DeepSpeed on 2 x A100-80GB GPUs.
As a result, I decided to follow the DeepSpeed tutorial on Huggingface.
Below is the command I used, which closely follows the example in the tutorial:
And this is ds_config_zero3.json
Then I got this error
For reference, I'm running in interactive mode on a Slurm cluster.
GPU: A100-80GB x 2
gcc --version: 12.2.0
nvcc --version: 11.8
nvc++ --version: nvc++ 22.2-0 64-bit target on x86-64 Linux -tp zen3
nvidia-smi shows: NVIDIA-SMI 530.30.02, Driver Version: 530.30.02, CUDA Version: 12.1
This is pip list
I have spent a lot of time on this issue. Is there a solution?