togethercomputer / OpenChatKit

Cupy error while training (`CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal`) #52

Closed: orangetin closed this issue 1 year ago

orangetin commented 1 year ago

Describe the bug The bash script to train the model does not work because of a CuPy error. The launcher starts eight worker processes; each failing rank prints the same traceback (the output is interleaved, so a single copy is shown below), while rank 0 initializes its NCCLCommunicator:

(OpenChatKit-Test) user@pc:~/OpenChatKit$ bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
Traceback (most recent call last):
  File "/home/user/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/home/user/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/home/user/OpenChatKit/training/comm/comm_utils.py", line 103, in init_communicators
    _PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
  File "/home/user/OpenChatKit/training/comm/nccl_backend.py", line 31, in __init__
    cupy.cuda.Device(cuda_id).use()
  File "cupy/cuda/device.pyx", line 196, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 222, in cupy.cuda.device.Device.use
  File "cupy_backends/cuda/api/runtime.pyx", line 365, in cupy_backends.cuda.api.runtime.setDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 142, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0

To Reproduce Steps to reproduce the behavior:

  1. Run the code on WSL-Ubuntu in a Conda environment
  2. Run bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
  3. The error above is produced

Expected behavior The training script is expected to run without errors.

Screenshots NA

Desktop (please complete the following information):

Additional context The previous steps to download the data and weights also gave me errors. These steps:

python data/OIG/prepare.py
python pretrained/GPT-NeoX-20B/prepare.py

Both ended after a couple of minutes/hours with the error message "Killed". I was able to acquire the datasets with a simple wget command, but that seemed odd to me too.

orangetin commented 1 year ago

Update: I was able to fix this particular error by limiting the script to just one pipeline rank. I was also able to run ~python data/OIG/prepare.py~ pretrained/GPT-NeoX-20B/prepare.py by forcing it to offload to the hard disk when GPU/CPU memory is limited. I will share this fix a little later, once I figure out how to run the rest of the scripts, as it may help others run this on lower-end hardware.
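For context, cudaErrorInvalidDevice: invalid device ordinal means the process asked for a --cuda-id that the machine doesn't have. A minimal check with CuPy (the same calls the traceback goes through):

import cupy

# On a single-GPU machine only device 0 exists, so any --cuda-id >= 1 will
# trigger cudaErrorInvalidDevice when the communicator calls Device(id).use().
print("visible CUDA devices:", cupy.cuda.runtime.getDeviceCount())

cupy.cuda.Device(0).use()      # succeeds when at least one GPU is present
# cupy.cuda.Device(1).use()    # raises CUDARuntimeError on a single-GPU box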

I believe this script can be tweaked to run on machines with lower specs than the currently listed minimum requirements, but this will need further investigation. I will be looking into it and will post an update soon.

But FOR NOW, the script crashes with just the message "Killed" and the line number in the bash script. I was able to trace the error back to somewhere between lines 163-223 in training/pipeline_parallel/dist_gpipe_pipeline_async.py.

I will investigate this further and report back. In the meantime, if anybody knows what's going on with this, I'd appreciate the help.

csris commented 1 year ago

Looks like the Nvidia GeForce 3060 has either 12GB or 8GB of VRAM. Unfortunately, I don't think you'll be able to train on this card. I think we normally require 8x A100 80GB GPUs to do a full training. @LorrinWWW, do you have any advice for training on lower-end hardware?

Also, @orangetin, can you tell me more about your fix for data/OIG/prepare.py? All it does is download data using git lfs and unzip files using the standard library's gzip. It shouldn't be touching the GPU.
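For reference, that gzip step is plain standard-library I/O along these lines (the file name is a placeholder, not a path from the repo):

import gzip
import shutil

# Stream-decompress a downloaded .gz shard to disk; pure stdlib I/O, no GPU involved.
with gzip.open("example_shard.jsonl.gz", "rb") as src, open("example_shard.jsonl", "wb") as dst:
    shutil.copyfileobj(src, dst)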

Thank you for the detailed bug report! The details were very helpful.

orangetin commented 1 year ago

@csris correction: I fixed pretrained/GPT-NeoX-20B/prepare.py not data/OIG/prepare.py. data/OIG/prepare.py ran just fine for me.

Every time I ran pretrained/GPT-NeoX-20B/prepare.py, it ran for a couple of minutes and then just printed "Killed". I figured out the issue: my computer was running out of memory. I traced the error back to line 27: model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16)

AutoModelForCausalLM.from_pretrained can accept two more arguments: device_map="auto", offload_folder="SOME_FOLDER". This forces transformers to use the hard disk as cache storage when the RAM isn't enough.

So, to anyone trying to run this on lower-end hardware with not enough RAM, change line 27 of pretrained/GPT-NeoX-20B/prepare.py, to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER") and replace SOME_FOLDER with an existing but empty directory.
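A self-contained sketch of the modified load (the helper name, offload path, and the model id in the usage line are illustrative, not from the repo; device_map="auto" requires the accelerate package):

import torch
from transformers import AutoModelForCausalLM

def load_offloaded(model_name: str, offload_dir: str = "./offload"):
    """Load a causal LM in fp16, spilling weights that don't fit in memory to disk."""
    # device_map="auto" lets accelerate place layers on GPU/CPU as space allows;
    # whatever is left over is written to offload_dir instead of exhausting RAM.
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder=offload_dir,
    )

# e.g. model = load_offloaded("EleutherAI/gpt-neox-20b")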

I'm currently trying to get training running. I was able to get through 3 layers of training before it crashed (out of memory). Tweaking the PyTorch configuration should eliminate this issue. I have traced the source of the crash and will report back when it works. I don't believe the listed minimum requirements need to be quite this high, though the code for bot.py does seem bloated.

Unfortunately, I'm a college student and, as of right now, can't afford 8x A100 80GB GPUs, but I'm determined to make this work XD. I was able to run pretty large models on just a CPU so I think this should be possible with the GeForce.

csris commented 1 year ago

@orangetin, are you on our Discord server (https://discord.gg/7fDdZNwA)? I'd like to chat with you about this effort.

csris commented 1 year ago

So, to anyone trying to run this on lower-end hardware with not enough RAM, change line 27 of pretrained/GPT-NeoX-20B/prepare.py, to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER") and replace SOME_FOLDER with an existing but empty directory.

Makes sense and might be a good change even on systems with a lot of RAM. Mind submitting a PR? I'll try it out on one of our machines.

I'm currently trying to get training running. I was able to get through 3 layers of training before it crashed (out of memory). Tweaking the PyTorch configuration should eliminate this issue. I have traced the source of the crash and will report back when it works. I don't believe the listed minimum requirements need to be quite this high, though the code for bot.py does seem bloated.

That's really impressive! If you get this working, definitely mention this in the #openchatkit channel on the Discord server. There have been lots of people trying to make this work on lower-end hardware.

orangetin commented 1 year ago

Makes sense and might be a good change even on systems with a lot of RAM. Mind submitting a PR? I'll try it out on one of our machines.

Yup, I'll submit the PR soon.

are you on our Discord server (https://discord.gg/7fDdZNwA)? I'd like to chat with you about this effort.

I'm on the server, I can send you a message.

Ruka-2019 commented 1 year ago

I hit this error and changed line 27 of pretrained/GPT-NeoX-20B/prepare.py to model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder="SOME_FOLDER"), also running on WSL. For me, some of the resulting .pt files came out broken that way (e.g. pytorch_lm_head.pt), so training still failed. A workaround was to run prepare.py on Windows to download the .pt files and then move them into my WSL. With 32GB of RAM I was able to run the script.
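A quick way to spot a broken shard before launching training (the directory below is a guess; point it at wherever prepare.py wrote the .pt files):

import torch

# Try to deserialize each converted shard on CPU; a truncated or corrupt file
# raises an exception here instead of failing later inside the training run.
for path in ["pretrained/GPT-NeoX-20B/pytorch_lm_head.pt"]:
    obj = torch.load(path, map_location="cpu")
    print(path, "loads OK:", type(obj).__name__)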

AndyInAi commented 1 year ago

# vi training/finetune_GPT-NeoXT-Chat-Base-20B.sh
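# Single-GPU workaround: only rank 0 (--cuda-id 0) is launched; the other seven
# ranks are commented out so that no process requests a CUDA device that does
# not exist on the machine.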

(trap 'kill 0' SIGINT; \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 2 --rank 2 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 3 --rank 3 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 4 --rank 4 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 5 --rank 5 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 6 --rank 6 \
#     & \
# python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 7 --rank 7 \
#     & \
wait)

orangetin commented 1 year ago

Fixed.

For training: invalid CUDA ID followed by an OOM error. Solved by fixing the CUDA IDs (launching only ranks for GPUs that actually exist) and using a GPU with the required amount of VRAM for training.

For downloading model: Solved in #63 by offloading parts of the model to disk.

joecodecreations commented 1 year ago

I also see that there is an argument for --offload-dir

orangetin commented 1 year ago

I also see that there is an argument for --offload-dir

@joecodecreations Yes, that argument was added in the PR mentioned above.

darrinh commented 1 year ago

Fixed.

For training: invalid CUDA ID followed by an OOM error. Solved by fixing the CUDA IDs (launching only ranks for GPUs that actually exist) and using a GPU with the required amount of VRAM for training.

For downloading model: Solved in #63 by offloading parts of the model to disk.

How much VRAM did you end up having to use for training?