jingyaogong / minimind

[Large Models] Train a 26M-parameter GPT completely from scratch in 3 hours; it can be trained and run for inference on a personal GPU!
https://jingyaogong.github.io/minimind
Apache License 2.0
2.7k stars 329 forks

CUDA_HOME does not exist, unable to compile CUDA op(s) #48

Closed · ozbillwang closed this issue 4 weeks ago

ozbillwang commented 1 month ago

Got this issue when running the command deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py:

CUDA_HOME does not exist, unable to compile CUDA op(s)

Here is the full log

$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-27 23:42:23,326] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/bill/study/github/jingyaogong/minimind/venv/bin/deepspeed", line 3, in <module>
    from deepspeed.launcher.runner import main
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 51, in installed_cuda_version
    raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)

I installed all Python packages via virtualenv.

Notes:

  1. One of the packages in requirements.txt doesn't support the latest Python 3.12.x, so I had to use pyenv to install Python 3.11.x

  2. Need to install nvidia-cuda-toolkit (https://github.com/jingyaogong/minimind/issues/48#issuecomment-2379356422)

  3. (not required) set export CUDA_VISIBLE_DEVICES=0 (https://github.com/jingyaogong/minimind/issues/48#issuecomment-2380418021)

  4. Followed #26, whose command is below, but adjusted it to --num_gpus=1 since I have only one GPU

    deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
  5. Got an Out of Memory error; as recommended, I lowered the batch size, but the deepspeed command doesn't support --batch_size yet, so I adjusted and ran Python directly (a quick GPU memory check is sketched after this list)

    python 1-pretrain.py --batch_size 16
  6. Swap was not enabled, so I added a new 64GB /swapfile2 and registered it in /etc/fstab

  7. It ran for a while but was killed partway through; the recommendation then was to set max_seq_len to 200 in model/LMConfig.py
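For reference, a minimal sketch (my own, not part of the repo) to check how much GPU memory is actually available before picking a batch size:

    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU 0: {props.name}, total memory: {props.total_memory / 1024**3:.2f} GiB")
        print(f"currently allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
    else:
        print("No CUDA device visible")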

ozbillwang commented 1 month ago

Fixed by installing the CUDA toolkit:

sudo apt install nvidia-cuda-toolkit
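After installing the toolkit, nvcc should be on the PATH and PyTorch should be able to locate the CUDA install; a quick check (my own sketch, not from the repo):

    import os
    from torch.utils.cpp_extension import CUDA_HOME  # the CUDA path PyTorch resolves from nvcc / the environment

    print("CUDA_HOME env var:", os.environ.get("CUDA_HOME"))
    print("CUDA_HOME detected by PyTorch:", CUDA_HOME)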
ozbillwang commented 1 month ago

Got a new error:

$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 00:02:04,410] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 00:02:06,256] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-28 00:02:06,256] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 00:02:08,637] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 00:02:10,402] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 00:02:10,402] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 00:02:10,402] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 00:02:10,402] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 00:02:10,402] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 00:02:10,403] [INFO] [launch.py:256:main] process 99931 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 00:02:10,403] [INFO] [launch.py:256:main] process 99932 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
usage: 1-pretrain.py [-h] [--out_dir OUT_DIR] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--device DEVICE] [--dtype DTYPE]
                     [--use_wandb] [--wandb_project WANDB_PROJECT] [--num_workers NUM_WORKERS] [--data_path DATA_PATH] [--ddp]
                     [--accumulation_steps ACCUMULATION_STEPS] [--grad_clip GRAD_CLIP] [--warmup_iters WARMUP_ITERS] [--log_interval LOG_INTERVAL]
                     [--save_interval SAVE_INTERVAL]
1-pretrain.py: error: unrecognized arguments: --local_rank=1
usage: 1-pretrain.py [-h] [--out_dir OUT_DIR] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--device DEVICE] [--dtype DTYPE]
                     [--use_wandb] [--wandb_project WANDB_PROJECT] [--num_workers NUM_WORKERS] [--data_path DATA_PATH] [--ddp]
                     [--accumulation_steps ACCUMULATION_STEPS] [--grad_clip GRAD_CLIP] [--warmup_iters WARMUP_ITERS] [--log_interval LOG_INTERVAL]
                     [--save_interval SAVE_INTERVAL]
1-pretrain.py: error: unrecognized arguments: --local_rank=0
[2024-09-28 00:02:25,405] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 99931
[2024-09-28 00:02:25,406] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 99932
[2024-09-28 00:02:25,415] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 2
jingyaogong commented 1 month ago

(quoting the log above, ending in: 1-pretrain.py: error: unrecognized arguments: --local_rank=0)

The issue is that the script does not recognize the --local_rank argument. DeepSpeed automatically appends --local_rank when launching distributed training, but the script, which was updated a few days ago, did not handle this argument.

parser.add_argument('--local_rank', type=int, default=-1, help='local rank for distributed training')
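For context, a minimal sketch (illustrative, not the repo's exact code) of how a DeepSpeed-launched script typically consumes this flag:

    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1,
                        help='local rank injected by the DeepSpeed launcher')
    args = parser.parse_args()

    if args.local_rank >= 0:
        # bind this process to its own GPU before initializing the process group
        torch.cuda.set_device(args.local_rank)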

This has been added in the bug-fix commit.

You can pull the latest code and try again.

Thank you for identifying this potential bug; there was indeed an oversight.

Thanks!

jingyaogong commented 1 month ago

I'm not sure whether you have a Chinese-language background. If you do, that's perfect and there's no need to change the training set or anything else. Otherwise, you will need to find English training corpora (for both the pretrain and full_sft stages) to replace the current defaults.

The expected format follows what the data_process.py code produces. You just need to clean your data into jsonl files in the same format, or adjust the code to fit the new dataset. Only the data-preprocessing part of data_process.py needs to change; nothing else does (of course, the test questions in 0-eval-pretrain.py and 2-eval.py also need to be in English, but that's a small matter). Even where changes are needed they will be minimal, and I believe you can handle them easily. Feel free to reach out if you have any further questions~
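For illustration, a hypothetical sketch of writing such a jsonl corpus; the field names here are an assumption and must match whatever data_process.py and the dataset loader actually expect:

    import json

    english_samples = ["An English paragraph for pretraining ...", "Another paragraph ..."]
    with open("pretrain_data_en.jsonl", "w", encoding="utf-8") as f:
        for text in english_samples:
            # one JSON object per line; "text" is an assumed field name
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")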

ozbillwang commented 1 month ago

(I can't type Chinese yet, since I only installed Ubuntu yesterday)

Thanks for the fix. After pulling the latest code, I got another issue:

$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 15:20:42,067] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:20:43,849] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-28 15:20:43,850] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 15:20:46,162] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:20:47,893] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 15:20:47,893] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 15:20:47,893] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 15:20:47,893] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 15:20:47,893] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 15:20:47,893] [INFO] [launch.py:256:main] process 6147 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 15:20:47,894] [INFO] [launch.py:256:main] process 6148 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
Traceback (most recent call last):
  File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 170, in <module>
    init_distributed_mode()
  File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 127, in init_distributed_mode
    torch.cuda.set_device(DEVICE)
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-09-28 15:20:51,895] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6147
[2024-09-28 15:20:52,048] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6148
[2024-09-28 15:20:52,049] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 1
ozbillwang commented 1 month ago

After I set export CUDA_VISIBLE_DEVICES=0,1, the code seemed to move on, but stopped with another, similar issue:

LLM总参数量:26.878 百万 (total LLM parameters: 26.878 million)

Full logs:

$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 15:38:32,551] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:38:34,404] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-09-28 15:38:34,404] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 15:38:36,744] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:38:38,635] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 15:38:38,635] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 15:38:38,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 15:38:38,636] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 15:38:38,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 15:38:38,636] [INFO] [launch.py:256:main] process 9923 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 15:38:38,637] [INFO] [launch.py:256:main] process 9924 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
Traceback (most recent call last):
  File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 170, in <module>
    init_distributed_mode()
  File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 127, in init_distributed_mode
    torch.cuda.set_device(DEVICE)
  File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

LLM总参数量:26.878 百万
[2024-09-28 15:38:43,637] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 9923
[2024-09-28 15:38:43,756] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 9924
[2024-09-28 15:38:43,757] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 1
jingyaogong commented 1 month ago

(quoting the log above, ending in: RuntimeError: CUDA error: invalid device ordinal)

Everything works fine on my end; I could not reproduce the issue.

Are you sure there are two GPUs on the machine? If not, set --num_gpus=1. With --num_gpus=1 it is no different from running python 1-pretrain.py directly, so why use DeepSpeed to launch the script?
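A quick way to check how many GPUs PyTorch actually sees (a minimal sketch):

    import torch

    print("Visible CUDA devices:", torch.cuda.device_count())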

😊

ozbillwang commented 1 month ago

Thanks, I followed https://github.com/jingyaogong/minimind/issues/26 and didn't realize the author has 2 GPUs.

Now I get an OutOfMemory error. My GPU is a 4060 with only 8GB of memory, so it seems it doesn't work with this project.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 49.56 MiB is free. Including non-PyTorch memory, this process has 7.43 GiB memory in use. Of the allocated memory 7.16 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-09-28 16:57:33,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6656
[2024-09-28 16:57:34,932] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0'] exits with return code = 1
jingyaogong commented 1 month ago

(quoting the OutOfMemory report above)
parser.add_argument("--batch_size", type=int, default=64, help="Batch size")

You can try lowering the batch size to 32, 16, 8, or even 4; experiment with values smaller than the default of 64. Thank you.

If you start it by running python 1-pretrain.py, Ubuntu is not necessary; Windows will suffice.
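The OOM message itself also hints at an allocator setting; a hedged sketch of applying it (it must run before the first CUDA allocation, e.g. at the very top of 1-pretrain.py). It only reduces fragmentation and does not create extra memory, so a smaller --batch_size remains the main fix:

    import os

    # must run before the first CUDA allocation in the process
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")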

ozbillwang commented 1 month ago

(quoting the batch-size suggestion above)

It is better after adjusting to --batch_size=16, but after running for about 2 hours it crashed again.

Later, I increased the swap to 100GB, but it still crashes partway through.

jingyaogong commented 1 month ago

(quoting the exchange above about the batch size and the crash after about 2 hours)

@ozbillwang Another approach is to set max_seq_len to 200 in the model/LMConfig.py file. Reducing the context length will save a significant amount of GPU memory, allowing you to keep a batch size of 32 or even higher.
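For illustration only, assuming LMConfig exposes max_seq_len as a constructor argument (check model/LMConfig.py for the actual signature and default), the override would look roughly like this:

    from model.LMConfig import LMConfig

    # hypothetical override; lowering the context length cuts activation memory
    lm_config = LMConfig(max_seq_len=200)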

ozbillwang commented 1 month ago

Thanks, still the same issue.