lxe / llama-tune

LLaMa Tuning with Stanford Alpaca Dataset using Deepspeed and Transformers

CPU out of memory with 128 GB, how can I fit the model in? #4

Open SeekPoint opened 1 year ago

SeekPoint commented 1 year ago

```
(gh_llama-tune) amd00@asus00:~/llm_dev/llama-tune$ CUDA_VISIBLE_DEVICES=0 deepspeed tune.py
[2023-06-04 23:27:03,347] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0: setting --include=localhost:0
[2023-06-04 23:27:03,371] [INFO] [runner.py:541:main] cmd = /home/amd00/anaconda3/envs/gh_llama-tune/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None tune.py
[2023-06-04 23:27:04,632] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-06-04 23:27:04,632] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-06-04 23:27:04,632] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-06-04 23:27:04,632] [INFO] [launch.py:247:main] dist_world_size=1
[2023-06-04 23:27:04,632] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
Loading checkpoint shards: 100%|██████████| 33/33 [00:07<00:00, 4.36it/s]
Full Train dataset size: 41601
Full Eval dataset size: 10401
Small Train dataset size: 1000
Small Eval dataset size: 1000
[2023-06-04 23:27:39,426] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-04 23:27:39,431] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-06-04 23:27:40,632] [WARNING] [cpu_adam.py:84:init] FP16 params for CPUAdam may not work on AMD CPUs
Using /home/amd00/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/amd00/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.325794219970703 seconds
Using /home/amd00/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/amd00/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.05765795707702637 seconds
Rank: 0 partition count [1] and sizes[(6738415616, False)]
[2023-06-04 23:28:22,765] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 4956
[2023-06-04 23:28:22,792] [ERROR] [launch.py:434:sigkill_handler] ['/home/amd00/anaconda3/envs/gh_llama-tune/bin/python3', '-u', 'tune.py', '--local_rank=0'] exits with return code = -9
(gh_llama-tune) amd00@asus00:~/llm_dev/llama-tune$
```
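
Return code -9 means the subprocess was killed with SIGKILL, which on Linux typically comes from the kernel OOM killer. The log shows ZeRO optimizer offload partitioning all 6,738,415,616 parameters (a 7B LLaMA) onto a single rank, with CPUAdam holding its FP32 optimizer state in host RAM. A rough estimate, assuming the standard Adam/ZeRO-Offload accounting of 4 bytes each for FP32 master weights, momentum, variance, and gradients, already lands near the 128 GB ceiling before the model copy, pinned buffers, and the OS take their share:

```python
# Back-of-the-envelope CPU-RAM estimate for ZeRO optimizer offload with CPUAdam.
# The per-parameter byte counts are the usual Adam/ZeRO-Offload accounting,
# not values measured from this repo.
params = 6_738_415_616  # from the log: "sizes[(6738415616, False)]"

fp32_master   = 4 * params  # FP32 copy of the weights
fp32_momentum = 4 * params  # Adam exp_avg
fp32_variance = 4 * params  # Adam exp_avg_sq
fp32_grads    = 4 * params  # gradients materialized on CPU for the update

total = fp32_master + fp32_momentum + fp32_variance + fp32_grads
print(f"{total / 2**30:.1f} GiB")  # ~100.4 GiB of optimizer state alone
```

The deprecation warning also suggests the training config still uses the old `cpu_offload` flag. A hypothetical sketch of the modern equivalent (the repo's actual DeepSpeed config isn't shown in this log) would be:

```python
# Hypothetical DeepSpeed config fragment: "offload_optimizer" replaces the
# deprecated "cpu_offload" flag inside "zero_optimization".
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "fp16": {"enabled": True},
}
```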