togethercomputer / OpenChatKit


When doing model training offline, met the issue below #128

Closed yxy123 closed 1 year ago

yxy123 commented 1 year ago

Error log: /mnt/tet/OpenChatKit-main/training --model-name /mnt/tet/OpenChatKit-main/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/ --tokenizer-name /mnt/tet/OpenChatKit-main/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/ --project-name together --model-type gptneox --optimizer adam --seed 42 --load-pretrained-model true --task-name /mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_ni.jsonl:0.2,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_p3.jsonl:0.5,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_flan.jsonl:0.2,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_chip2.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_rallio_safety_and_prosocial.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_soda_dialog.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_unifiedskg_instructions.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_merged_code_xp3.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_oscar_en_sample_dialog.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_ul2_plus_oscar_en_sample_dialog.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_multi_news.jsonl:0.05,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_openai_summarize_tldr.jsonl:0.05,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_squad_v2.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_nq.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_poetry_instructions.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_sqlv2.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_unnatural_instructions.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_conv_finqa.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_essays.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_plot_screenplay_books_dialog.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_grade_school_math_instructions.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_mathqa_flanv2_kojma_cot.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_joke_explanations.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_cuad.jsonl:0.01,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_abstract_infill.jsonl:0.1,/mnt/tet/OpenChatKit-main/training/../data/OIG/files/unified_image_prompts_instructions.jsonl:0.01 --checkpoint-path /mnt/tet/OpenChatKit-main/training/../model_ckpts/Pythia-Chat-Base-7B --total-steps 20000 --warmup-steps 10 --train-warmup-steps 0 --checkpoint-steps 100 --lr 1e-5 --seq-length 2048 --batch-size 32 --micro-batch-size 1 --gradient-accumulate-step 1 --num-layers 8 --embedding-dim 4096 --world-size 8 --pipeline-group-size 4 --data-group-size 2 --job-id 0 --net-interface lo --fp16 --dp-backend nccl --dp-mode allreduce --pp-mode gpipe --profiling no-profiling


True exception*****


True exception*****

Traceback (most recent call last):
  File "/mnt/tet/OpenChatKit-main/training/dist_clm_train.py", line 360, in <module>
    main()
  File "/mnt/tet/OpenChatKit-main/training/dist_clm_train.py", line 277, in main
    init_communicators(args)
  File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=360), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 786, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/root/miniconda3/envs/myconda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 346, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:03:00)

(The same traceback is printed a second time in the log.)

orangetin commented 1 year ago

How many GPUs are you running this on? IIRC, this is an issue when running on <8 GPUs with world-size=8
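
Roughly: the store-based barrier waits until world-size ranks have joined, and in these scripts world-size equals pipeline-group-size * data-group-size, with one training process launched per rank. A minimal sketch of a 2-GPU launch that keeps those numbers consistent (the port, paths, and the omitted model/tokenizer/data flags below are placeholders, not copied from the repo's scripts):

# Sketch only: world-size must match the number of ranks actually launched.
N_GPUS=2
COMMON_ARGS="--world-size ${N_GPUS} --pipeline-group-size 2 --data-group-size 1 \
  --dist-url tcp://127.0.0.1:7033 --job-id 0 --net-interface lo"
(trap 'kill 0' SIGINT
for RANK in $(seq 0 $((N_GPUS - 1))); do
  # model/tokenizer/task flags omitted; add them as in the finetune scripts
  python dist_clm_train.py ${COMMON_ARGS} --cuda-id ${RANK} --rank ${RANK} &
done
wait)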

yxy123 commented 1 year ago

With 2 GPUs; I modified the training parameters as below:

--num-layers 4 --embedding-dim 4096 \
--world-size 2 --pipeline-group-size 2 --data-group-size 1 \

It seems to get past the above error, but then hits the memory issue below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.74 GiB total capacity; 14.26 GiB already allocated; 12.69 MiB free; 14.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Running environment:
GPU graphics card memory: 32 GB
Memory available: 120 GB
Hard drive available space: 400 GB

orangetin commented 1 year ago

@yxy123 , you ran out of GPU memory. Try decreasing the batch-size or using a smaller model (like togethercomputer/RedPajama-INCITE-Chat-3B-v1).

Let me know if that works.
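
For concreteness, one way to do that, assuming the stock training/finetune_RedPajama-INCITE-Chat-3B-v1.sh script and that it still uses --batch-size 32 like the 7B command above (8 is only an example value, not a tuned number):

# Example only: shrink the global batch size in place, then relaunch.
sed -i 's/--batch-size 32/--batch-size 8/' training/finetune_RedPajama-INCITE-Chat-3B-v1.sh
bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh

(--micro-batch-size is already 1 in these scripts, so the global --batch-size is the main knob.)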

Closing issue as the original error is fixed.

yxy123 commented 1 year ago

1) Ran the RedPajama-3B pretrained-model preparation successfully:

(myconda) root@qd32bL:/mnt/oldKit/OpenChatKit-main/pretrained/RedPajama-3B# ls
prepare.py  togethercomputer_RedPajama-INCITE-Chat-3B-v1

2) Then ran the training script finetune_RedPajama-INCITE-Chat-3B-v1.sh with some parameters modified as below. I don't know if there are other parameters in finetune_RedPajama-INCITE-Chat-3B-v1.sh that also need to change.

ARGS="--model-name ${BASE_MODEL} \ --tokenizer-name ${BASE_MODEL} \ --project-name together \ --model-type gptneox \ --optimizer adam \ --seed 42 \ --load-pretrained-model true \ --task-name \ "${DATASETS}" \ --checkpoint-path ${CHECKPOINT_PATH} \ --total-steps ${TOTAL_STEPS} --warmup-steps 0 --train-warmup-steps 0 \ --checkpoint-steps ${CHECKPOINT_STEPS} \ --lr 1e-5 --seq-length 2048 --batch-size 32 --micro-batch-size 1 --gradient-accumulate-step 1 \ --dist-url tcp://127.0.0.1:7033 \ --num-layers 4 --embedding-dim 2560 \ --world-size 1 --pipeline-group-size 1 --data-group-size 1 \ --job-id 0 --net-interface ${netif} \ --fp16 \ --dp-backend nccl \ --dp-mode allreduce \ --pp-mode gpipe --profiling no-profiling"

(trap 'kill 0' SIGINT; \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    & \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \
    & \

3) Ran bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh and again hit a memory issue, as below:

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 15.74 GiB total capacity; 13.21 GiB already allocated; 284.69 MiB free; 14.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
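
The message again suggests setting max_split_size_mb; as far as I understand, that is done via the PYTORCH_CUDA_ALLOC_CONF environment variable before launching (128 below is just an example value), e.g.:

# Example only: cap allocator block splitting at 128 MiB to reduce fragmentation,
# then relaunch the finetune script.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh

Not sure whether that alone is enough here, though.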

yxy123 commented 1 year ago

@orangetin could you help check this? When running bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh, I still hit the memory issue above.