togethercomputer / OpenChatKit

RuntimeError: Socket Timeout #97

Open angeliababy opened 1 year ago

angeliababy commented 1 year ago

sh training/finetune_Pythia-Chat-Base-7B.sh

Namespace(use_cuda=True, cuda_id=0, cuda_num=1, debug_mem=True, dist_backend='cupy_nccl', dp_backend='nccl', dist_url='tcp://127.0.0.1:7033', world_size= train_data=['./glue_dataset/data/QQP/train.tsv'], valid_data=['./glue_dataset/data/QQP/test.tsv'], tokenizer_type='BertWordPieceLowerCase', vocab_file='', train_log_backend='print', project_name='together', batch_size=32, micro_batch_size=1, lr=1e-05, num_iters=10, fp16=True, loss_scale=0, initial_loss_slreduce', gradient_accumulate_step=1, model_name='/data/app/OpenChatKit/training/../pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/', toketype='gptneox', checkpoint_path='/data/app/OpenChatKit/training/../model_ckpts/Pythia-Chat-Base-7B', task_name='/data/app/OpenChatKit/training/../data/OI_checkpoint=True, seed=42, profiling='no-profiling', trace_postfix='default', evaluation_steps=0, evaluation_data=None, evaluation_num_batch=None, checkp

Traceback (most recent call last):
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 358, in <module>
    main()
  File "/data/app/OpenChatKit/training/dist_clm_train.py", line 275, in main
    init_communicators(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 85, in init_communicators
    default_init(args)
  File "/data/app/OpenChatKit/training/comm/comm_utils.py", line 81, in default_init
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/data/anaconda3/envs/OpenChatKit/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 862, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout

The error is reported when running with a single GPU.
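For context: the timeout comes from the gloo rendezvous in default_init, which blocks until world_size processes have connected to dist_url and gives up after the 5-minute timeout. A minimal sketch of that rendezvous, outside the training script (not OpenChatKit code; the address and port are just placeholders copied from the log), is:

# Minimal sketch, assuming only that PyTorch is installed.
# With world_size=1 and rank=0 this returns immediately; if world_size is larger
# than the number of processes actually launched, init_process_group blocks and
# eventually raises "RuntimeError: Socket Timeout", as in the log above.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend='gloo',
    init_method='tcp://127.0.0.1:7033',   # same dist_url as the finetune script
    timeout=datetime.timedelta(seconds=5 * 60),
    world_size=1,                          # must equal the number of launched ranks
    rank=0,
)
print('rendezvous ok')
dist.destroy_process_group()

So a single-GPU run has to launch exactly as many ranks as the --world-size it passes in, otherwise the rendezvous never completes.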

darrinh commented 1 year ago

Getting same error here.

darrinh commented 1 year ago

Some of the other parameters need to be adjusted for a single GPU:

--num-layers 4 --embedding-dim 4096 \
--world-size 1

Gets me:

Initialize NCCLCommunicator: < pipeline_group_0 >; rank: 0 comm init done!!

But I forgot to download the pretrained model (as per the training instructions), so it stopped there. Will post results once that step is complete.

cheers Darrin

yxy123 commented 1 year ago

Hi Darrin, I'm also getting the same error here with two GPUs. I only modified the finetuning script:

python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 0 --rank 0 \
    & \
python ${DIR}/dist_clm_train.py $(echo ${ARGS}) --cuda-id 1 --rank 1 \

For finetuning I also need:

--num-layers 4 --embedding-dim 4096 \
--world-size 2 --pipeline-group-size 4 --data-group-size 2 \

right? I have tried with a single GPU and modified the related parameters, but hit the issue below:

File "/mnt/tet/OpenChatKit-main/training/comm/comm_utils.py", line 86, in init_communicators
    assert args.world_size == args.data_group_size * args.pipeline_group_size
AssertionError

Thanks Yuanyuan

darrinh commented 1 year ago

It won't train on my 12 GB GPU; it runs out of memory and needs more VRAM than I currently have.

orangetin commented 1 year ago

@darrinh The fine-tuning script will most likely not work with 12 GB of VRAM. I'd recommend using LoRA for fine-tuning instead.

Here's some sample code to get you started: https://github.com/togethercomputer/OpenChatKit/blob/ecfe4d5d9b5f4b1a533c4468cc1b7e1107b9a819/training/lora/redpajama-incite-chat-3b.py
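The linked script targets RedPajama-INCITE-Chat-3B specifically. For the general idea, here is a rough sketch of LoRA fine-tuning with Hugging Face peft (this is not the repository's script; the model id, rank, and target modules are assumptions chosen for illustration). Because only the small adapter matrices are trained, it fits in far less VRAM than full fine-tuning:

# Hedged sketch of LoRA fine-tuning with Hugging Face peft; not OpenChatKit's
# training loop. Model name and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "togethercomputer/Pythia-Chat-Base-7B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # attention projection in GPT-NeoX-style models
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters
# ...then train with a standard transformers Trainer or a custom loop.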

darrinh commented 1 year ago

Thanks @orangetin, it starts but quickly runs out of memory. Thanks for the link, will check it out.

thanks

orangetin commented 1 year ago


@yxy123 The arguments provided are invalid. args.world_size == args.data_group_size * args.pipeline_group_size must be true.

Change this line:

--world-size 2 --pipeline-group-size 4 --data-group-size 2

so that world_size = pipeline-group-size * data-group-size.
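For reference, a small hypothetical helper (check_layout is not part of the repo; it just mirrors the assert in training/comm/comm_utils.py) showing which combinations pass:

# Hypothetical helper mirroring the assert in training/comm/comm_utils.py;
# not part of OpenChatKit itself.
def check_layout(world_size: int, pipeline_group_size: int, data_group_size: int) -> None:
    assert world_size == pipeline_group_size * data_group_size, (
        f"world_size ({world_size}) must equal pipeline_group_size "
        f"({pipeline_group_size}) * data_group_size ({data_group_size})"
    )

check_layout(world_size=2, pipeline_group_size=2, data_group_size=1)   # OK for 2 GPUs
check_layout(world_size=1, pipeline_group_size=1, data_group_size=1)   # OK for 1 GPU
# check_layout(world_size=2, pipeline_group_size=4, data_group_size=2) # AssertionError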

yxy123 commented 1 year ago

@orangetin Got it, thanks very much, it worked.