shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered. indexSelectLargeIndex: block: [58,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed. #275

Closed · yyz-selfiie closed this issue 10 months ago

yyz-selfiie commented 10 months ago

    FlashAttention-2 is not installed, ignore this if you are not using FlashAttention.
    2023-11-29 18:13:51.851 | WARNING  | __main__:__post_init__:206 - You may set max_train_samples = -1 to run all samples in production.
    2023-11-29 18:13:52.656 | INFO     | __main__:main:880 - Model args: ModelArguments(model_type='llama', model_name_or_path='merged-pt', load_in_8bit=False, load_in_4bit=False, tokenizer_name_or_path=None, cache_dir=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True, rope_scaling=None, flash_attn=False, shift_attn=False, neft_alpha=0)
    2023-11-29 18:13:52.656 | INFO     | __main__:main:881 - Data args: DataArguments(dataset_name=None, dataset_config_name=None, train_file_dir='./data/finetune', validation_file_dir='./data/finetune', template_name='vicuna', max_train_samples=1000, max_eval_samples=10, ignore_pad_token_for_loss=True, overwrite_cache=False, validation_split_percentage=1, preprocessing_num_workers=1)
    2023-11-29 18:13:52.657 | INFO     | __main__:main:882 - Training args: Seq2SeqTrainingArguments(
        _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False,
        bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0,
        dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None,
        ddp_find_unused_parameters=False, ddp_timeout=30000, debug=[], deepspeed=None, disable_tqdm=False,
        dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None,
        eval_delay=0, eval_steps=50, evaluation_strategy=steps, fp16=True, fp16_backend=auto, fp16_full_eval=False,
        fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
        fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False,
        generation_config=None, generation_max_length=None, generation_num_beams=None,
        gradient_accumulation_steps=1, gradient_checkpointing=True, gradient_checkpointing_kwargs=None,
        greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False,
        hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>,
        ignore_data_skip=False, include_inputs_for_metrics=False, include_tokens_per_second=False,
        jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05,
        length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive,
        log_level_replica=warning, log_on_each_node=True,
        logging_dir=outputs-sft-v1/runs/Nov29_18-13-51_ip-172-16-58-143.us-east-2.compute.internal,
        logging_first_step=True, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps,
        lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=,
        neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None,
        output_dir=outputs-sft-v1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1,
        per_device_train_batch_size=1, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False,
        push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
        ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None,
        run_name=outputs-sft-v1, save_on_each_node=False, save_safetensors=True, save_steps=500,
        save_strategy=steps, save_total_limit=3, seed=42, skip_memory_metrics=True, sortish_sampler=False,
        split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None,
        torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False,
        use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.05, warmup_steps=0,
        weight_decay=0.05,
    )
    ...
      0%|          | 0/993 [00:00<?, ?it/s]
    /opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [38,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    [... the same assertion repeated for threads [97,0,0] through [124,0,0] of block [38,0,0]; output truncated in the original ...]
    /opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    [... the same assertion repeated for threads [22,0,0] through [31,0,0] of block [58,0,0] ...]
    Traceback (most recent call last):
      File "/home/ec2-user/SageMaker/MedicalGPT/supervised_finetuning.py", line 1346, in <module>
        main()
      File "/home/ec2-user/SageMaker/MedicalGPT/supervised_finetuning.py", line 1307, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
        return inner_training_loop(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
        loss = self.compute_loss(model, inputs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
        outputs = model(**inputs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
        return model_forward(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
        return convert_to_fp32(self.model_forward(*args, **kwargs))
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
        return func(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/peft_model.py", line 1003, in forward
        return self.base_model(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 107, in forward
        return self.model.forward(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
        outputs = self.model(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 886, in forward
        attention_mask = _prepare_4d_causal_attention_mask(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 193, in _prepare_4d_causal_attention_mask
        attention_mask = attn_mask_converter.to_4d(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 101, in to_4d
        causal_4d_mask = self._make_causal_mask(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 131, in _make_causal_mask
        mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

      0%|          | 0/993 [00:03<?, ?it/s]
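
For context on what the assert means (background, not part of the original report): `indexSelectLargeIndex` with `srcIndex < srcSelectDimSize` is the CUDA embedding-lookup kernel rejecting a token id that is out of range for the embedding table. The same lookup on CPU raises a readable `IndexError`, which is why rerunning on CPU or with `CUDA_LAUNCH_BLOCKING=1` is a common way to localize it:

```python
# Minimal repro sketch (illustrative, not from the thread): an embedding
# lookup with a token id >= num_embeddings. On CPU this raises a clear
# IndexError; on CUDA it surfaces as the asynchronous
# `srcIndex < srcSelectDimSize` device-side assert.
import torch

emb = torch.nn.Embedding(num_embeddings=32000, embedding_dim=8)  # Llama-2-sized vocab
bad_ids = torch.tensor([[31999, 32005]])  # 32005 is out of range
emb(bad_ids)  # IndexError: index out of range in self
```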

yyz-selfiie commented 10 months ago

Additional context: I did continued pretraining starting from meditron-7b (https://github.com/epfLLM/meditron), a model already pretrained on medical data on top of Llama 2 7B. During that step I added several new tokens, which are recorded in added_tokens.json. When I then run SFT on top of this checkpoint, I get the error above:

    /opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Is this still caused by a tokenizer problem in the first (pre-training) step? Since I am only debugging, I trained exclusively on the default data in the data folder.
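
A quick way to check that hypothesis (a sketch, assuming `merged-pt` from the logs above is the merged PT checkpoint directory): compare the tokenizer size, which includes added_tokens.json, against the number of rows in the model's input embedding:

```python
# Diagnostic sketch: if len(tokenizer) exceeds the embedding row count,
# any sample containing one of the new tokens triggers the device assert.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("merged-pt")
model = AutoModelForCausalLM.from_pretrained("merged-pt")

embedding_size = model.get_input_embeddings().weight.shape[0]
print(f"tokenizer size (base vocab + added_tokens.json): {len(tokenizer)}")
print(f"embedding rows: {embedding_size}")
assert len(tokenizer) <= embedding_size, "new token ids exceed the embedding table"
```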

shibing624 commented 10 months ago

Since you added tokens, you need to resize the model's embeddings.
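
A minimal sketch of that fix (reusing `model` and `tokenizer` from the diagnostic above): grow the embedding table so every token id the tokenizer can produce has a row. `resize_token_embeddings` resizes both the input embeddings and the output `lm_head`:

```python
# Sketch of the suggested fix: grow the embedding table (and lm_head)
# to cover the ids of the newly added tokens before any forward pass.
if len(tokenizer) > model.get_input_embeddings().weight.shape[0]:
    model.resize_token_embeddings(len(tokenizer))
```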

yyz-selfiie commented 10 months ago

Thanks. Do you mean adding this argument: `--modules_to_save embed_tokens,lm_head`? The relevant code in pretraining.py:

    modules_to_save = script_args.modules_to_save
    if modules_to_save is not None:
        modules_to_save = modules_to_save.split(',')
        # Resize the embedding layer to match the new tokenizer
        embedding_size = model.get_input_embeddings().weight.shape[0]
        if len(tokenizer) > embedding_size:
            model.resize_token_embeddings(len(tokenizer))
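
For reference (my understanding of the PEFT option, not something stated in the thread): `modules_to_save` makes PEFT keep full, trainable copies of the listed modules alongside the LoRA adapters, so the resized `embed_tokens` and `lm_head` weights are actually trained and saved with the adapter checkpoint.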
yyz-selfiie commented 10 months ago

      0%|          | 0/656 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/ec2-user/SageMaker/MedicalGPT/pretraining.py", line 742, in <module>
        main()
      File "/home/ec2-user/SageMaker/MedicalGPT/pretraining.py", line 703, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
        return inner_training_loop(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1904, in _inner_training_loop
        self.accelerator.clip_grad_norm_(
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 2124, in clip_grad_norm_
        self.unscale_gradients()
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 2087, in unscale_gradients
        self.scaler.unscale_(opt)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
        optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
      File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
        raise ValueError("Attempting to unscale FP16 gradients.")
    ValueError: Attempting to unscale FP16 gradients.

Adding the `--modules_to_save embed_tokens,lm_head` argument raises the error above.
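
A note for readers hitting the same `ValueError` (my reading, not a confirmed resolution from this thread): with `--fp16`, `GradScaler.unscale_` refuses gradients that are themselves fp16, and `modules_to_save` turns the fp16 `embed_tokens`/`lm_head` copies into trainable parameters. Common workarounds are training with `--bf16` instead of `--fp16` on hardware that supports it, or upcasting the trainable parameters to fp32 before training:

```python
# Workaround sketch (assumption, not the thread's confirmed fix): under fp16
# AMP, trainable parameters must be kept in fp32 so GradScaler.unscale_ can
# process their gradients; upcast only the trainable ones.
import torch

for param in model.parameters():
    if param.requires_grad and param.dtype == torch.float16:
        param.data = param.data.to(torch.float32)
```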