Closed yyz-selfiie closed 10 months ago
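Additional note: I ran incremental pre-training on meditron 7b (https://github.com/epfLLM/meditron), a model that was itself pre-trained on medical data on top of Llama 2 7B. Several new tokens were added in the process; they are listed in added_tokens.json. When I then run SFT on top of this model, I get the error above: /opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed. Is this still caused by a tokenizer problem in the first (pre-training) step? Since I am only debugging, all training runs used the default data in the data folder.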
You added tokens, so you need to resize the model's embeddings.
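For context, the assertion `srcIndex < srcSelectDimSize` is raised by an index-select kernel, which is what an embedding lookup uses, when an index points past the end of the table. If tokens were added to the tokenizer but the embedding matrix kept its old size, the new token IDs fall outside the table. A minimal sketch of that check in plain Python follows; the function name `find_out_of_range_ids`, the 32000-row table size (the Llama 2 default vocabulary) and the sample IDs are illustrative assumptions, not values taken from this run.

```python
from typing import Iterable, List


def find_out_of_range_ids(token_ids: Iterable[int], embedding_rows: int) -> List[int]:
    """Return the distinct token IDs that have no row in the embedding table."""
    return sorted({i for i in token_ids if i < 0 or i >= embedding_rows})


# Hypothetical numbers: a 32000-row embedding table and a tokenizer
# that gained extra tokens with IDs starting at 32000.
EMBEDDING_ROWS = 32000
batch = [1, 529, 31999, 32000, 32002, 32000]

bad_ids = find_out_of_range_ids(batch, EMBEDDING_ROWS)
print(bad_ids)  # any ID listed here would trip the device-side assert
```

Running a check like this over a tokenized batch before training would show whether the tokenizer produces IDs the model cannot look up.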
Thanks. Do I need to add this argument: --modules_to_save embed_tokens,lm_head ?
pretraining.py:
modules_to_save = script_args.modules_to_save
if modules_to_save is not None:
    modules_to_save = modules_to_save.split(',')
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))
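What resizing means for the lookup table can be sketched with a bare `torch.nn.Embedding`. This is a conceptual stand-in under stated assumptions, not the transformers implementation: the helper `grow_embedding` and the tiny sizes (8 rows grown to 11) are hypothetical, and in the script quoted above the real call is `model.resize_token_embeddings(len(tokenizer))`.

```python
import torch
from torch import nn


def grow_embedding(old: nn.Embedding, new_rows: int) -> nn.Embedding:
    """Return an embedding with `new_rows` rows that keeps the old rows unchanged."""
    if new_rows <= old.num_embeddings:
        return old  # the table already covers every ID
    new = nn.Embedding(new_rows, old.embedding_dim)
    with torch.no_grad():
        # Copy the trained rows; the extra rows keep their fresh random init.
        new.weight[: old.num_embeddings].copy_(old.weight)
    return new


old = nn.Embedding(8, 4)        # stands in for the original vocabulary
new = grow_embedding(old, 11)   # three hypothetical added tokens

ids = torch.tensor([0, 7, 10])  # ID 10 exists only after resizing
out = new(ids)
print(tuple(out.shape))
```

The added rows start untrained, which is presumably why the thread pairs resizing with `--modules_to_save embed_tokens,lm_head`: with a PEFT adapter, those modules have to be trainable and saved for the new tokens to learn useful vectors.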
0%| | 0/656 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/MedicalGPT/pretraining.py", line 742, in
Adding the argument --modules_to_save embed_tokens,lm_head causes an error:
FlashAttention-2 is not installed, ignore this if you are not using FlashAttention.
2023-11-29 18:13:51.851 | WARNING | __main__:__post_init__:206 - You may set max_train_samples = -1 to run all samples in production.
2023-11-29 18:13:52.656 | INFO | __main__:main:880 - Model args: ModelArguments(model_type='llama', model_name_or_path='merged-pt', load_in_8bit=False, load_in_4bit=False, tokenizer_name_or_path=None, cache_dir=None, use_fast_tokenizer=False, torch_dtype='float16', device_map='auto', trust_remote_code=True, rope_scaling=None, flash_attn=False, shift_attn=False, neft_alpha=0)
2023-11-29 18:13:52.656 | INFO | __main__:main:881 - Data args: DataArguments(dataset_name=None, dataset_config_name=None, train_file_dir='./data/finetune', validation_file_dir='./data/finetune', template_name='vicuna', max_train_samples=1000, max_eval_samples=10, ignore_pad_token_for_loss=True, overwrite_cache=False, validation_split_percentage=1, preprocessing_num_workers=1)
2023-11-29 18:13:52.657 | INFO | __main__:main:882 - Training args: Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=30000,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=50,
evaluation_strategy=steps,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=True,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=outputs-sft-v1/runs/Nov29_18-13-51_ip-172-16-58-143.us-east-2.compute.internal,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=1.0,
optim=adamw_torch,
optim_args=None,
output_dir=outputs-sft-v1,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=outputs-sft-v1,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=steps,
save_total_limit=3,
seed=42,
skip_memory_metrics=True,
sortish_sampler=False,
split_batches=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.05,
warmup_steps=0,
weight_decay=0.05,
)
...
0%| | 0/993 [00:00<?, ?it/s]
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [38,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [38,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [38,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: ...
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
/opt/conda/conda-bld/pytorch_1686274778240/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [58,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/MedicalGPT/supervised_finetuning.py", line 1346, in
    main()
  File "/home/ec2-user/SageMaker/MedicalGPT/supervised_finetuning.py", line 1307, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
    return model_forward(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/peft_model.py", line 1003, in forward
    return self.base_model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 107, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
    outputs = self.model(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 886, in forward
    attention_mask = _prepare_4d_causal_attention_mask(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 193, in _prepare_4d_causal_attention_mask
    attention_mask = attn_mask_converter.to_4d(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 101, in to_4d
    causal_4d_mask = self._make_causal_mask(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 131, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
0%| | 0/993 [00:03<?, ?it/s]
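As the log itself warns, CUDA errors are reported asynchronously, so the Python frame shown (`_make_causal_mask`) is not necessarily where the fault occurred. Besides setting `CUDA_LAUNCH_BLOCKING=1`, one way to get a readable message is to repeat the lookup on the CPU, where an out-of-range index raises an ordinary Python exception. The sketch below assumes a toy 8-row table; the helper `lookup_error_on_cpu` is hypothetical and only illustrates the idea.

```python
import torch
from torch import nn


def lookup_error_on_cpu(embedding_rows: int, token_id: int) -> str:
    """Run one embedding lookup on the CPU; return the error text, or "" on success."""
    table = nn.Embedding(embedding_rows, 4)
    try:
        table(torch.tensor([token_id]))
    except IndexError as exc:
        # On the CPU the failure surfaces immediately at the faulty call.
        return str(exc)
    return ""


# ID 8 is one past the last valid row (0..7) of an 8-row table.
message = lookup_error_on_cpu(embedding_rows=8, token_id=8)
print(message or "lookup succeeded")
```

If the CPU run fails in the embedding lookup, that points to a vocabulary size mismatch rather than to the attention-mask code named in the CUDA traceback.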