Hello, when the training environment is AMD ROCm, running run_pt.sh fails with the following error:
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
The detailed error log is as follows:
2024-05-08 08:32:59.501 | INFO | main:main:381 - Script args: ScriptArguments(use_peft=True, target_modules='all', lora_rank=8, lora_dropout=0.05, lora_alpha=16.0, modules_to_save=None, peft_path=None, qlora=False)
2024-05-08 08:32:59.501 | INFO | main:main:382 - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
2024-05-08 08:33:00.792 | INFO | main:main:492 - train files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:00.792 | INFO | main:main:502 - eval files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:01.847 | INFO | main:main:534 - Raw datasets: DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 3876
})
validation: Dataset({
features: ['text'],
num_rows: 3876
})
})
2024-05-08 08:33:02.298 | DEBUG | main:main:597 - Num train_samples: 1230
2024-05-08 08:33:02.298 | DEBUG | main:main:598 - Tokenized training example:
2024-05-08 08:33:02.300 | DEBUG | main:main:599 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
2024-05-08 08:33:02.301 | DEBUG | main:main:611 - Num eval_samples: 10
2024-05-08 08:33:02.301 | DEBUG | main:main:612 - Tokenized eval example:
2024-05-08 08:33:02.303 | DEBUG | main:main:613 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 780, in <module>
main()
File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 660, in main
model = model_class.from_pretrained(
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3609, in from_pretrained
max_memory = get_balanced_memory(
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 910, in get_balanced_memory
max_memory = get_max_memory(max_memory)
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in get_max_memory
max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in <dictcomp>
max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
File "/home/lyrccla/pytorch/torch/cuda/memory.py", line 663, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
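The traceback ends inside `torch.cuda.mem_get_info(i)`, which accelerate calls for every index in `range(torch.cuda.device_count())`. To check whether the failure is tied to a particular device index rather than to the training script, that call can be probed on its own. The following is only a minimal sketch: `probe_devices` and `run_probe_with_torch` are hypothetical helper names and are not part of MedicalGPT, accelerate, or PyTorch. It assumes the HIP error surfaces as a Python `RuntimeError`, as it does in the log above.

```python
from typing import Callable, Dict, Tuple, Union

# Either (free_bytes, total_bytes) or a string describing the failure.
ProbeResult = Union[Tuple[int, int], str]


def probe_devices(
    device_count: int,
    mem_get_info: Callable[[int], Tuple[int, int]],
) -> Dict[int, ProbeResult]:
    """Call mem_get_info(i) for each device index and record the outcome.

    Unlike the dict comprehension in the traceback, a failure on one
    index does not stop the loop, so every index gets reported.
    """
    results: Dict[int, ProbeResult] = {}
    for i in range(device_count):
        try:
            free_bytes, total_bytes = mem_get_info(i)
            results[i] = (free_bytes, total_bytes)
        except RuntimeError as exc:
            results[i] = f"failed: {exc}"
    return results


def run_probe_with_torch() -> None:
    """Run the probe against the real PyTorch build (hypothetical helper)."""
    import torch  # imported lazily so probe_devices works without torch

    count = torch.cuda.device_count()
    print("visible devices:", count)
    for index, outcome in probe_devices(count, torch.cuda.mem_get_info).items():
        print(index, outcome)
```

Calling `run_probe_with_torch()` in the same conda environment, with the same `HIP_VISIBLE_DEVICES` value as in run_pt.sh, should show which device index, if any, triggers the error.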
Does this mean the model cannot run on the ROCm platform?
Thank you.
Contents of run_pt.sh:
HIP_VISIBLE_DEVICES=0 python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-qwen-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache