shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

During SFT, the loss rapidly drops to 0 #252

Closed SoYuCry closed 11 months ago

SoYuCry commented 11 months ago

Describe the Question

Please provide a clear and concise description of what the question is.

adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=30000, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=50, evaluation_strategy=steps, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=True, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=e:/Liuc/llama-main/models_hf/7B-Chat-Finetuned\runs\Nov01_10-38-31_PUERSAI-PC, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=e:/Liuc/llama-main/models_hf/7B-Chat-Finetuned, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=4, per_device_train_batch_size=4, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=e:/Liuc/llama-main/models_hf/7B-Chat-Finetuned, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=steps, save_total_limit=3, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.05, warmup_steps=0, weight_decay=0.05, ) 2023-11-01 10:38:31.355 | INFO | main:main:863 - Script args: ScriptArguments(use_peft=True, target_modules='all', lora_rank=8, lora_dropout=0.05, lora_alpha=16.0, modules_to_save=None, peft_path=None, qlora=False, model_max_length=512) 2023-11-01 10:38:31.355 | INFO | main:main:864 - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: True 2023-11-01 10:38:31.450 | INFO | main:main:892 - Add pad token: 2023-11-01 10:38:31.451 | DEBUG | main:main:894 - Tokenizer: LlamaTokenizer(name_or_path='e:/Liuc/llama-main/models_hf/7B-Chat', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '', 'eos_token': '', 'unk_token': '', 'pad_token': ''}, clean_up_tokenization_spaces=False), 
added_tokens_decoder={ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), }
2023-11-01 10:38:31.452 | INFO | main:main:924 - train files: ['./data/finetune/train\train_data_LLAMA_instruction_alpaca_noinstruct.jsonl']
2023-11-01 10:38:31.453 | INFO | main:main:929 - eval files: ['./data/finetune/test\test_data_LLAMA_instruction_alpaca_noinstruct.jsonl']
2023-11-01 10:38:32.741 | INFO | main:main:950 - Raw datasets: DatasetDict({ train: Dataset({ features: ['conversations'], num_rows: 202532 }), validation: Dataset({ features: ['conversations'], num_rows: 50633 }) })
2023-11-01 10:38:32.742 | DEBUG | main:main:1038 - Example train_dataset[0]: {'conversations': [{'from': 'human', 'value': "As a proficient assistant, ensure to give accurate and comprehensive responses to the specified questions or tasks.Which university received a NIH grant for 'Micro Coherence Imaging Technology for Assessing Obstructive Lung Disease In Vivo' in 2016?"}, {'from': 'gpt', 'value': "Johns Hopkins University received a 2016 NIH grant for 'Micro Coherence Imaging Technology for Assessing Obstructive Lung Disease In Vivo'."}]}

bin D:\anaconda\envs\LLAMA\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
FlashAttention-2 is not installed, ignore this if you are not using FlashAttention.
Running tokenizer on train dataset (num_proc=4): 100%|██████████| 202532/202532 [00:52<00:00, 3828.35 examples/s]
Filter (num_proc=4): 100%|██████████| 202532/202532 [00:18<00:00, 11065.27 examples/s]
2023-11-01 10:39:44.347 | DEBUG | main:main:1049 - Num train_samples: 202532
2023-11-01 10:39:44.348 | DEBUG | main:main:1050 - Tokenized training example:
2023-11-01 10:39:44.352 | DEBUG | main:main:1051 - Decode input_ids[0]: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: You are a capable assistant tasked with delivering precise and well-informed responses to the presented questions or tasks.What methods were used in the study on proliferative diabetic retinopathy? ASSISTANT: Patients were routinely imaged with a standardized PDR-protocol between March 2017 and January 2019. This included a 12x9 mm structural OCT volume centered on the macula and a 6x6 mm OCTA scan centered on the optic nerve head obtained using a Topcon swept-source system. Ultra-widefield fluorescein angiography (FA) was also performed when clinically indicated.
2023-11-01 10:39:44.356 | DEBUG | main:main:1054 - Decode labels[0]: Patients were routinely imaged with a standardized PDR-protocol between March 2017 and January 2019. This included a 12x9 mm structural OCT volume centered on the macula and a 6x6 mm OCTA scan centered on the optic nerve head obtained using a Topcon swept-source system. Ultra-widefield fluorescein angiography (FA) was also performed when clinically indicated.
2023-11-01 10:39:44.357 | DEBUG | main:main:1067 - Example eval_dataset[0]: {'conversations': [{'from': 'human', 'value': 'As a helpful assistant, provide accurate and informative responses to the given questions or tasks based on the provided text. Ensure your answers are precise and to the point.What was the diagnosis of the 16-year-old boy who presented with best-corrected visual acuity of 6/18 OD?'}, {'from': 'gpt', 'value': 'The 16-year-old boy was diagnosed with choroidal osteoma 1 (CO).'}]}

Running tokenizer on validation dataset (num_proc=4): 100%|██████████| 50633/50633 [00:17<00:00, 2832.42 examples/s]
Filter (num_proc=4): 100%|██████████| 50633/50633 [00:07<00:00, 6350.83 examples/s]
2023-11-01 10:40:10.626 | DEBUG | main:main:1077 - Num eval_samples: 50633
2023-11-01 10:40:10.627 | DEBUG | main:main:1078 - Tokenized eval example:
2023-11-01 10:40:10.630 | DEBUG | main:main:1079 - A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: As a helpful assistant, provide accurate and informative responses to the given questions or tasks based on the provided text. Ensure your answers are precise and to the point.What was the diagnosis of the 16-year-old boy who presented with best-corrected visual acuity of 6/18 OD? ASSISTANT: The 16-year-old boy was diagnosed with choroidal osteoma 1 (CO).
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00, 3.13s/it]
2023-11-01 10:40:17.863 | INFO | main:main:1195 - Fine-tuning method: LoRA(PEFT)
2023-11-01 10:40:17.864 | INFO | main:main:1200 - Init new peft model
2023-11-01 10:40:17.865 | INFO | main:main:1209 - Peft target_modules: ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
2023-11-01 10:40:17.866 | INFO | main:main:1210 - Peft lora_rank: 8
trainable params: 19,988,480 || all params: 6,758,404,096 || trainable%: 0.2957573965106688
2023-11-01 10:40:48.923 | INFO | main:main:1256 - Train
2023-11-01 10:40:48.938 | DEBUG | main:main:1259 - Train dataloader example: {'input_ids': tensor([[ 1, 319, 13563, ..., 0, 0, 0], [ 1, 319, 13563, ..., 678, 2, 0], [ 1, 319, 13563, ..., 0, 0, 0], [ 1, 319, 13563, ..., 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 1, 1, 0], [1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0]], device='cuda:0'), 'labels': tensor([[-100, -100, -100, ..., -100, -100, -100], [-100, -100, -100, ..., 678, 2, -100], [-100, -100, -100, ..., -100, -100, -100], [-100, -100, -100, ..., -100, -100, -100]], device='cuda:0')}
2023-11-01 10:40:49.044 | DEBUG | main:main:1260 - Detail input_ids: [tensor([ 1, 319, 13563, ...], device='cuda:0'), tensor([ 1, 319, 13563, ...], device='cuda:0'), tensor([ 1, 319, 13563, ...], device='cuda:0'), ...]
2023-11-01 10:40:49.054 | DEBUG | main:main:1261 - Decode input_ids[0]: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: How does Facebook make money? ASSISTANT: Facebook makes most of its money through advertising. Advertisers pay the company to show ads to its users, based on their interests, behaviors, location, and other data collected by Facebook. This is the reason Facebook is free for its users; the company makes money from the advertisements.

Facebook also earns some revenue from its other services such as Facebook Marketplace, where users can buy and sell products, and Facebook Gaming, where the company takes a cut of the revenue generated from games played on its platform. There are also some other minor sources including premium services for businesses, interest earned by its cash reserve, and selling Oculus VR hardware.

In summary, the majority of Facebook's revenue comes from advertising, and the company's ability to target ads to its billions of users is what makes it one of the most valuable advertising channels in the world.
2023-11-01 10:40:49.091 | DEBUG | main:main:1264 - Decode labels[0]: Facebook makes most of its money through advertising. Advertisers pay the company to show ads to its users, based on their interests, behaviors, location, and other data collected by Facebook. This is the reason Facebook is free for its users; the company makes money from the advertisements.

Facebook also earns some revenue from its other services such as Facebook Marketplace, where users can buy and sell products, and Facebook Gaming, where the company takes a cut of the revenue generated from games played on its platform. There are also some other minor sources including premium services for businesses, interest earned by its cash reserve, and selling Oculus VR hardware.

In summary, the majority of Facebook's revenue comes from advertising, and the company's ability to target ads to its billions of users is what makes it one of the most valuable advertising channels in the world.
0%| | 0/151899 [00:00<?, ?it/s]
D:\anaconda\envs\LLAMA\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn(
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.5725, 'learning_rate': 5.266622778143516e-09, 'epoch': 0.0}
{'loss': 0.9044, 'learning_rate': 1.0533245556287031e-08, 'epoch': 0.0}
{'loss': 0.4325, 'learning_rate': 1.5799868334430548e-08, 'epoch': 0.0}
{'loss': 1.6331, 'learning_rate': 2.6333113890717578e-08, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.6333113890717578e-08, 'epoch': 0.0}
{'eval_loss': nan, 'eval_runtime': 2394.2738, 'eval_samples_per_second': 21.148, 'eval_steps_per_second': 5.287, 'epoch': 0.0}
{'loss': 0.4834, 'learning_rate': 2.8966425279789338e-08, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.8966425279789338e-08, 'epoch': 0.0}
{'loss': 0.8378, 'learning_rate': 3.686635944700461e-08, 'epoch': 0.0}
{'loss': 0.5254, 'learning_rate': 4.2132982225148125e-08, 'epoch': 0.0}
{'loss': 0.2545, 'learning_rate': 4.476629361421988e-08, 'epoch': 0.0}
{'eval_loss': nan, 'eval_runtime': 2418.6764, 'eval_samples_per_second': 20.934, 'eval_steps_per_second': 5.234, 'epoch': 0.0}
{'loss': 0.7535, 'learning_rate': 5.2666227781435155e-08, 'epoch': 0.0}
{'loss': 0.8062, 'learning_rate': 6.056616194865043e-08, 'epoch': 0.0}
{'loss': 0.9459, 'learning_rate': 7.109940750493746e-08, 'epoch': 0.0}
{'loss': 0.9086, 'learning_rate': 7.636603028308098e-08, 'epoch': 0.0}
{'loss': 1.5865, 'learning_rate': 8.689927583936801e-08, 'epoch': 0.0}
{'eval_loss': nan, 'eval_runtime': 2360.6086, 'eval_samples_per_second': 21.449, 'eval_steps_per_second': 5.363, 'epoch': 0.0}
{'loss': 0.6831, 'learning_rate': 9.216589861751152e-08, 'epoch': 0.0}
{'loss': 0.4008, 'learning_rate': 9.743252139565505e-08, 'epoch': 0.0}
{'loss': 0.3915, 'learning_rate': 1.0006583278472681e-07, 'epoch': 0.0}
{'loss': 0.6513, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'eval_loss': nan, 'eval_runtime': 2356.5255, 'eval_samples_per_second': 21.486, 'eval_steps_per_second': 5.372, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.0}
{'eval_loss': nan, 'eval_runtime': 2357.0338, 'eval_samples_per_second': 21.482, 'eval_steps_per_second': 5.371, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'eval_loss': nan, 'eval_runtime': 2356.5559, 'eval_samples_per_second': 21.486, 'eval_steps_per_second': 5.372, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'eval_loss': nan, 'eval_runtime': 2356.1859, 'eval_samples_per_second': 21.489, 'eval_steps_per_second': 5.373, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1.0533245556287031e-07, 'epoch': 0.01}

SoYuCry commented 11 months ago

My parameters are as follows (VS Code launch.json):

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [

    {
        "name": "Python: Current File",
        "type": "python",
        "request": "launch",
        "program": "${file}",
        "console": "integratedTerminal",
        "justMyCode": true,
        "args": [
            "--model_type", "llama",
            "--model_name_or_path", "e:/Liuc/llama-main/models_hf/7B-Chat",
            "--train_file_dir", "./data/finetune/train",
            "--validation_file_dir", "./data/finetune/test",
            "--per_device_train_batch_size", "4",
            "--per_device_eval_batch_size", "4",
            "--do_train",
            "--do_eval",
            "--use_peft", "True",
            "--fp16",
            "--max_train_samples", "-1",
            "--max_eval_samples", "-1",
            "--num_train_epochs", "3",
            "--learning_rate", "2e-5",
            "--warmup_ratio", "0.05",
            "--weight_decay", "0.05",
            "--logging_strategy", "steps",
            "--logging_steps", "10",
            "--eval_steps", "50",
            "--evaluation_strategy", "steps",
            "--save_steps", "500",
            "--save_strategy", "steps",
            "--save_total_limit", "3",
            "--gradient_accumulation_steps", "1",
            "--preprocessing_num_workers", "4",
            "--output_dir", "e:/Liuc/llama-main/models_hf/7B-Chat-Finetuned",
            "--overwrite_output_dir",
            "--ddp_timeout", "30000",
            "--logging_first_step", "True",
            "--target_modules", "all",
            "--lora_rank", "8",
            "--lora_alpha", "16",
            "--lora_dropout", "0.05",
            "--torch_dtype", "bfloat16",
            "--device_map", "auto",
            "--report_to", "tensorboard",
            "--ddp_find_unused_parameters", "False",
            "--gradient_checkpointing", "True"
        ]
    }
    ]
}

Earlier, when I set --torch_dtype to "float16", the loss was 0 right from the very first step.
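Since each of these runs takes hours before the collapse is obvious, one workaround is to abort automatically. This is only a sketch built on the standard transformers TrainerCallback API (the class name and thresholds are mine, not something MedicalGPT provides): it stops training as soon as a logged train loss hits 0.0 or an eval loss comes back NaN.

    import math
    from transformers import TrainerCallback

    class StopOnCollapsedLoss(TrainerCallback):
        """Stop training when the logged train loss collapses to 0.0 or any logged loss is NaN."""
        def on_log(self, args, state, control, logs=None, **kwargs):
            logs = logs or {}
            if state.global_step == 0:
                return control  # ignore the logging_first_step entry
            loss = logs.get("loss")
            eval_loss = logs.get("eval_loss")
            if loss == 0.0 or (loss is not None and math.isnan(loss)) \
                    or (eval_loss is not None and math.isnan(eval_loss)):
                control.should_training_stop = True
            return control

    # Usage (hypothetical): trainer.add_callback(StopOnCollapsedLoss()) before calling trainer.train()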

tszslovewanpu commented 11 months ago

I ran into this problem too. Is there a fix? The loss oscillates at first and then drops to 0, the eval loss is always nan, and once training finishes, inference immediately errors out...
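The inference failure after training is consistent with the LoRA adapter weights themselves having diverged to NaN during the run. A generic diagnostic sketch to confirm this (the adapter path below is hypothetical and depends on your output_dir):

    import torch

    # Hypothetical path: point this at the adapter_model.bin saved inside your output_dir.
    state_dict = torch.load("outputs-sft/adapter_model.bin", map_location="cpu")
    bad = [name for name, t in state_dict.items()
           if torch.isnan(t).any() or torch.isinf(t).any()]
    print(bad if bad else "no NaN/Inf found in adapter weights")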

shibing624 commented 11 months ago

Is this llama-7b chat?

SoYuCry commented 11 months ago

Yes.

tszslovewanpu commented 11 months ago

The same thing happens with llama2 7B base.

SoYuCry commented 11 months ago

To be precise, it's llama2-7B chat.

shibing624 commented 11 months ago

I re-ran it and did not see the loss drop to 0. I ran it on an A100 machine with this script:

CUDA_VISIBLE_DEVICES=1 python supervised_finetuning.py \
    --model_type llama  \
    --model_name_or_path daryl149/llama-2-7b-chat-hf \
    --train_file_dir data/finetune \
    --validation_file_dir data/finetune \
    --template_name vicuna \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --do_eval \
    --use_peft True \
    --max_train_samples -1 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --output_dir outputs-sft-v5-test-llama2 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 --bf16 --optim paged_adamw_32bit \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True --cache_dir ./cache  --neft_alpha 5 --qlora True --load_in_4bit True  --shift_attn True

awmthink commented 11 months ago

When training in half precision (float16), adam_epsilon=1e-08 gets rounded down to 0. Try setting adam_epsilon to 1e-4; with bf16 this should not be a problem.
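A minimal demonstration of that rounding (plain PyTorch, not repo code): float16's smallest positive subnormal is roughly 6e-8, so the default epsilon of 1e-8 underflows to exactly zero, while bfloat16 keeps the float32 exponent range and the suggested 1e-4 is still representable in float16:

    import torch

    eps = 1e-8  # the default adam_epsilon
    print(torch.tensor(eps, dtype=torch.float16).item())    # 0.0 -- underflows below fp16's ~6e-8 subnormal limit
    print(torch.tensor(eps, dtype=torch.bfloat16).item())   # ~1e-8 -- bf16 shares float32's exponent range
    print(torch.tensor(1e-4, dtype=torch.float16).item())   # ~1e-4 -- representable, hence the suggested workaround

With an epsilon of exactly zero, the denominator in Adam's update can itself underflow to zero, which can yield Inf/NaN parameter updates of the kind that show up above as eval_loss: nan.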

shibing624 commented 11 months ago

Use --bf16.
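For reference, a quick generic PyTorch check (not part of this repo's scripts) to confirm the local GPU can actually run bf16 before switching the flags; on cards without bf16 support, raising adam_epsilon under fp16 as suggested above is the fallback:

    import torch

    # bf16 generally requires Ampere (e.g. A100, RTX 30xx) or newer; torch reports support directly.
    print(torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())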