thisserand / alpaca-lora-finetune-language

AssertionError: No inf checks were recorded for this optimizer. #4

Open erjieyong opened 1 year ago

erjieyong commented 1 year ago

First of all, a big thank you for posting the article and YouTube video; it was very insightful!

I've tried to run the code from your article, but I keep hitting the same assertion error. Any advice?

Note that I have been running your code on Colab with the free GPU (Python 3.9.16, CUDA 11.8.89).

I've also noted the bug you faced and ran the same command (edited for Colab):

    cp /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cpu.so

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
[... several similar UserWarnings about non-existent path entries (Colab proxy/tunnel settings, /env/python, ipykernel backend) ...]
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-04-09 06:56:19.227834: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/translated_tasks_de_deepl_4k (1).json
output_dir: ./lora-alpaca
batch_size: 128
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
resume_from_checkpoint: None

Overriding torch_dtype=None with torch_dtype=torch.float16 due to requirements of bitsandbytes to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100% 33/33 [01:11<00:00, 2.15s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...
Downloading data files: 100% 1/1 [00:00<00:00, 3246.37it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 64.22it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-964d8b7c1c693dbd/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.
100% 1/1 [00:00<00:00, 813.01it/s]
trainable params: 0 || all params: 6755192832 || trainable%: 0.0
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  0% 0/45 [00:00<?, ?it/s]

Traceback (most recent call last):
  /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:237 in <module>
      fire.Fire(train)
  /usr/local/lib/python3.9/dist-packages/fire/core.py:141 in Fire
      component_trace = _Fire(component, args, parsed_flag_args, context, ...)
  /usr/local/lib/python3.9/dist-packages/fire/core.py:475 in _Fire
      component, remaining_args = _CallAndUpdateTrace(...)
  /usr/local/lib/python3.9/dist-packages/fire/core.py:691 in _CallAndUpdateTrace
      component = fn(*varargs, **kwargs)
  /content/gdrive/MyDrive/LTA_guanacos/alpaca-lora/finetune_language.py:206 in train
      trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train
      return inner_training_loop(args=args, resume_from_checkpoint=resume_from_checkpoint, trial=trial, ...)
  /usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1991 in _inner_training_loop
      self.scaler.step(self.optimizer)
  /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:368 in step
      assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

Sekhar-jami commented 1 year ago

Facing the same error, were you able to resolve the issue?

diogopublio commented 1 year ago

Same error here. Any tips on fixing it would be great.

nishantb06 commented 1 year ago

Same error. Possible solutions?

d4nielmeyer commented 1 year ago

Same error. I'd be grateful for any ideas.

Kraegge commented 1 year ago

I have the same problem.

erjieyong commented 1 year ago

Based on the example code given, I believe most of you will have seen the same log message showing that 0 parameters are trainable, which is what leads to the inf-check error: with no trainable parameters there are no gradients, so the GradScaler never records any inf checks and the assertion fires.

trainable params: 0 || all params: 6755192832 || trainable%: 0.0

I managed to find another way to load and train from an existing adapter: using resume_from_checkpoint in the original alpaca-lora GitHub repo.

What you need to do is 1) use the alpaca-lora repo instead, and 2) instead of adjusting any code in finetune.py, just pass in your downloaded adapter path as a parameter. Example:

python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --resume_from_checkpoint='./alpaca-lora-7b'
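For context, the resume_from_checkpoint path in alpaca-lora's finetune.py loads the adapter weights into the PEFT model that was just created with get_peft_model, so the LoRA parameters stay trainable and the checkpoint only supplies their starting values. The snippet below is a simplified sketch from memory, not the exact upstream code, and it relies on the script's existing model and resume_from_checkpoint variables:

import os
import torch
from peft import set_peft_model_state_dict

if resume_from_checkpoint:
    # Prefer a full training checkpoint; otherwise fall back to a bare LoRA adapter file.
    checkpoint_name = os.path.join(resume_from_checkpoint, "pytorch_model.bin")
    if not os.path.exists(checkpoint_name):
        checkpoint_name = os.path.join(resume_from_checkpoint, "adapter_model.bin")
        resume_from_checkpoint = False  # adapter only, so the Trainer starts a fresh run
    if os.path.exists(checkpoint_name):
        adapters_weights = torch.load(checkpoint_name)
        set_peft_model_state_dict(model, adapters_weights)  # load adapter weights in place

This is different from PeftModel.from_pretrained, which marks the loaded adapter as inference-only and is what produces the "trainable params: 0" line above.
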
AzizovAlisher commented 1 year ago

@erjieyong That's not working; it starts fine-tuning LLaMA instead of Alpaca. Any other ways?

AzizovAlisher commented 1 year ago

@erjieyong I have to point out, though, that it does resolve the problem of 0 trainable params, just not in the right way.

seyyedaliayati commented 1 year ago

Based on the example code given, I believe most of you will have seen the same log message showing that 0 parameters are trainable, which is what leads to the inf-check error

trainable params: 0 || all params: 6755192832 || trainable%: 0.0

I managed to find another way to load and train from an existing adapter: using resume_from_checkpoint in the original alpaca-lora GitHub repo.

What you need to do is

  1. use the alpaca-lora repo instead
  2. instead of adjusting any code in finetune.py, just pass in your downloaded adapter path as a parameter. Example:
python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8 \
    --resume_from_checkpoint='./alpaca-lora-7b'

Do you mean replacing the base model with Alpaca? Because this command fine-tunes LLaMA. @erjieyong

d4nielmeyer commented 1 year ago

Finally I solved it by initializing model = get_peft_model(model, config) after model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16) and config = LoraConfig(...). So don't comment out the config. This worked quite well for me.

fredi-python commented 1 year ago

@d4nielmeyer could you make a pull request?

fredi-python commented 1 year ago

@d4nielmeyer Or could you post the code here, please?

d4nielmeyer commented 1 year ago

finetune.py

# Imports needed for this excerpt (assumed; adjust to your script):
# import torch
# from transformers import LlamaForCausalLM
# from peft import PeftModel, LoraConfig, get_peft_model, prepare_model_for_int8_training

[...]
# Load the base model in 8-bit.
model = LlamaForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
[...]
model = prepare_model_for_int8_training(model)

# Load the existing LoRA adapter on top of the base model.
# Note: PeftModel.from_pretrained puts the adapter into inference mode, so by itself it is frozen.
LORA_WEIGHTS = "tloen/alpaca-lora-7b"
model = PeftModel.from_pretrained(
    model,
    LORA_WEIGHTS,
    torch_dtype=torch.float16,
)

# Re-wrap with a fresh LoraConfig so the LoRA layers become trainable again.
config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    target_modules=lora_target_modules,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
[...]
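If you want a quick sanity check before launching a full run, PEFT models expose print_trainable_parameters(); after the get_peft_model call above it should report a non-zero number of trainable params (this check is my own addition, not part of the original snippet):

model.print_trainable_parameters()
# should now print a non-zero "trainable params" count,
# instead of the "trainable params: 0" seen in the log above
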
seyyedaliayati commented 1 year ago

config = LoraConfig(...). So don't comment out the config. Worked quite well for me.

@d4nielmeyer Thanks! Issue solved for me.

AhmedSSoliman commented 1 year ago

In the parameters inside LoraConfig, I think you are setting inference_mode=True. Change inference_mode to False, as in the following example:

config = LoraConfig(
    peft_type="LORA",
    r=8,
    lora_alpha=32,
    inference_mode=False,
    target_modules=["q_proj", "v_proj", "out_proj", "fc1", "fc2", "lm_head"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
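For what it's worth, that is consistent with the root cause in this thread: with inference_mode=True (which is also what PeftModel.from_pretrained sets on a loaded adapter), the LoRA weights are frozen, so training sees 0 trainable params. A quick, illustrative sanity check on any PEFT-wrapped model (the variable name model is just a placeholder):

# Count the tensors that will actually receive gradients.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"trainable tensors: {len(trainable)}")  # 0 here reproduces the AssertionError above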