Open z3ugma opened 7 months ago
@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face - any idea what might be causing the loss to go to nan?
@z3ugma I have the same problem. I started getting a nan loss in the 2nd batch of epoch 0. Have you solved it?
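A generic way to pin down where this starts (my addition, not from the thread) is to fail fast on the first non-finite loss so the offending batch can be inspected; `check_finite` is a hypothetical helper name:

```python
# Debugging sketch: raise on the first non-finite loss instead of training on.
import math

def check_finite(loss, step):
    """Accepts a Python float or a 0-d torch tensor (float() works on both)."""
    value = float(loss)
    if not math.isfinite(value):
        raise RuntimeError(f"non-finite loss {value} at step {step}")
    return value

# In the notebook's training loop this would sit right after the forward pass,
# e.g. check_finite(outputs.loss, idx) before calling loss.backward().
print(check_finite(6.078125, 0))
```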
I found an interesting thing: on Google Colab the loss does not go to nan. There still seems to be some difference between Colab and my local notebook.
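If the difference really is package versions, one low-effort check is to dump the same version info in both environments and diff it (a stdlib-only sketch of my own; `pkg_version` is just a hypothetical helper):

```python
# Print the versions this thread keeps comparing, from one place.
import sys
from importlib.metadata import PackageNotFoundError, version

def pkg_version(name):
    """Installed version of `name`, or 'not installed' if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

print("Python", sys.version)
for pkg in ("torch", "datasets", "transformers", "peft", "bitsandbytes"):
    print(pkg, pkg_version(pkg))
```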
I also ran into this issue recently with finetuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package version issue with something introducing a breaking change?
Rolling back to peft==0.5.0 got the BLIP-2 example working for me.
@jeffliu-LL which pytorch version are you using?
pytorch 2.0.1 with pytorch-cuda 11.8
I will try rolling back to peft 0.5 with cuda 12.2 and Python 3.11.
Will report back
No, still a problem:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.1.2+cu121
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0
Still all nan after downgrading PyTorch and PEFT, unfortunately:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.36.2
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0
@jeffliu-LL could you post the versions of Python, PyTorch, Transformers, and CUDA from your working environment?
Here are the packages from the working Google Colab environment:
Working:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
PEFT 0.5.0
Still not working on Python 3.10. Here are the version details from another non-working environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
PEFT 0.5.0
SciPy 1.11.4
Pillow 9.4.0
Seems like we're hitting the same problem :(
OS: Windows 10
CUDA: 11.8
Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] on win32
Torch 2.1.2+cu118
Datasets 2.16.1
Transformers 4.36.2
PEFT 0.7.1
bitsandbytes 0.41.0
@z3ugma I'm hitting a similar issue. The loss changes to nan after epoch 0. Have you fixed it?
dataset: jpawan33/kag100-image-captioning-dataset
PyTorch: 1.13.0
CUDA: 11.3
Python: 3.9
PEFT: 0.7.2.dev0
Transformers: 4.36.2
@wushandinghua no, I've not yet had success
Any solutions? I have the same problem.
I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.
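For context on why the dtype change can matter (my addition, not from the thread): the largest finite float16 value is 65504, so a large intermediate can overflow to inf, and inf - inf (as can happen inside a normalization or loss reduction) yields nan. A minimal demonstration with NumPy's half-precision type:

```python
# float16 overflow producing inf, and inf - inf producing nan.
import numpy as np

x = np.float16(60000.0)
big = x * np.float16(2.0)       # 120000 exceeds the float16 max (65504)
print(big)                       # inf
print(big - big)                 # nan
print(np.float32(60000.0) * 2)   # the same arithmetic is fine in float32
```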
Change this line: pixel_values = batch.pop("pixel_values").to(device, torch.float32)
I still have the same problem.
peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
This notebook: https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
trains fine on Google Colab (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe)
using Python 3.10.12 and Torch 2.1.0.
It does not train on my workstation - the loss collapses to NaN after just a few epochs:
Loss: 6.078125
['a soccer player with his arms up in the air\n', 'mario balotelli celebrates after scoring against juventus\n']
Loss: 3.630859375
Loss: 4.01171875
Epoch: 2
Loss: 4.48046875
['cristiano ronaldo is the most expensive player in the world\n', 'a soccer player with his arms raised in celebration\n']
Loss: 3.25
Loss: 4.2734375
Epoch: 3
Loss: 4.0625
['a bald soccer player with a white shirt and blue shorts\n', "Juventus' Mario Mandzukic celebrates after scoring against Barcelona\n"]
Loss: 3.01953125
Loss: nan
Epoch: 4
Loss: nan
My workstation is also using Python 3.10.13 and Torch 2.1.0. What could be causing the loss to be all nan?
Can you send me a copy of the code you run locally? I'd also like to try it on my own computer instead of using Google Colab. Thank you very much!
Sorry for the late reply. I also changed the model dtype from torch.float16 to torch.float32. There need to be two modifications to the code:
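A sketch of what those two modifications presumably look like (a hedged guess on my part, since the commenter did not post a diff; a dummy tensor stands in for real pixel values so the snippet runs without downloading BLIP-2):

```python
# Hypothetical reconstruction of the two float16 -> float32 edits.
import torch

# Modification 1, in the model-loading cell, presumably:
#   model = Blip2ForConditionalGeneration.from_pretrained(
#       "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float32)  # was float16
# Modification 2, in the training loop: cast the pixel values to float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {"pixel_values": torch.randn(2, 3, 224, 224, dtype=torch.float16)}
pixel_values = batch.pop("pixel_values").to(device, torch.float32)
print(pixel_values.dtype)
```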
I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.
Okay, I will deploy it to my own PyCharm for experimentation