Open z3ugma opened 7 months ago
@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face - any idea what might be causing the loss to go to nan?
@z3ugma I have the same problem. I started getting a nan loss in the 2nd batch of epoch 0. Have you solved it?
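A generic way to pin down where this starts (my addition, not from the thread) is to fail fast on the first non-finite loss so the offending batch can be inspected; `check_finite` is a hypothetical helper name:

```python
# Debugging sketch: raise on the first non-finite loss instead of training on.
import math

def check_finite(loss, step):
    """Accepts a Python float or a 0-d torch tensor (float() works on both)."""
    value = float(loss)
    if not math.isfinite(value):
        raise RuntimeError(f"non-finite loss {value} at step {step}")
    return value

# In the notebook's training loop this would sit right after the forward pass,
# e.g. check_finite(outputs.loss, idx) before calling loss.backward().
print(check_finite(6.078125, 0))
```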
I found an interesting thing: on Google Colab the loss does not go to nan. There still seems to be some difference between Colab and my local notebook.
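If the difference really is package versions, one low-effort check is to dump the same version info in both environments and diff it (a stdlib-only sketch of my own; `pkg_version` is just a hypothetical helper):

```python
# Print the versions this thread keeps comparing, from one place.
import sys
from importlib.metadata import PackageNotFoundError, version

def pkg_version(name):
    """Installed version of `name`, or 'not installed' if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

print("Python", sys.version)
for pkg in ("torch", "datasets", "transformers", "peft", "bitsandbytes"):
    print(pkg, pkg_version(pkg))
```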
I also ran into this issue recently with finetuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package version issue with something introducing a breaking change?
Rolling back to peft==0.5.0 got the BLIP-2 example working for me.
@jeffliu-LL which pytorch version are you using?
pytorch 2.0.1 with pytorch-cuda 11.8
I will try rolling back to peft 0.5 with cuda 12.2 and Python 3.11.
Will report back
No, still a problem:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.1.2+cu121
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0
Still all nan after downgrading PyTorch and PEFT, unfortunately:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.36.2
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0
@jeffliu-LL could you post the versions of Python, PyTorch, Transformers, and CUDA from your working environment?
Here are the packages from the working Google Colab environment:
Working:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
PEFT 0.5.0
Still not working on Python 3.10. Here are the version details from another non-working environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
PEFT 0.5.0
SciPy 1.11.4
Pillow 9.4.0
Seems like we're hitting the same problem :(
OS: Windows 10
CUDA: 11.8
Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] on win32
Torch 2.1.2+cu118
Datasets 2.16.1
Transformers 4.36.2
PEFT 0.7.1
bitsandbytes 0.41.0
@z3ugma I'm hitting a similar issue. The loss changes to nan after epoch 0. Have you fixed it?
dataset: jpawan33/kag100-image-captioning-dataset
PyTorch: 1.13.0
CUDA: 11.3
Python: 3.9
PEFT: 0.7.2.dev0
Transformers: 4.36.2
@wushandinghua no, I've not yet had success
Any solutions? I have the same problem.
I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.
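For context on why the dtype change can matter (my addition, not from the thread): the largest finite float16 value is 65504, so a large intermediate can overflow to inf, and inf - inf (as can happen inside a normalization or loss reduction) yields nan. A minimal demonstration with NumPy's half-precision type:

```python
# float16 overflow producing inf, and inf - inf producing nan.
import numpy as np

x = np.float16(60000.0)
big = x * np.float16(2.0)       # 120000 exceeds the float16 max (65504)
print(big)                       # inf
print(big - big)                 # nan
print(np.float32(60000.0) * 2)   # the same arithmetic is fine in float32
```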
Change this line: pixel_values = batch.pop("pixel_values").to(device, torch.float32)
I still have the same problem.
peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
This notebook: https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb
trains fine on Google Colab (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe)
using Python 3.10.12 and Torch 2.1.0.
It does not train on my workstation - the loss collapses to NaN after just a few epochs:
Loss: 6.078125
['a soccer player with his arms up in the air\n', 'mario balotelli celebrates after scoring against juventus\n']
Loss: 3.630859375
Loss: 4.01171875
Epoch: 2
Loss: 4.48046875
['cristiano ronaldo is the most expensive player in the world\n', 'a soccer player with his arms raised in celebration\n']
Loss: 3.25
Loss: 4.2734375
Epoch: 3
Loss: 4.0625
['a bald soccer player with a white shirt and blue shorts\n', "Juventus' Mario Mandzukic celebrates after scoring against Barcelona\n"]
Loss: 3.01953125
Loss: nan
Epoch: 4
Loss: nan
My workstation is also using Python 3.10.13 and Torch 2.1.0. What could be causing the loss to be all nan?
Can you send me a copy of the code you run locally? I'd also like to try it on my own computer instead of using Google Colab. Thank you very much!
Sorry for the late reply. I also changed the model dtype from torch.float16 to torch.float32. There need to be two modifications to the code:
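A sketch of what those two modifications presumably look like (a hedged guess on my part, since the commenter did not post a diff; a dummy tensor stands in for real pixel values so the snippet runs without downloading BLIP-2):

```python
# Hypothetical reconstruction of the two float16 -> float32 edits.
import torch

# Modification 1, in the model-loading cell, presumably:
#   model = Blip2ForConditionalGeneration.from_pretrained(
#       "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float32)  # was float16
# Modification 2, in the training loop: cast the pixel values to float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = {"pixel_values": torch.randn(2, 3, 224, 224, dtype=torch.float16)}
pixel_values = batch.pop("pixel_values").to(device, torch.float32)
print(pixel_values.dtype)
```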
I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.
Okay, I will deploy it to my own PyCharm for experimentation