haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #726

Open unmo opened 1 year ago

unmo commented 1 year ago

Describe the issue

Issue: I ran the CLI script with the command below and got the error "RuntimeError: probability tensor contains either inf, nan or element < 0".

The model I am using was trained on custom data. What could be causing this?

Command:

python -m llava.serve.cli \
    --model-path ./checkpoints/llava_vicuna1.5-7b_clip-vit-l-336_task_epoch20 \
    --image-file ./playground/data/LLaVA-Pretrain/images/test2/4.png

Log:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/LLaVA/llava/serve/cli.py", line 125, in <module>
    main(args)
  File "/app/LLaVA/llava/serve/cli.py", line 94, in main
    output_ids = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1588, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2678, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

crazycth commented 12 months ago

+1

LiWentomng commented 12 months ago

@unmo @crazycth Hello, any update? I also encounter this problem.

unmo commented 12 months ago

Sorry, I have not been able to resolve it yet either.

haotian-liu commented 12 months ago

Hi,

  1. The first thing to try is setting --conv-mode llava_v1 to see if that helps.
  2. If not, please share the exact command you ran to train the model, and the wandb link so we can check whether the loss curves look normal.

Junxiao-Ma commented 8 months ago

I'm having the same issue: when I run run_llava.py with the model I trained, I also get this error. I find that the output is all NaN, but I don't know why it's happening.
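
For readers wondering how an all-NaN output connects to the sampling error in the traceback, here is a minimal pure-Python sketch. It is not the transformers or PyTorch code; the helper names `softmax` and `is_valid_distribution` are made up for illustration. It only shows that a single NaN logit turns every probability into NaN after softmax, which is the kind of input `torch.multinomial` rejects.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_valid_distribution(probs):
    """Hypothetical check mirroring the error text: no inf, no nan, no element < 0."""
    return all(math.isfinite(p) and p >= 0 for p in probs)


# Healthy logits give a usable distribution.
healthy = softmax([2.0, 1.0, 0.1])

# One NaN logit makes the normalizing sum NaN, so every probability becomes NaN.
poisoned = softmax([2.0, float("nan"), 0.1])

print(is_valid_distribution(healthy))
print(is_valid_distribution(poisoned))
```

So the error is usually a symptom rather than the root cause: the NaN values are already in the logits before sampling, which is why checking the weights or intermediate activations of the trained model tends to be more informative than looking at the sampling step itself.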

ghazalsaheb commented 3 months ago

Facing the same issue here. The output is NaN although my W&B loss looks fine.

ghazalsaheb commented 3 months ago

Update: I was able to resolve the issue by changing the base model from Hugging Face's "llava-hf/llava-1.5-7b-hf" to "liuhaotian/llava-v1.5-7b". This resolved the NaN issue, and the training performance got much better.