ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
https://ollama.com
MIT License
95.13k stars · 7.53k forks

Moondream fails at some images, unexpected output/messages? #6365

Closed: carlosforster closed this issue 2 months ago

carlosforster commented 2 months ago

What is the issue?

I'm receiving this message when running moondream on a few jpg images:

[GIN] 2024/08/14 - 22:17:49 | 200 | 407.4667ms | 127.0.0.1 | POST "/api/chat" time=2024-08-14T22:18:35.703-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"

The terminal output is something like: [0.64, 0.5, 0.72, 0.58] instead of a textual description.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

rick-github commented 2 months ago

The warning appears because the server is not configured with OLLAMA_NUM_PARALLEL=1, but it doesn't affect the operation of the model.

Does it always fail on the same images? Can you post an example of an image that fails? How do you call the model - CLI, API, python client, etc? Can you provide code that can be run that replicates the fault? Can you provide server logs for a failed inference attempt?

carlosforster commented 2 months ago

Thanks for your interest.

It happens deterministically: only a few images fail, and they always fail. It fails with both the CLI and ollama-python. In the case of my Python script, it stalls, probably because of the unexpected output format.

Windows session

Microsoft Windows [Version 10.0.22631.4037]
(c) Microsoft Corporation. All rights reserved.

D:\ImageAnot>ollama run moondream
>>> IMG-20220126-WA0000.jpg

>>> "IMG-20220126-WA0000.jpg"

 [0.64, 0.5, 0.72, 0.58]

>>> "IMG-20220330-WA0001.jpg"

 [0.63, 0.49, 0.73, 0.57]

>>> /bye

Example image files that result in failure:

IMG-20220126-WA0000

IMG-20220330-WA0001

server log

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 245 tensors from .... (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = moondream2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2048
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi2.block_count u32              = 24
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type q4_0:   97 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special tokens cache size = 944
llm_load_vocab: token to piece cache size = 0.3151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 788.55 MiB (4.66 BPW) 
llm_load_print_meta: general.name     = moondream2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =    56.25 MiB
llm_load_tensors:      CUDA0 buffer size =   732.30 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   160.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 921
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="6788" timestamp=1723766686
time=2024-08-15T21:04:46.557-03:00 level=INFO source=server.go:632 msg="llama runner started in 29.03 seconds"
[GIN] 2024/08/15 - 21:04:46 | 200 |   29.8517702s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:06:03.612-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:06:04 | 200 |    526.0926ms |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:06:12.622-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:06:14 | 200 |    1.5735209s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:06:34.538-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:06:36 | 200 |    2.0112613s |       127.0.0.1 | POST     "/api/chat"
carlosforster commented 2 months ago

It looks like Moondream may output a bounding box instead of a description. I wasn't expecting that! Maybe it is just a matter of handling this other operation type and figuring out how to make Moondream understand that I want a description.

Additionally, it looks like the warning is unrelated to the strange behavior of the model as you mentioned.
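For a script consuming moondream's output, a small guard can detect when the model returned bounding-box coordinates rather than prose. This is a minimal sketch, not part of ollama itself; the pattern of four comma-separated floats in square brackets is an assumption based on the outputs shown above:

```python
import re

# Matches outputs like "[0.64, 0.5, 0.72, 0.58]": four comma-separated
# numbers in square brackets, as seen in the failing CLI sessions above.
BBOX_RE = re.compile(r'^\s*\[\s*\d*\.?\d+(\s*,\s*\d*\.?\d+){3}\s*\]\s*$')

def looks_like_bbox(output: str) -> bool:
    """Heuristic: True if the model emitted coordinates, not a description."""
    return bool(BBOX_RE.match(output))
```

A caller could retry with an explicit "describe this image" prompt whenever this returns True.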

More tests

Well, I tried adding a prompt message, but the behavior is still strange: it gives the same description for both text images. At least I get a description output now, so my script should not crash.

>>> Describe the image "IMG-20220126-WA0000.jpg"

The image is a blurry photo of an outdoor scene, possibly taken during sunset or dusk. The focus appears to be on
the sky and the surrounding environment rather than the main subject in the frame.

>>> "IMG-20220126-WA0000.jpg"

The image shows a blurry picture of a sunset over an outdoor scene, with the sun setting behind some trees or
other objects. The focus is on the sky and the surrounding environment rather than the main subject in the frame.

>>> Describe the image "IMG-20220330-WA0001.jpg"

The image shows a blurry picture of an outdoor scene, possibly taken during sunset or dusk. The focus appears to
be on the sky and the surrounding environment rather than the main subject in the frame.

Server log

time=2024-08-15T21:32:29.780-03:00 level=INFO source=server.go:632 msg="llama runner started in 6.36 seconds"
[GIN] 2024/08/15 - 21:32:29 | 200 |    7.2102511s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:32:46.241-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:32:49 | 200 |    3.6831915s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:33:18.287-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:33:22 | 200 |    4.1394721s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-15T21:34:18.586-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/15 - 21:34:22 | 200 |    3.8304122s |       127.0.0.1 | POST     "/api/chat"
rick-github commented 2 months ago

The ollama CLI recognizes that an image is attached by detecting filepaths in the prompt. A bare filename (with no path prefix) is treated as just an arbitrary string:

$ ollama run moondream:1.8b-v2-q4_0
>>> "IMG-20220330-WA0001.jpg"

 [0.62, 0.5, 0.69, 0.56]

>>> describe ./IMG-20220330-WA0001.jpg
Added image './IMG-20220330-WA0001.jpg'

The image is a newspaper article in Spanish that provides information about the registration of JBS, a Brazilian bank. The headline reads "JBS Registra Lucro de RS 20,5 bilhos em 2021 e faz o melhor ano da historia", which translates to "JBS registered $20,5 bilhos in 2021". The article also includes details about the company's financial situation and a list of their assets.

The text is written in Spanish, with some words appearing in both English and Portuguese. The layout of the newspaper page features a header at the top that reads "JBS Registra Lucro de RS 20,5 bilhos em 2021 e faz o melhor ano da historia", followed by a list of assets below it.
rick-github commented 2 months ago

Calling via the API seems to work as well, so if ollama-python is having issues, it might be a different problem.

echo '{"model": "moondream:1.8b-v2-q4_0","messages":[{"role":"user","content":"describe this image","images":["'"$(base64 IMG-20220330-WA0002.jpg)"'"]}],"stream":false}' | curl -s http://localhost:11434/api/chat -d @- | jq 
{
  "model": "moondream:1.8b-v2-q4_0",
  "created_at": "2024-08-16T01:08:40.0730596Z",
  "message": {
    "role": "assistant",
    "content": "\nThe image is a screenshot of an article from a newspaper, specifically the A Itudera do Futuro da Industria. The text on the screen reads \"A Atividade de detecao e futuro da industria\", which translates to \"An industrial action and future development\". The article is displayed in two columns with the left column being slightly larger than the right one, creating a balanced layout.\n\nThe newspaper's logo can be seen at the bottom of the image, indicating that it is a reliable source for news and information."
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 8297374661,
  "load_duration": 11746213,
  "prompt_eval_duration": 5410189000,
  "eval_count": 112,
  "eval_duration": 2800432000
}
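The same request can be built from Python with only the standard library. This sketch mirrors the curl call above (the model tag and prompt are taken from that example); the resulting JSON string would be POSTed to http://localhost:11434/api/chat:

```python
import base64
import json

def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build the JSON body for ollama's /api/chat endpoint with one image.

    Images are sent base64-encoded in the "images" list of the user
    message, exactly as the curl example does with $(base64 ...).
    """
    return json.dumps({
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    })
```

Sending raw bytes through the API this way sidesteps the CLI's filepath detection entirely, which helps rule it out as the source of a failure.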
carlosforster commented 2 months ago

Thank you very much Sir!!

Well, the CLI worked for the first file, but not for the second:

>>> "./IMG-20220126-WA0000.jpg"
Added image './IMG-20220126-WA0000.jpg'

The image shows a newspaper page with the headline "A descrito da industria do futuro e". The article is written
in Portuguese and discusses the future of industry. It also includes an image of a building, which could be
related to the topic being discussed. The text on the page appears to be in a foreign language, but it seems to
focus on the development of industries for the future.

>>> "./IMG-20220330-WA0001.jpg"
Added image './IMG-20220330-WA0001.jpg'

I don't want to take more of your time, but I'll put the part of my script that gets the image descriptions here:


import ollama

client = ollama.Client(host='http://localhost:11434', timeout=30)

class Describer:
    instruction = 'Describe the image using strong key-words.'

    def describe(self, file_path):
        try:
            result = client.generate(
                model='moondream',
                prompt=self.instruction,
                images=[file_path],
                stream=False
            )['response']
            #img = Image.open(file_path, mode='r')
            #img = img.resize([int(i/1.2) for i in img.size])
            #img.show()
            return result
        except Exception as e:  # a bare except would also swallow KeyboardInterrupt
            print("\n***TIMED_OUT*** " + file_path + ": " + str(e) + "\n")
            return 'Timed out.'

describer = Describer()
rick-github commented 2 months ago

It does help to add an instruction to the prompt; try "describe this image: ./IMG-20220330-WA0001.jpg".

rick-github commented 2 months ago

Your python script worked fine for me for both images:

$ ./6365.py IMG-20220330-WA0001.jpg 

The image is a newspaper article in Spanish, titled "JBS Registra Lucro de RS 20,5 bilhos em 2021 e faz o melhor ano da historia". The headline of the article is prominently displayed at the top. The text is written in white against a dark background, making it easy to read and understand.

$ ./6365.py IMG-20220330-WA0002.jpg 

The image is a screenshot of an article from a newspaper, with the headline "A. Atividade do futuro da industria" prominently displayed at the top. The main body of the article is written in white text against a dark background, and it appears to be a summary or a news report about a new industrial project.

The article is divided into several paragraphs, with each paragraph containing different pieces of information related to the industrial project. The layout of the article seems to follow a typical newspaper format, with the headline at the top followed by the main body and then possibly some additional details or quotes at the bottom.
carlosforster commented 2 months ago

Yes. It worked indeed.

>>> Describe "./IMG-20220330-WA0001.jpg"
Added image './IMG-20220330-WA0001.jpg'

The image is a newspaper article in Spanish that provides information about the registration of JBS, a Brazilian
bank. The headline reads "JBS Registra Lucro de RS 20,5 bilhos em 2021 e faz o melhor ano da historia", which
translates to "JBS registered for $20,5 bilhos in 2021". The article also includes the names of the companies
involved and a link to their website.

The text is written in Spanish, with some words appearing in both English and Portuguese. The layout of the
newspaper page features a gray background, white text, and black text elements such as logos or bullet points.

Thanks again.

Another time, I'll try to figure out what my script is doing wrong (it failed on 80 of about 3000 images when I ran it a month ago) so we can be sure there is no real issue.

carlosforster commented 2 months ago

I started running my script again and the ollama server crashed after some images. The request timed out in 30 seconds as requested, but I had to manually restart the server to continue.

I found some images for which, even in the CLI, moondream returns neither a description nor a bounding box; in the CLI, however, the ollama server doesn't seem to crash.

In the online moondream playground, it returns an acceptable description for the image that didn't work in my local setup. Maybe the difference is that I'm using a quantized version in ollama versus, probably, the fp16 version online.

I'll have to further isolate the problem so we can look into it.

...
time=2024-08-16T12:34:24.272-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/16 - 12:34:25 | 200 |    946.0844ms |       127.0.0.1 | POST     "/api/generate"
time=2024-08-16T12:34:25.296-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/16 - 12:34:26 | 200 |    1.0440516s |       127.0.0.1 | POST     "/api/generate"
time=2024-08-16T12:34:26.354-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/16 - 12:34:27 | 200 |    831.2337ms |       127.0.0.1 | POST     "/api/generate"
time=2024-08-16T12:34:27.211-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/16 - 12:34:28 | 200 |     855.941ms |       127.0.0.1 | POST     "/api/generate"
time=2024-08-16T12:34:28.090-03:00 level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
[GIN] 2024/08/16 - 12:34:58 | 500 |   30.0216207s |       127.0.0.1 | POST     "/api/generate"
carlosforster commented 2 months ago

Looks like the problem is with the quantized version of Moondream 2.

D:\ImageAnot>ollama run moondream:1.8b-v2-fp16
>>> Describe "./IMG-20200711-WA0005.jpg"
Added image './IMG-20200711-WA0005.jpg'

The image shows a screenshot of a webpage that displays text in both Spanish and English. The website is
organized with the main content written in white, while other information is presented in smaller, red font below
it. The page contains various columns containing different details about media and audio, with some columns
showing more detail than others. This combination of language elements and visual organization makes it an
accessible resource for a broader audience interested in multimedia and related subjects.

>>> /bye

D:\ImageAnot>ollama run moondream
>>> Describe "./IMG-20200711-WA0005.jpg"
Added image './IMG-20200711-WA0005.jpg'

>>>
carlosforster commented 2 months ago

Tested 2967 images with moondream fp16: 44 of them produced an empty string as output. The server didn't crash this time, as it did with the 4-bit quantized version of the model.
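A sweep like this one can be tallied with a small helper. This is a sketch; it assumes a `describe(path)` callable such as the `Describer.describe` method posted earlier in the thread:

```python
def tally_empty_outputs(paths, describe):
    """Run describe() over every image path and collect the failures.

    Returns (ok_count, empty_paths), where empty_paths lists images whose
    description was empty or whitespace-only, matching the 44-of-2967
    failure mode reported above.
    """
    empty_paths = []
    ok = 0
    for path in paths:
        if describe(path).strip():
            ok += 1
        else:
            empty_paths.append(path)
    return ok, empty_paths
```

Keeping the list of failing paths makes it easy to re-test exactly those images against the fp16 and q4_0 tags.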