simonw / llm-mistral

LLM plugin providing access to Mistral models using the Mistral API
Apache License 2.0
149 stars 14 forks source link

Support for pixtral using LLM 0.17 attachments #12

Closed simonw closed 2 days ago

simonw commented 2 days ago

https://docs.mistral.ai/capabilities/vision/

Using:

simonw commented 2 days ago

In mistral_models.json:

{
      "id": "pixtral-12b-2409",
      "object": "model",
      "created": 1729104658,
      "owned_by": "mistralai",
      "name": "pixtral-12b-2409",
      "description": "Official pixtral-12b-2409 Mistral AI model",
      "max_context_length": 131072,
      "aliases": [
        "pixtral-12b",
        "pixtral-12b-latest"
      ],
      "deprecation": null,
      "capabilities": {
        "completion_chat": true,
        "completion_fim": false,
        "function_calling": true,
        "fine_tuning": false,
        "vision": true
      },
      "type": "base"
    }

So we can look for "capabilities": {"vision": true}

simonw commented 2 days ago

Got it working, it's interesting though:

llm -m pixtral-12b 'return just the text' -a ../llm/example.jpg
Example handwriting
Let's try this out

It's quite varied in its response:

% llm -m pixtral-12b 'ocr' -a ../llm/example.jpg 

Example handwriting Let's try this out

% llm -m pixtral-12b 'ocr' -a ../llm/example.jpg
Certainly! Here is the OCR (Optical Character Recognition) output for the image provided:

---

**Example handwriting**

Let's try this out

---
% llm -m pixtral-12b 'ocr' -a ../llm/example.jpg
Sure, here's an example of handwriting:

**Example handwriting**

Let's try this out

And I got this at one point, with a system prompt:

llm -m pixtral-12b 'ocr' -a https://static.simonwillison.net/static/2024/example-handwriting.jpg --system 'return just the text'
```python
{
  "ocr": [
    {
      "text": "Example handwriting",
      "bounding_box": {
        "top": 52.72,
        "left": 153.68,
        "width": 150.64,
        "height": 36.88
      }
    },
    {
      "text": "Let's try this out",
      "bounding_box": {
        "top": 99.76,
        "left": 149.76,
        "width": 145.48,
        "height": 36.88
      }
    }
  ]
}
simonw commented 2 days ago

Surprising error:

llm -m pixtral-12b 'what species is this?' -a ../llm/demo-pics/cat.jpeg 

Error: 500: invalid_request_error - Image data:image/jpeg;base64,/9j/4AA... has an invalid format.Allowed formats are JPEG,PNG,WEBP,GIF.


It's a real JPEG though.
simonw commented 2 days ago

OK, not sure why but I think this is a Pixtral bug:

llm -m pixtral-12b 'describe' -a https://static.simonwillison.net/static/2024/rocks.jpeg

Error: 500: invalid_request_error - Image https://static.simonwillison.n has an invalid format.Allowed formats are JPEG,PNG,WEBP,GIF.

https://static.simonwillison.net/static/2024/rocks.jpeg

But it's a valid JPEG. And this one works: https://static.simonwillison.net/static/2024/earth.jpg

llm -m pixtral-12b 'describe' -a https://static.simonwillison.net/static/2024/earth.jpg

The image shows a large screen displaying an educational interface related to Earth and its surface composition. The screen features a prominent image of the Earth, highlighting the continents and oceans. To the right of the Earth image, there is a pie chart illustrating the composition of Earth's surface, with sections labeled "Land," "Water," and "Ice." The interface at the top of the screen appears to be a browser or software dashboard with tabs open, such as "Earth" and "Surface Composition."

The background of the interface includes additional tabs, likely representing other topics or sections within the educational software. The environment surrounding the screen seems to be a modern indoor setting, possibly an educational facility or museum, given the high-quality visual display and the structured layout. There are also some plants visible at the bottom of the image, contributing to the indoor aesthetics.

simonw commented 2 days ago

Reported it as an issue:

simonw commented 2 days ago

Conversation works:

% llm -m pixtral-12b 'describe this image in three words' -a https://static.simonwillison.net/static/2024/earth.jpg
Screen displaying Earth's surface composition
% llm -c 'now more detail'
The image shows a large screen displaying an interactive educational interface about Earth's surface composition. The screen features a detailed visualization of the Earth with clearly marked continents and oceans in natural colors. Adjacent to the Earth image is a pie chart illustrating the percentages of different components of Earth's surface. The interface at the top of the screen includes several tabs labeled with topics such as "Solar System," "Galaxy," "Neutron Stars," "Black Holes," and "Star Systems," indicating a broader range of accessible scientific content. The screen is mounted on a wall with a wooden slatted design at the bottom, and there are some green plants visible on the sides.

It did hallucinate the tabs though.