Vision models stop working after first attempt

MarianoMolina commented 1 month ago

I'm seeing a weird behavior with vision models.

I am using the Default LM Studio Windows config, which is the only one I have been able to get vision models to work with.

I have tried 2 different models: xtuner's llava llama 3 f16 and jartine's llava v 1.5.

With both models, in both the chat interface and the local API deployment (using the vision example), when I ask for an image description I get a perfect description on the first request and then a random response on every request after that (usually mentioning collages). I'm not sure what's causing this, but it's fairly consistent.

Might be related to this issue: https://github.com/lmstudio-ai/.github/issues/26

MarianoMolina commented 1 month ago

Here is the snippet I am running:

from openai import OpenAI
import base64

default_client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")


def describe_image(image_path: str, client: OpenAI = default_client) -> str:
    # Read the image and encode it to base64 (stray single quotes are stripped from the path):
    try:
        with open(image_path.replace("'", ""), "rb") as f:
            image = f.read()
        base64_image = base64.b64encode(image).decode("utf-8")
    except OSError:
        print("Couldn't read the image. Make sure the path is correct and the file exists.")
        raise SystemExit(1)

    completion = client.chat.completions.create(
        model="xtuner/llava-llama-3-8b-v1_1-gguf",
        messages=[
            {
                "role": "system",
                "content": "This is a chat between a user and an assistant. The assistant is helping the user to describe an image.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            },
        ],
        max_tokens=1000,
        stream=True
    )
    response = ""
    for chunk in completion:
        if chunk.choices[0].delta.content:
            chunk_str = chunk.choices[0].delta.content
            response += chunk_str
            print(chunk_str, end="", flush=True)
    print("")
    del image, base64_image
    return response

if __name__ == "__main__":
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    image_path = "PATH"
    response = describe_image(image_path, client)
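One detail in the snippet above: the data URL is always labeled `image/jpeg`, whatever file is passed in. Whether that has any bearing on the behavior reported here is not known. Below is a minimal sketch of building the data URL with a MIME type guessed from the file extension instead. The helper name `image_to_data_url` is hypothetical and is not part of the LM Studio vision example.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str) -> str:
    """Return a base64 data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    # Reject paths whose extension does not map to an image type.
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {image_path}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string could be used as the `"url"` value in the `image_url` part of the request.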

The image is a portrait of a young woman outdoors. Here are 2 example completions generated right after the model is loaded:

The image presents a serene outdoor setting with a young woman as the central figure. She is seated on the edge of a rectangular swimming pool, her left hand resting gently on its surface. Her attire consists of a pink bikini top and matching bottoms, adding a pop of color to the scene.

Her hair, styled in loose waves, cascades down her back, catching the light just so. She gazes off into the distance with an air of contemplation, her expression unreadable but intriguing.

The pool she sits on is rectangular, its edge curved and smooth. It's clear water reflects the sky above, hinting at a day spent under the sun. The pool is set against a backdrop of lush greenery, with palm trees reaching up towards the blue sky.

Scattered around the pool are several lounge chairs, their pink cushions echoing her bikini top and creating a sense of harmony in the scene. They invite relaxation and leisure, suggesting that this is a place where one might come to unwind and soak up the sun.

The image captures a moment frozen in time, a snapshot of tranquility amidst the vibrant hues of nature and human-made comfort. It's a scene that speaks volumes about the joy of simple pleasures - a day by the pool under the shade of palm trees, perhaps with good company.

The image presents a young woman who appears to be in her mid-20s. She has long, brown hair that cascades down her shoulders. Her gaze is directed towards the camera, engaging the viewer with a slight smile playing on her lips.

She is wearing a white top, which contrasts with the dark background behind her. The background is blurred, drawing focus to the woman in the foreground. Despite the blur, one can discern that there are trees and what appears to be a building further back.

The overall composition of the image suggests it might have been taken outdoors during the day under natural light. The woman's relaxed demeanor and the casual setting give the image a friendly and approachable feel. However, without ad

Here are 4 example completions generated by running the same code again:

The image you've provided appears to be a close-up of a person's face, but due to the low resolution and grainy texture, it's challenging to make out specific details. The colors are predominantly red, green, blue, and black. However, without more clarity, it's difficult to confidently identify any objects or describe their actions. If you could provide more context or clarify your request, I'd be happy to assist further!Response: The image you've provided appears to be a close-up of a person's face, but due to the low resolution and grainy texture, it's challenging to make out specific details. The colors are predominantly red, green, blue, and black. However, without more clarity, it's difficult to confidently identify any objects or describe their actions. If you could provide more context or clarify your request, I'd be happy to assist further!

The image presents a vibrant collage of red velvet cupcakes with pink frosting. Each cupcake is adorned with a purple flower at its center, adding a touch of elegance to their appearance. The cupcakes are arranged in a grid-like pattern, creating a visually pleasing symmetry. The background of the image is blurred, allowing the cupcakes to stand out prominently. Interestingly, the image also contains text that reads "I'm sorry, I don't understand what you're asking." This message seems to be part of the collage and not related to any specific action or object within the image. The overall composition of the image suggests it might be a creative representation of cupcakes.END: Response: The image presents a vibrant collage of red velvet cupcakes with pink frosting. Each cupcake is adorned with a purple flower at its center, adding a touch of elegance to their appearance. The cupcakes are arranged in a grid-like pattern, creating a visually pleasing symmetry. The background of the image is blurred, allowing the cupcakes to stand out prominently. Interestingly, the image also contains text that reads "I'm sorry, I don't understand what you're asking." This message seems to be part of the collage and not related to any specific action or object within the image. The overall composition of the image suggests it might be a creative representation of cupcakes.

This collage seems to be showcasing the versatility of this red velvet cake design. It's interesting to see how different backgrounds can enhance or contrast the vibrant colors of the cake! 🎂END: Response: The image you've shared is a collage of multiple images of a single object, a red velvet cake. Here's a detailed description:

There are 9 instances of the cake in the collage.

The cake is centrally located in each instance.

Each cake is topped with a generous amount of frosting and features a brown fondant swirl on top.

The background color of each image varies, including shades of pink, green, blue, and yellow.

This collage seems to be showcasing the versatility of this red velvet cake design. It's interesting to see how different backgrounds can enhance or contrast the vibrant colors of the cake! 🎂

The image features multiple people, possibly sitting down or standing near each other. They are located at various distances from one another, creating a sense of depth and dimension within the scene. The people appear to be facing different directions, which adds variety and interest to the picture. In addition to the people, there is an interestingly textured wall visible in the background, enhancing the overall ambiance of the image.RESPONSE different directions, which adds variety and interest to the picture. In addition to the people, there is an interestingly textured wall visible in the background, enhancing the overall ambiance of the image.

Same behavior when using the LM Studio chat interface. First response:

The image portrays a young woman, who is the central figure in the frame. Her hair, dark and wavy, cascades down her shoulders. She's wearing a white tank top, which contrasts with her skin tone and adds a touch of innocence to her appearance.

She's looking directly at the camera, her gaze steady and engaging. A slight smile graces her lips, adding a hint of warmth to her expression. This is further accentuated by her red lipstick, which stands out against her complexion.

The background of the image reveals a park setting. The lush green trees and the winding path suggest an environment of tranquility and natural beauty.

There are no discernible texts or other objects in the image. The woman's position relative to the background elements suggests she is standing on the path, surrounded by the verdant foliage.

The image captures a moment of serene beauty, with the woman as its focal point amidst a backdrop of nature's splendor.

Regenerate response:

The image you've described appears to be a close-up of a fabric with a blurred pattern. The fabric is primarily white, but it's adorned with darker shades in hues of red and black. The photo is taken from an angle, giving the impression that the fabric is slightly wrinkled. As for the instruction "sa_17641", I'm afraid I couldn't find any relevant information related to this code in my current knowledge or web search results[^1^][^2^][^3^]. It might be helpful if you could provide more context or details.

MarianoMolina commented 4 weeks ago

Hello?

yagil commented 4 weeks ago

@MarianoMolina I understand the issue shows up in both the API and the UI.

In the UI specifically, is it the case that you click "regenerate" and the model then gives an unrelated response?

MarianoMolina commented 4 weeks ago

Yes. If I click Regenerate after the first response, I get an unrelated response. If I reload the model and regenerate, it works fine.

Essentially, the first time I prompt the model after loading, it works. After that, it doesn't.
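A minimal sketch of how the "first request works, later ones don't" pattern could be reproduced in a single run. It sends the identical request several times without reloading the model and prints the start of each reply. It uses only the standard library and turns streaming off so the streaming loop is ruled out as a factor. The function names (`build_payload`, `post_chat`, `repeat_request`), the attempt count, and the `max_tokens` value are assumptions for illustration. The server address, placeholder key, and model name are taken from the snippet earlier in the thread.

```python
import base64
import json
import urllib.request

# Assumption: the LM Studio local server runs on the address used earlier in this thread.
BASE_URL = "http://localhost:1234/v1"


def build_payload(image_bytes: bytes, model: str, prompt: str = "What's in this image?") -> dict:
    """Build one OpenAI-style chat request carrying a base64-encoded image."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
        "max_tokens": 200,
        "stream": False,  # non-streaming, so each reply arrives as one JSON body
    }


def post_chat(payload: dict, base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",  # same placeholder key as the snippet
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def repeat_request(payload: dict, attempts: int = 3, send=post_chat) -> list:
    """Send the identical payload several times and collect the replies in order."""
    return [send(payload) for _ in range(attempts)]


if __name__ == "__main__":
    with open("PATH", "rb") as f:  # placeholder path, as in the original snippet
        payload = build_payload(f.read(), model="xtuner/llava-llama-3-8b-v1_1-gguf")
    for i, reply in enumerate(repeat_request(payload), start=1):
        print(f"--- attempt {i} ---\n{reply[:200]}\n")
```

If the behavior described above holds, attempt 1 should describe the actual image and later attempts should drift to unrelated content. Running it once right after loading the model and once more without reloading would show whether the state carries across separate runs as well.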

MarianoMolina commented 2 weeks ago

Should I be posting this in another place? It seems like this is dead...

lmstudio-ai / lmstudio-bug-tracker

Vision models stop working after first attempt #36