gptscript-ai / gpt4-v-vision

8 stars 4 forks source link

Screenshot to structured data #11

Open kaihendry opened 5 months ago

kaihendry commented 5 months ago

Inspired by https://youtu.be/g3NtJatmQR0?t=133 I was hoped to turn the screenshot of www texasregionalradio com_charts_Week142024Top100 html

into structured JSON with the prompt:

tools: github.com/gptscript-ai/vision

Read billboard.png which is a billboard table of songs.

Each row should be converted into a JSON object.

Keys should at least include:
* Wks_on_chart
* Title
* Artist

However it doesn't work saying it essentially can't ready text from an image.

njhale commented 5 months ago

@kaihendry Thanks for submitting this issue!

By default, the gpt4-v-vision tool uses OpenAI's gpt-4-turbo model (previously known as gpt-4-vision-preview) to interpret images. Skimming through the OpenAI docs, I didn't see anything mentioning OCR-related limitations specifically, but I did find a community thread where folks were encountering similar issues. In that thread it looks like it has become increasingly difficult to get decent OCR results via OpenAI's API and model. At the moment, it's unclear to me what OpenAI's official level of support is for OCR

We always have the option of writing another vision tool for a non-OpenAI model if we can find one with better OCR support too.

In the meantime, when I get the chance I'll try to repro your issue.