icereed / paperless-gpt

Use LLMs and LLM Vision to handle paperless-ngx
MIT License

Feature Request: use ollama to redo/improve the OCR #20

Open thiswillbeyourgithub opened 1 month ago

thiswillbeyourgithub commented 1 month ago

Hi,

I've been playing around with using ollama to generate OCR content and was wondering if you were planning on adding the ability to use a vision LLM to produce the OCR content for paperless.

I am following several projects that allow using LLMs with paperless, but none so far use vision LLMs. I recently made a post in an issue for doclytics here, where I briefly outline what's currently possible.

As paperless-gpt seems to have the workflow that best fits my setup, I was wondering what you think of this. That's really the only missing piece before I take the jump!

Have a nice day.

icereed commented 1 month ago

Hi @thiswillbeyourgithub, I'm very interested to give it a try. I imagine a third checkbox in the UI saying "Enhance OCR" or similar. What do you think?

thiswillbeyourgithub commented 1 month ago

Glad you're interested!

Not a bad idea. So the buttons would be "Refresh", "Generate tags", and "Enhance OCR"? Sometimes it would make sense to run both at the same time, so a fourth button "Both" could make sense. Or, instead of buttons, use tick boxes and a single "Execute" button?

Btw, an update just dropped that might motivate you even further :) https://github.com/B-urb/doclytics/issues/97#issuecomment-2425299864

icereed commented 1 month ago

Nice! I guess the first step is to get the images for each page from the paperless API. I don't know if there is a document-to-image API already or if we have to convert the PDF to PNG or similar.

icereed commented 1 month ago

Converting PDFs to images in Go doesn't seem to be complicated. Additionally, Go is very fast: https://github.com/Mindinventory/Golang-PDF-to-Image-Converter
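
Just to make this concrete, here is a rough sketch of the page-to-image step using the go-fitz bindings (my assumption, not necessarily the library linked above, and not the actual paperless-gpt code):

```go
// Hedged sketch: render each page of a PDF to a PNG file using go-fitz
// (github.com/gen2brain/go-fitz). Illustrative only.
package main

import (
	"fmt"
	"image/png"
	"os"

	"github.com/gen2brain/go-fitz"
)

func pdfToImages(pdfPath string) error {
	doc, err := fitz.New(pdfPath)
	if err != nil {
		return err
	}
	defer doc.Close()

	for n := 0; n < doc.NumPage(); n++ {
		img, err := doc.Image(n) // render page n as an image.Image
		if err != nil {
			return err
		}
		out, err := os.Create(fmt.Sprintf("page-%03d.png", n))
		if err != nil {
			return err
		}
		if err := png.Encode(out, img); err != nil {
			out.Close()
			return err
		}
		out.Close()
	}
	return nil
}

func main() {
	if err := pdfToImages("document.pdf"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```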

thiswillbeyourgithub commented 1 month ago

I checked a bit and didn't see anything to get the image from the doc directly.

Though I did stumble upon things related to document history. Do you know if paperless is able to remember the history of documents? If so, it would be very handy to store the new OCR content without dealing with XML tags too much. What I mean is that overwriting the content with the new OCR is not a big deal if the old version is in the history anyway. What do you think? https://docs.paperless-ngx.com/api/#file-uploads
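
To sketch what I mean (just an illustration against the standard paperless-ngx REST API, not something from paperless-gpt itself):

```go
// Hedged sketch: overwrite a document's "content" field via the paperless-ngx
// REST API (PATCH /api/documents/{id}/). Since paperless keeps edits in the
// document history, the previous OCR text stays recoverable.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func updateContent(baseURL, token string, documentID int, newContent string) error {
	payload, err := json.Marshal(map[string]string{"content": newContent})
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/api/documents/%d/", baseURL, documentID)
	req, err := http.NewRequest(http.MethodPatch, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Token "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s when updating document %d", resp.Status, documentID)
	}
	return nil
}

func main() {
	// Hypothetical base URL, token, and document ID for illustration.
	if err := updateContent("http://localhost:8000", "xxxx", 325, "new OCR text"); err != nil {
		fmt.Println(err)
	}
}
```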

Also, just checking: does your app write logs to a file? It would be quite reassuring to know that there is always a way to backtrack.

icereed commented 1 month ago

I just checked and indeed, my paperless is showing the edits made by paperless-gpt in the document history. That's neat!

Currently, the logs are pushed to stdout. With log files come a few challenges regarding log rotation and not filling up the users' disks. But I do see the point of having a way to backtrack changes.
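
If file logging ever gets added, a rotation library would keep disk usage bounded. A minimal sketch (the LOG_FILE variable is hypothetical, not an existing paperless-gpt option):

```go
// Hedged sketch: optional file logging with rotation via lumberjack
// (gopkg.in/natefinch/lumberjack.v2), while still writing to stdout.
// The LOG_FILE environment variable is hypothetical.
package main

import (
	"io"
	"log"
	"os"

	"gopkg.in/natefinch/lumberjack.v2"
)

func setupLogging() {
	if path := os.Getenv("LOG_FILE"); path != "" {
		rotating := &lumberjack.Logger{
			Filename:   path,
			MaxSize:    10, // megabytes per file before rotation
			MaxBackups: 3,  // keep at most three rotated files
			MaxAge:     28, // days to retain old files
		}
		log.SetOutput(io.MultiWriter(os.Stdout, rotating))
	}
}

func main() {
	setupLogging()
	log.Println("paperless-gpt started")
}
```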

icereed commented 3 weeks ago

@thiswillbeyourgithub, if you feel like playing around with Go, you can test an early prototype :)

https://github.com/icereed/paperless-gpt/pull/29/files#r1817924426

thiswillbeyourgithub commented 3 weeks ago

Hey, thanks a bunch for giving this a try. Unfortunately, the only thing I know about Go is its name, so I won't be of any help at this stage :/

I'm still keeping an eye on this project so I know when I can give it a spin, and I can't wait :)!

icereed commented 3 weeks ago

@thiswillbeyourgithub now you can try an early prototype:

It can be used with the Docker tag icereed/paperless-gpt:unreleased.

I added a description of the environment variables with an example: https://github.com/icereed/paperless-gpt?tab=readme-ov-file#docker-compose

You can also adjust the prompt for vision. Happy to get feedback on how the prompt can be improved.

icereed commented 3 weeks ago

Ah, sorry, the docker push did not work. The image is building now and should be available in a few minutes. Sorry 😅

mikekaldig commented 3 weeks ago

One question: can we use this config:

version: '3.8'
services:

  paperless-gpt:
    #image: icereed/paperless-gpt:v0.4.0
    image:  icereed/paperless-gpt:unreleased
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      PAPERLESS_BASE_URL: 'http://10.10.10.122:8777'
      PAPERLESS_API_TOKEN: 'xxxx'
      LLM_PROVIDER: 'openai' # or 'ollama'
      LLM_MODEL: 'gpt-4o-mini'     # or 'llama2'
      OPENAI_API_KEY: 'xxx' # Required if using OpenAI
      LLM_LANGUAGE: 'German' # Optional, default is 'English'
      OLLAMA_HOST: http://10.10.10.251:11434 # Useful if using Ollama
      VISION_LLM_PROVIDER: 'ollama' # Optional, for OCR
      VISION_LLM_MODEL: 'x/llama3.2-vision:latest' # Optional, for OCR
    ports:
      - '8080:8080'
#    depends_on:
#      - webserver
    volumes:
#      - /share/Container/paperlessgpt:/
      - ./prompts:/app/prompts # Mount the prompts directory

or is it only possible to use either ollama or openai?

icereed commented 3 weeks ago

Yes, you can mix. I personally use ollama for vision and openai for the other suggestions.

mikekaldig commented 3 weeks ago

It seems as if only OpenAI is being used. Ollama is not loading. I used this prompt:

I will provide an image of a document with partially read content (from OCR). Your task is to analyze the image, extract key information, and determine an appropriate title for the document.
Respond only with the title in {{.Language}} and add a fitting emoji at the beginning of the title for visual appeal.

Image: {{Image}}

Ensure the title is in {{.Language}} and effectively represents the document’s content.

any idea?

icereed commented 3 weeks ago

So usually, when VISION_LLM_PROVIDER and VISION_LLM_MODEL values are set, the UI will show this button:

image

The OCR screen looks like this:

image

You need to enter a document ID.

The result will look like this:

image

icereed commented 3 weeks ago

P.S: @mikekaldig be sure to run docker compose pull first, the image is hot and fresh :)

mikekaldig commented 3 weeks ago

Okay :) fixed for me. But there is an error msg: Screenshot 2024-10-28 213830_222

mikekaldig commented 3 weeks ago

http://10.10.10.122:8777/documents/325/details

The document exists.

icereed commented 3 weeks ago

This might only be a UI issue, sorry. Can you still click on the button? It should work.

icereed commented 3 weeks ago

P.S: You need to pull the specified ollama model first using ollama pull x/llama…

mikekaldig commented 3 weeks ago

Yeah, it's only a UI bug. I can click the button:

Great work :)

Screenshot 1

icereed commented 3 weeks ago

Kudos to @thiswillbeyourgithub for bringing in this idea. 💡 The next step is to integrate the OCR feature into a nice workflow. The OCR takes quite some time on my machine (60 s per page minimum). In the backend I already designed OCR as an asynchronous job/queue architecture. Maybe that's something we can put into the background…
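
To make the idea concrete, a rough sketch of that pattern (the general shape only, not the actual paperless-gpt backend code):

```go
// Hedged sketch of the general pattern: OCR requests go into a buffered
// channel and a background worker processes them one by one, so slow pages
// never block the HTTP handlers.
package main

import (
	"log"
	"time"
)

type ocrJob struct {
	DocumentID int
}

func startOCRWorker(jobs <-chan ocrJob, runOCR func(documentID int) error) {
	go func() {
		for job := range jobs {
			if err := runOCR(job.DocumentID); err != nil {
				log.Printf("OCR failed for document %d: %v", job.DocumentID, err)
				continue
			}
			log.Printf("OCR finished for document %d", job.DocumentID)
		}
	}()
}

func main() {
	jobs := make(chan ocrJob, 100) // queue up to 100 documents

	startOCRWorker(jobs, func(documentID int) error {
		// Placeholder: render pages, call the vision LLM, write the content back.
		time.Sleep(time.Second)
		return nil
	})

	jobs <- ocrJob{DocumentID: 325} // hypothetical document ID
	time.Sleep(2 * time.Second)     // a real server would block on HTTP instead
}
```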

mikekaldig commented 3 weeks ago

I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.

JonasHess commented 3 weeks ago

Nice work!

icereed commented 3 weeks ago

Just found out: if you want to run x/llama3.2-vision:latest you need at least ollama v0.4.0 (only a release candidate at the moment): https://github.com/ollama/ollama/releases/tag/v0.4.0-rc5

icereed commented 3 weeks ago

> I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.

@mikekaldig how fast/slow was the OCR for you and what hardware do you use to run Ollama?

JonasHess commented 3 weeks ago

With my Nvidia Quadro P400 (only 2GB VRAM) it took about 3-4 minutes per page at 300 DPI.

icereed commented 3 weeks ago

> Okay :) fixed for me. But there is an error msg: Screenshot 2024-10-28 213830_222

This bug is fixed now with #37 🙂

thiswillbeyourgithub commented 3 weeks ago

Great work, it seems! Unfortunately the timing is as bad as can be for me, so don't wait on me for a review :)

Have you figured out a way to periodically rerun the OCR as better models and prompts are rolled out? For example, say I scan my entire collection with the minicpm model; how easy is it to cleanly re-OCR the entire collection with llama 3.2 once it's out of beta?

mikekaldig commented 3 weeks ago

> I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.
>
> @mikekaldig how fast/slow was the OCR for you and what hardware do you use to run Ollama?

Hey, I use an Nvidia P40 with Ollama 0.4.0-rc5 and Llama 3.2 Vision; it needs ~60 sec for one page.

```
Every 2.0s: nvidia-smi                                 server1: Tue Oct 29 19:31:17 2024

Tue Oct 29 19:31:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                Persistence-M  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                     Off  | 00000000:01:00.0 Off |                  Off |
| N/A   52C   P0             53W / 250W   | 11350MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|========================================================================================|
|    0   N/A  N/A    3424810      C   ...unners/cuda_v12/ollama_llama_server    11348MiB |
+---------------------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:01:00.0
```

icereed commented 3 weeks ago

> Great work, it seems! Unfortunately the timing is as bad as can be for me, so don't wait on me for a review :)
>
> Have you figured out a way to periodically rerun the OCR as better models and prompts are rolled out? For example, say I scan my entire collection with the minicpm model; how easy is it to cleanly re-OCR the entire collection with llama 3.2 once it's out of beta?

So that's the tricky question: do we want to control this via a tag? Something like paperless-gpt-ocr? If yes, do we want to manually kick off the OCR batch processing via a UI so it can be reviewed before OCR does its thing? Or rather, do we want documents to be pulled automatically into a job queue where one document gets OCR'ed after another?

mikekaldig commented 3 weeks ago

I think every user probably has a different way of handling this. Personally, I prefer it to be processed sequentially in the background. But maybe you could alternatively implement a switch function that allows for personalization. I also think you're probably now getting into the topic of whether there should be a settings area. :)

icereed commented 3 weeks ago

Another thing I wonder: some documents have perfect contents already since they come in as digital documents (PDFs). We could theoretically let an LLM decide if it’s worth it to do an OCR again. But that could also be too much unexpected magic for the users…
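
As an alternative to asking an LLM, a cheap heuristic could just check whether the PDF already carries an embedded text layer and skip re-OCR if it does. A rough sketch with go-fitz (my assumption; the 50-character threshold is arbitrary):

```go
// Hedged sketch: a cheap text-layer heuristic instead of an LLM decision.
// If a page already has embedded text of reasonable length, the document
// probably came in digitally and re-OCR can be skipped.
package main

import (
	"fmt"
	"strings"

	"github.com/gen2brain/go-fitz"
)

func hasTextLayer(pdfPath string) (bool, error) {
	doc, err := fitz.New(pdfPath)
	if err != nil {
		return false, err
	}
	defer doc.Close()

	for n := 0; n < doc.NumPage(); n++ {
		text, err := doc.Text(n) // extract the embedded text of page n
		if err != nil {
			return false, err
		}
		if len(strings.TrimSpace(text)) > 50 { // arbitrary "real text" threshold
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := hasTextLayer("document.pdf")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("has text layer:", ok)
}
```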

thiswillbeyourgithub commented 3 weeks ago

Personally, my use case would be to run the entire collection through it sequentially in the background, and run the whole thing all over again when there is a new model.

JonasHess commented 3 weeks ago

> Personally, my use case would be to run the entire collection through it sequentially in the background, and run the whole thing all over again when there is a new model.

This is my use case as well. I also don't really have a need for the web ui.

icereed commented 3 weeks ago

So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.
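
For illustration, roughly what that MVP could look like (the tag name "paperless-gpt-ocr" and the tags__name__iexact filter are assumptions on my side; check the paperless-ngx API docs):

```go
// Hedged sketch: fetch all documents carrying the OCR tag and hand their IDs
// to a worker queue. Pagination of the API response is omitted for brevity.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func enqueueTaggedDocuments(baseURL, token, tag string, jobs chan<- int) error {
	url := fmt.Sprintf("%s/api/documents/?tags__name__iexact=%s", baseURL, tag)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var page struct {
		Results []struct {
			ID int `json:"id"`
		} `json:"results"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
		return err
	}
	for _, doc := range page.Results {
		jobs <- doc.ID // the OCR worker picks these up one by one
	}
	return nil
}

func main() {
	jobs := make(chan int, 100)
	if err := enqueueTaggedDocuments("http://localhost:8000", "xxxx", "paperless-gpt-ocr", jobs); err != nil {
		fmt.Println(err)
	}
	close(jobs)
}
```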

JonasHess commented 3 weeks ago

> So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.

It should also add a custom tag when it is done, so another pipeline can be triggered after.

icereed commented 3 weeks ago

Interesting, tag chaining 😄 I will give it a thought.

thiswillbeyourgithub commented 3 weeks ago

> So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.

I agree, and ideally the tag would be customizable, because "paperless-gpt" is quite long and can clutter the paperless UI a bit, I'd say.

thiswillbeyourgithub commented 3 weeks ago

Also, how do you handle documents that already have some fields manually edited? Does it overwrite them or ignore them?

thiswillbeyourgithub commented 2 weeks ago

Btw, ollama 0.4 just dropped and has support for the larger llama3.2 vision models.