thiswillbeyourgithub opened this issue 1 month ago
Hi @thiswillbeyourgithub, I'm very interested to give it a try. I imagine a third checkbox in the UI saying "Enhance OCR" or similar. What do you think?
Glad you're interested!
Not a bad idea. So the buttons would be "refresh", "generate tags" and "enhance OCR"? Maybe sometimes it would make sense to run both at the same time, so an additional "both" button could make sense. Or rather, instead of buttons, use tick boxes and a single "Execute" button?
Btw an update just dropped that might motivate you even further :) https://github.com/B-urb/doclytics/issues/97#issuecomment-2425299864
Nice! I guess the first step is to get the images for each page from the paperless API. I don't know if there is a document-to-image API already or if we have to convert PDF to PNG or similar.
Converting PDF to images in Go doesn't seem to be complicated. Additionally, Go is very fast: https://github.com/Mindinventory/Golang-PDF-to-Image-Converter
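In case it helps, here is a minimal sketch of that step using the go-fitz MuPDF bindings. This is just one option I'm assuming for illustration, not necessarily the library from the link above; the input filename is made up. It renders every page of a PDF to a PNG file:

package main

import (
	"fmt"
	"image/png"
	"os"

	"github.com/gen2brain/go-fitz" // MuPDF bindings, one of several options
)

func main() {
	// Hypothetical input file; in paperless-gpt this would be the document
	// downloaded from the paperless API.
	doc, err := fitz.New("document.pdf")
	if err != nil {
		panic(err)
	}
	defer doc.Close()

	for n := 0; n < doc.NumPage(); n++ {
		img, err := doc.Image(n) // renders the page as an image.Image
		if err != nil {
			panic(err)
		}

		out, err := os.Create(fmt.Sprintf("page-%03d.png", n))
		if err != nil {
			panic(err)
		}
		if err := png.Encode(out, img); err != nil {
			panic(err)
		}
		out.Close()
	}
}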
I checked a bit and didn't see anything to get the image from the doc directly.
Though I did stumble upon things related to document history. Do you know if paperless is able to remember the history of documents? Because if so, it would be very handy to store the new OCR content without dealing with XML tags too much. What I mean is that overwriting the content with the new OCR is not a big deal if it's in the history anyway. What do you think? https://docs.paperless-ngx.com/api/#file-uploads
Also, just checking: do you store your app's logs in a file? That would be quite reassuring, knowing there is always a way to backtrack.
I just checked and indeed, my paperless is showing the edits made by paperless-gpt in the document history. That's neat!
Currently, the logs are pushed to stdout. Log files come with a few challenges regarding log rotation and not filling up users' disks. But I do see the point of having a way to backtrack changes.
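If file logging ever gets added, a common pattern in Go is to keep stdout and additionally write to a size-rotated file, e.g. with lumberjack. This is only a sketch; the file path and the limits are made up and this is not what paperless-gpt does today:

package main

import (
	"io"
	"log"
	"os"

	"gopkg.in/natefinch/lumberjack.v2" // rotating file writer
)

func main() {
	// Hypothetical log location and limits, purely for illustration.
	fileLogger := &lumberjack.Logger{
		Filename:   "/var/log/paperless-gpt/app.log",
		MaxSize:    10, // MB per file before rotation
		MaxBackups: 3,  // keep at most 3 rotated files
		MaxAge:     28, // days, so the disk is not filled up forever
	}

	// Keep logging to stdout (for docker logs) and also to the rotated file.
	log.SetOutput(io.MultiWriter(os.Stdout, fileLogger))
	log.Println("paperless-gpt started")
}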
@thiswillbeyourgithub, if you like playing around with Go you can test an early prototype :)
https://github.com/icereed/paperless-gpt/pull/29/files#r1817924426
Hey, thanks a bunch for giving this a try. Unfortunately the only thing I know about Go is its name, so I won't be of any help at this stage :/
I'm still keeping an eye on this project to know when I can give it a spin, and I can't wait :)
@thiswillbeyourgithub now you can try an early prototype:
It can be used with the Docker tag icereed/paperless-gpt:unreleased.
I added a description of the env variables with an example: https://github.com/icereed/paperless-gpt?tab=readme-ov-file#docker-compose
You can also adjust the prompt for vision. Happy to get feedback on how the prompt can be improved.
Ah, sorry, the docker push did not work. The image is building now and should be available in a few minutes. Sorry 😅
One question: can we use this config:
version: '3.8'
services:
  paperless-gpt:
    #image: icereed/paperless-gpt:v0.4.0
    image: icereed/paperless-gpt:unreleased
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      PAPERLESS_BASE_URL: 'http://10.10.10.122:8777'
      PAPERLESS_API_TOKEN: 'xxxx'
      LLM_PROVIDER: 'openai' # or 'ollama'
      LLM_MODEL: 'gpt-4o-mini' # or 'llama2'
      OPENAI_API_KEY: 'xxx' # Required if using OpenAI
      LLM_LANGUAGE: 'German' # Optional, default is 'English'
      OLLAMA_HOST: http://10.10.10.251:11434 # Useful if using Ollama
      VISION_LLM_PROVIDER: 'ollama' # Optional, for OCR
      VISION_LLM_MODEL: 'x/llama3.2-vision:latest' # Optional, for OCR
    ports:
      - '8080:8080'
    # depends_on:
    #   - webserver
    volumes:
      # - /share/Container/paperlessgpt:/
      - ./prompts:/app/prompts # Mount the prompts directory
Or is it only possible to use either ollama or openai?
Yes, you can mix. I personally use ollama for vision and openai for the other suggestions.
It seems as if only OpenAI is being used. Ollama is not loading. I used this prompt:
I will provide an image of a document with partially read content (from OCR). Your task is to analyze the image, extract key information, and determine an appropriate title for the document.
Respond only with the title in {{.Language}} and add a fitting emoji at the beginning of the title for visual appeal.
Image: {{Image}}
Ensure the title is in {{.Language}} and effectively represents the document’s content.
Any idea?
So usually, when VISION_LLM_PROVIDER and VISION_LLM_MODEL values are set, the UI will show this button: (screenshot)
The OCR screen looks like this: (screenshot)
You need to enter a document ID.
The result will look like this: (screenshot)
P.S.: @mikekaldig be sure to run docker compose pull first, the image is hot and fresh :)
Okay :) fixed for me. But there is an error msg:
http://10.10.10.122:8777/documents/325/details
The document exists.
This might only be a UI issue, sorry. Can you still click on the button? It should work.
P.S: You need to pull the specified ollama model first using ollama pull x/llama…
Yeah, it's only a UI bug. I can click the button:
Great work :)
Kudos to @thiswillbeyourgithub for bringing in this idea. 💡 The next step is to integrate the OCR feature into a nice workflow. The OCR takes quite some time on my machine (60s per page minimum). In the backend I already designed OCR as an asynchronous job/queue architecture. Maybe that's something we can put into the background…
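For reference, a stripped-down version of such a background job/queue setup in Go could look like the sketch below. It is only an illustration, not the actual backend code; with a single worker, documents are processed strictly one after another:

package main

import (
	"fmt"
	"sync"
	"time"
)

// ocrJob represents one document queued for vision OCR.
// The fields are illustrative, not the actual paperless-gpt types.
type ocrJob struct {
	DocumentID int
}

func worker(id int, jobs <-chan ocrJob, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		// Placeholder for the real work: render pages, call the vision LLM,
		// write the result back to paperless.
		time.Sleep(2 * time.Second)
		fmt.Printf("worker %d finished OCR for document %d\n", id, job.DocumentID)
	}
}

func main() {
	const workers = 1 // one worker = strictly sequential background processing
	jobs := make(chan ocrJob, 100)

	var wg sync.WaitGroup
	for i := 1; i <= workers; i++ {
		wg.Add(1)
		go worker(i, jobs, &wg)
	}

	// Documents would normally come from the paperless API; these IDs are made up.
	for _, id := range []int{325, 326, 327} {
		jobs <- ocrJob{DocumentID: id}
	}
	close(jobs)
	wg.Wait()
}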
I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.
Nice work!
Just found out: if you want to run x/llama3.2-vision:latest, you need at least Ollama v0.4.0 (only a release candidate at the moment): https://github.com/ollama/ollama/releases/tag/v0.4.0-rc5
I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.
@mikekaldig how fast/slow was the OCR for you and what hardware do you use to run Ollama?
With my Nvidia Quadro P400 (only 2GB VRAM) it took about 3-4 minutes per page at 300 DPI.
Okay :) fixed for me. But there is an error msg:
This bug is fixed now with #37 🙂
Great work it seems! Unfortunately the timing is as bad as can be for me, so don't wait for me to make a review :)
Have you figured out a way to periodically rerun the OCR as better models and prompts are rolled out? For example, say I scan my entire collection with the model minicpm; how easy is it to cleanly re-OCR my entire collection with llama 3.2 when it's out of beta?
I just ran a page with completed forms through Llama 3.2 Vision. The result is noticeably better than standard OCR, especially with the handwritten sections, which are almost perfectly recognized. I think the prompting needs a bit of fine-tuning, but the result is fantastic.
@mikekaldig how fast/slow was the OCR for you and what hardware do you use to run Ollama?
Hey, I use an Nvidia P40 with Ollama v0.4.0-rc5 and Llama 3.2 Vision; it needs ~60 sec for one page.
Every 2.0s: nvidia-smi                                  server1: Tue Oct 29 19:31:17 2024

Tue Oct 29 19:31:17 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:01:00.0 Off |                  Off |
| N/A   52C    P0              53W / 250W | 11350MiB / 24576MiB  |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3424810      C   ...unners/cuda_v12/ollama_llama_server    11348MiB |
+---------------------------------------------------------------------------------------+

WARNING: infoROM is corrupted at gpu 0000:01:00.0
Great work it seems! Unfortunately the timing is as bad as can be for me, so don't wait for me to make a review :)
Have you figured out a way to periodically rerun the OCR as better models and prompts are rolled out? For example, say I scan my entire collection with the model minicpm; how easy is it to cleanly re-OCR my entire collection with llama 3.2 when it's out of beta?
So that's the tricky question: do we want to control this via a tag? Something like paperless-gpt-ocr?
If yes, do we want to manually kick off the OCR batch processing via a UI, so it can be reviewed before OCR does anything? Or should tagged documents automatically be pulled into a job queue where one document gets OCR'ed after another?
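For the tag-driven variant, the selection step against the paperless-ngx REST API could look roughly like the sketch below. The tag name and the tags__name__iexact filter parameter are assumptions on my side (paperless-ngx also offers id-based tag filters), and pagination is ignored here:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

// documentList is the minimal shape of the paperless-ngx document list
// response; only the fields needed here are declared.
type documentList struct {
	Results []struct {
		ID    int    `json:"id"`
		Title string `json:"title"`
	} `json:"results"`
}

func main() {
	base := os.Getenv("PAPERLESS_BASE_URL")
	token := os.Getenv("PAPERLESS_API_TOKEN")

	// Assumed filter and tag name, purely for illustration.
	q := url.Values{}
	q.Set("tags__name__iexact", "paperless-gpt-ocr")

	req, err := http.NewRequest("GET", base+"/api/documents/?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Token "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var docs documentList
	if err := json.NewDecoder(resp.Body).Decode(&docs); err != nil {
		panic(err)
	}
	for _, d := range docs.Results {
		fmt.Printf("queue document %d (%s) for OCR\n", d.ID, d.Title)
	}
}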
I think every user probably has a different way of handling this. Personally, I prefer it to be processed sequentially in the background. But maybe you could alternatively implement a switch function that allows for personalization. I also think you're probably now getting into the topic of whether there should be a settings area. :)
Another thing I wonder: some documents have perfect contents already since they come in as digital documents (PDFs). We could theoretically let an LLM decide if it’s worth it to do an OCR again. But that could also be too much unexpected magic for the users…
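If that route were ever explored, the decision step itself could be as small as the following sketch. visionDecider is a made-up interface standing in for whatever LLM client is actually used, and the prompt wording is only illustrative:

package decide

import "strings"

// visionDecider is a hypothetical interface standing in for the real LLM client.
type visionDecider interface {
	Complete(prompt string) (string, error)
}

// shouldReOCR asks the model whether the existing content already looks like
// clean, complete text (e.g. a born-digital PDF) or whether a vision OCR pass
// would likely add value.
func shouldReOCR(llm visionDecider, existingContent string) (bool, error) {
	prompt := "Answer YES or NO only. Is the following document text garbled, " +
		"incomplete or missing handwritten parts, so that a new OCR pass would help?\n\n" +
		existingContent
	answer, err := llm.Complete(prompt)
	if err != nil {
		return false, err
	}
	return strings.HasPrefix(strings.ToUpper(strings.TrimSpace(answer)), "YES"), nil
}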
Personally, my use case would be to run the entire collection through it sequentially in the background, and to run the whole thing all over again when there is a new model.
Personally, my use case would be to run the entire collection through it sequentially in the background, and to run the whole thing all over again when there is a new model.
This is my use case as well. I also don't really have a need for the web UI.
So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.
So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.
It should also add a custom tag when it is done, so another pipeline can be triggered after.
Interesting, tag chaining 😄 I will give it a thought.
So basically, a tag would be sufficient then as an MVP. Just tag all documents that you wanna OCR and paperless-gpt will put them into the job queue.
I agree, ideally customizable, because "paperless-gpt" is quite long and can clutter some paperless UIs, I'd say.
Also, how is it handled when some documents already have fields that were manually edited? Does it overwrite them? Ignore them?
Btw, Ollama 0.4 just dropped and has support for the larger llama3.2 vision models.
Hi,
I've been playing around with using ollama to generate OCR content and was wondering if you were planning on adding the ability to use a vision LLM as the OCR content source for paperless.
I am following several projects that allow using LLMs with paperless, but none so far use vision LLMs. I recently made a post in an issue for doclytics here, where I briefly outline what's currently possible.
As paperless-gpt seems to have the workflow that best fits my setup, I was wondering what you think of this. That's really the only missing aspect for me to take the jump!
Have a nice day.