[ingest] Improve reliability of Tesseract OCR by image splitting

mediar-ai / screenpipe

rewind.ai x cursor.com = your AI assistant that has all the context. 24/7 screen & voice recording for the age of super intelligence. get your data ready or be left behind

https://screenpi.pe

MIT License


[ingest] Improve reliability of Tesseract OCR by image splitting #22

Closed: PSU3D0 closed this 2 months ago

PSU3D0 commented 4 months ago

Tesseract was trained on documents. Passing an entire 1920x1080 screenshot will yield dubious results.

Screenshots should be split into 3:4 aspect ratio images, run in batches through Tesseract, then recombined. This should result in much better OCR performance.
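A minimal sketch of the splitting step, assuming "3:4" means portrait tiles (3 wide by 4 tall) that span the full screenshot height. The names `tile_boxes`, `ocr_in_tiles`, and the `ocr_tile` callback are hypothetical and not part of screenpipe; the OCR call itself is passed in, so nothing here depends on a particular engine.

```python
from math import ceil
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(width: int, height: int, aspect_w: int = 3, aspect_h: int = 4) -> List[Box]:
    """Cover a width x height image with full-height tiles of aspect_w:aspect_h."""
    # Tile width that gives the requested aspect ratio at full image height.
    tile_w = min(width, height * aspect_w // aspect_h)
    count = ceil(width / tile_w)
    if count == 1:
        return [(0, 0, width, height)]
    # Spread tile origins evenly so neighbouring tiles overlap a little
    # instead of leaving a narrow leftover strip at the right edge.
    step = (width - tile_w) / (count - 1)
    return [(round(i * step), 0, round(i * step) + tile_w, height) for i in range(count)]


def ocr_in_tiles(width: int, height: int, ocr_tile: Callable[[Box], str]) -> str:
    """Run the supplied OCR callback on each tile, left to right, and join the text."""
    return "\n".join(ocr_tile(box) for box in tile_boxes(width, height))
```

For a 1920x1080 screenshot this yields three 810x1080 tiles. With Pillow and pytesseract, the callback could be something like `lambda box: pytesseract.image_to_string(image.crop(box))`. Because the tiles overlap, text near a seam may appear twice, so a real recombination step would likely need to deduplicate it.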

louis030195 commented 4 months ago

indeed

was also looking at https://huggingface.co/microsoft/trocr-base-printed

https://github.com/huggingface/candle/tree/main/candle-examples/examples/trocr

which seems to be trained on small images too

been considering getting rid of tesseract altogether at some point

also ideally OCR would use NPU (like mac m3, NVIDIA, whatever is here) i think?

more perf-wise, was also considering how best to trade off online vs. batch OCR (depending on the user's need for real-time data)

also this https://huggingface.co/adept/fuyu-8b is more for screenshots

keep thinking the best would be multimodal models that are good at screens + multimodal RAG

what do you think about using raw models like trocr instead of tesseract?

louis030195 commented 4 months ago

https://github.com/robertknight/ocrs

PSU3D0 commented 4 months ago

@louis030195 I think supporting multiple providers via an interface pattern would be best. Tesseract has been around for a long time. What does performance on trocr look like?
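A rough sketch of what that interface pattern could look like, written in Python for brevity. Everything here (`OcrProvider`, `EchoProvider`, `register`, `run_ocr`) is a hypothetical name for illustration, not an existing screenpipe API; a real provider would wrap Tesseract, TrOCR, or another engine behind the same method.

```python
from typing import Dict, Protocol


class OcrProvider(Protocol):
    """Anything that can turn encoded image bytes into text."""

    def recognize(self, image_bytes: bytes) -> str: ...


class EchoProvider:
    """Stand-in provider for illustration only; it does no real OCR."""

    def __init__(self, label: str) -> None:
        self.label = label

    def recognize(self, image_bytes: bytes) -> str:
        return f"{self.label}:{len(image_bytes)} bytes"


# Registry mapping a provider name (e.g. from config) to an implementation.
PROVIDERS: Dict[str, OcrProvider] = {}


def register(name: str, provider: OcrProvider) -> None:
    PROVIDERS[name] = provider


def run_ocr(name: str, image_bytes: bytes) -> str:
    """Dispatch to the named provider, failing clearly on unknown names."""
    try:
        provider = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown OCR provider: {name}") from None
    return provider.recognize(image_bytes)
```

The calling code only sees `run_ocr`, so swapping engines becomes a matter of registering a different provider and changing the name passed in.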

louis030195 commented 4 months ago

good idea, yes, the plan is to support multiple options for different parts of the project

don't know about trocr perf, seems to be the most popular ocr model atm

dropping this

https://www.youtube.com/watch?v=C02MC-uCtoI

noticed tesseract was using a lot of cpu

changed a few things (using mp3 encoding instead of wav for 10x less storage) and don't see tesseract in cpu usage anymore somehow