PSU3D0 closed 2 months ago
indeed
was also looking at https://huggingface.co/microsoft/trocr-base-printed
https://github.com/huggingface/candle/tree/main/candle-examples/examples/trocr
which seems to be trained on small images too
been considering getting rid of tesseract altogether at some point
also ideally OCR would use the NPU/GPU (like mac m3, NVIDIA, whatever is available) i think?
more on perf: also was considering how best to trade off online vs batch OCR (depending on the user's need for real-time data)
also this https://huggingface.co/adept/fuyu-8b is more for screenshot things
keep thinking the best would be multimodal models that are good at screens + multimodal RAG
what do you think about using raw models like trocr
instead of tesseract?
@louis030195 I think supporting multiple providers via an interface pattern would be best. Tesseract has been around for a long time. What does performance on trocr look like?
good idea, yes the plan is to support multiple options for different parts of the project
don't know about trocr performance, it seems to be the most popular OCR model atm
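A rough sketch of what the "multiple providers via an interface" idea could look like. Everything here is hypothetical (`OcrProvider`, `OcrRegistry`, `EchoProvider` are made-up names, not anything in the project); a real backend would wrap tesseract or trocr behind `recognize`.

```python
from typing import Protocol


class OcrProvider(Protocol):
    """Hypothetical interface: any OCR backend turns image bytes into text."""

    name: str

    def recognize(self, image: bytes) -> str: ...


class EchoProvider:
    """Stand-in backend for illustration only; it does no real OCR."""

    name = "echo"

    def recognize(self, image: bytes) -> str:
        return f"<{len(image)} bytes>"


class OcrRegistry:
    """Keeps providers by name so callers can pick a backend at runtime."""

    def __init__(self) -> None:
        self._providers: dict[str, OcrProvider] = {}

    def register(self, provider: OcrProvider) -> None:
        self._providers[provider.name] = provider

    def recognize(self, provider_name: str, image: bytes) -> str:
        if provider_name not in self._providers:
            raise KeyError(f"unknown OCR provider: {provider_name}")
        return self._providers[provider_name].recognize(image)
```

With something like this, swapping tesseract for another model would mostly mean registering a different provider, and the calling code stays the same.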
dropping this
https://www.youtube.com/watch?v=C02MC-uCtoI
noticed tesseract was using a lot of cpu
changed a few things (using mp3 encoding instead of wav for 10x less storage) and don't see tesseract in cpu usage anymore somehow
Tesseract was trained on documents. Passing an entire 1920x1080 screenshot will yield dubious performance.
Screenshots should be split into 3:4 aspect ratio images, run in batches through tesseract, then recombined. This should result in much better OCR performance.
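A minimal sketch of the split/recombine step, with some assumptions: 3:4 is read as width:height (portrait, like a document page), tiles are full-height, and the last tile is clamped to the image edge so it may overlap the previous one. `tile_boxes` and `ocr_tiled` are hypothetical names, and `ocr_tile` is a placeholder for whatever actually crops the box and calls tesseract.

```python
import math
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, ratio_w: int = 3, ratio_h: int = 4) -> List[Box]:
    """Split a width x height image into full-height tiles of ratio_w:ratio_h."""
    tile_w = min(width, max(1, height * ratio_w // ratio_h))
    count = math.ceil(width / tile_w)
    boxes = []
    for i in range(count):
        # Clamp the last tile so it stays inside the image; it may overlap its neighbour.
        left = min(i * tile_w, width - tile_w)
        boxes.append((left, 0, left + tile_w, height))
    return boxes


def ocr_tiled(width: int, height: int, ocr_tile: Callable[[Box], str]) -> str:
    """Run the supplied OCR callable on each tile and join the text left to right."""
    return "\n".join(ocr_tile(box) for box in tile_boxes(width, height))


# A 1920x1080 screenshot gives three 810x1080 tiles, the last one overlapping.
print(tile_boxes(1920, 1080))
```

The naive join is just for illustration: text that straddles a tile boundary gets cut, and the overlapping last tile can repeat words, so a real recombine step would probably need to dedupe using word bounding boxes.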