mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Multilingual support #1699

Open decadance-dance opened 3 weeks ago

decadance-dance commented 3 weeks ago

🚀 The feature

Support for multiple languages (i.e. VOCABS["multilingual"]) in the pretrained models.
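For context, a minimal sketch of what this could look like with the current API, assuming multilingual checkpoints existed. If I read the model factories right, passing a non-default vocab keeps the pretrained backbone but re-initializes the classification head, so such a model still needs fine-tuning on multilingual data:

```python
from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn, ocr_predictor

# Request a recognition model whose charset covers the multilingual vocab.
# With a non-default vocab, the pretrained classification head weights are
# skipped and re-initialized, so the model still needs multilingual fine-tuning.
reco_model = crnn_vgg16_bn(pretrained=True, vocab=VOCABS["multilingual"])

# Plug it into a full OCR pipeline alongside a pretrained detector
predictor = ocr_predictor(det_arch="db_resnet50", reco_arch=reco_model, pretrained=True)
```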

Motivation, pitch

It would be great to have models which support multiple languages, because this would significantly improve the user experience in many cases.

Alternatives

No response

Additional context

No response

felixdittrich92 commented 3 weeks ago

Hi @decadance-dance :wave:,

Have you already tried these? :)

docTR: https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1
OnnxTR: https://huggingface.co/Felix92/onnxtr-parseq-multilingual-v1
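Loading the torch variant should work roughly like this (a sketch based on docTR's Hugging Face hub integration; check the model card for the exact snippet, and the input file name here is hypothetical):

```python
from doctr.io import DocumentFile
from doctr.models import from_hub, ocr_predictor

# Pull the community multilingual PARSeq recognition model from the HF hub
reco_model = from_hub("Felix92/doctr-torch-parseq-multilingual-v1")

# Pair it with a pretrained detection model
predictor = ocr_predictor(det_arch="db_resnet50", reco_arch=reco_model, pretrained=True)

doc = DocumentFile.from_images("page.jpg")  # hypothetical input image
result = predictor(doc)
print(result.render())
```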

It depends a bit on whether there is any data from mindee we could use. Question goes to @odulcy-mindee ^^

decadance-dance commented 3 weeks ago

Hi @felixdittrich92, I have used docTR for more than half a year but never came across this multilingual model, lol. So I am gonna try it, thanks.

felixdittrich92 commented 3 weeks ago

Ah, let's keep this issue open, there is more to do I think :)

felixdittrich92 commented 3 weeks ago

> Hi @felixdittrich92, I have used docTR for more than half a year but never came across this multilingual model, lol. So I am gonna try it, thanks.

Happy about any feedback on how it works for you :) The model was fine-tuned only on synthetic data.

odulcy-mindee commented 2 weeks ago

> It depends a bit on whether there is any data from mindee we could use. Question goes to @odulcy-mindee ^^

Unfortunately, we don't have such data.

felixdittrich92 commented 2 weeks ago

@decadance-dance For training such recognition models I don't see a problem: we can generate synthetic training data and, in the best case, only need real validation samples. But for detection we would need real data; that's the main issue.
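As an illustration of the synthetic route, docTR already ships a WordGenerator dataset; something along these lines (fonts, lengths, and sample count are just assumptions) could render training crops for any vocab:

```python
from doctr.datasets import VOCABS, WordGenerator

# Render random word crops from the multilingual charset; only the validation
# set would then need real, manually checked samples.
train_set = WordGenerator(
    vocab=VOCABS["multilingual"],
    min_chars=1,
    max_chars=15,
    num_samples=100_000,
    font_family=["FreeMono.ttf", "FreeSans.ttf"],  # any fonts covering the charset
)

img, target = train_set[0]  # a rendered crop and its ground-truth string
```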

In general, we would need the help of the community to collect documents (newspapers, receipt photos, etc.) in diverse languages (they can be unlabeled). This would require signing a license so that we can freely use the data. With enough diverse data we could then use Azure Doc AI, for example, to pre-label it. Later on, I wouldn't see an issue with open-sourcing this dataset.

But I'm not sure how to trigger such an "event" :sweat_smile: @odulcy-mindee

nikokks commented 1 week ago

Hello =) I found some public datasets for various tasks: English documents, mathematics documents, LaTeX OCR (two datasets), and Chinese OCR (three datasets).

nikokks commented 1 week ago

Moreover, it could be interesting for Chinese detection models to combine multiple recognition samples into the same image without intersections. This should help a Chinese detection model perform better without real detection data. Anyone interested in creating random multilingual data (Hindi, Chinese, etc.) for detection models?
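A rough sketch of that composition idea (plain PIL; all names here are hypothetical): paste rendered word crops onto a blank page at non-overlapping positions and keep the boxes as detection ground truth.

```python
import random
from PIL import Image

def compose_page(crops, page_size=(1024, 1024), max_tries=50):
    """Paste word crops (each smaller than the page) without overlaps;
    return the page image and the placed boxes as detection ground truth."""
    page = Image.new("RGB", page_size, "white")
    boxes = []
    for crop in crops:
        w, h = crop.size
        for _ in range(max_tries):
            x = random.randint(0, page_size[0] - w)
            y = random.randint(0, page_size[1] - h)
            box = (x, y, x + w, y + h)
            # keep the placement only if it intersects no previously placed box
            if all(box[2] <= bx or bx2 <= box[0] or box[3] <= by or by2 <= box[1]
                   for (bx, by, bx2, by2) in boxes):
                page.paste(crop, (x, y))
                boxes.append(box)
                break  # crops that never fit within max_tries are skipped
    return page, boxes
```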

felixdittrich92 commented 1 week ago

Hi @nikokks 😃 Recognition should not be such a big deal; I have already found a good way to generate such data for fine-tuning.

Collecting multilingual data for detection is troublesome because it should be real data (or, if possible, really well-generated data, for example maybe with a fine-tuned FLUX model!?). We need different kinds of layouts/documents (newspapers, invoices, receipts, cards, etc.), so the data should come close to real use cases (not only scans but also document photos, etc.) :)