mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Make é part of the german vocab #1141

Closed DrRSatzteil closed 1 year ago

DrRSatzteil commented 1 year ago

🚀 The feature

Currently the character 'é' is not part of the German vocab. However, it is nevertheless used in some German words (usually of French origin). According to the German Wikipedia article, it is the most common letter in German that is not part of the German alphabet (0.01%).

Examples: Coupé, Soufflés, Varieté

Motivation, pitch

I have a couple of words with this letter in my training set. Training fails unless I manually add this letter to the German vocab.

Alternatives

No response

Additional context

No response

felixdittrich92 commented 1 year ago

Hi @DrRSatzteil :wave:,

thanks for opening this issue. But I think we should keep the original language-specific vocabularies :smiley:

What you can do:
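A minimal sketch of that approach, extending the vocab locally when building the recognition model rather than changing the shipped German vocab (DrRSatzteil describes his exact variant further down in this thread):

```python
from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn

# Extend the German vocab locally instead of patching doctr itself.
model = crnn_vgg16_bn(pretrained=False, vocab=VOCABS["german"] + "é")
```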

DrRSatzteil commented 1 year ago

Hi @felixdittrich92

that's fine for me and I can understand this decision. I have already implemented the change you proposed so I could continue with my training set.

It's a special case, and some might argue that it is plain wrong to use this letter in German text. However, there are of course also names that contain it, so in practice it is unavoidable that it will show up.

But then again this is an open source project and I can just implement this change for myself 😅

Thank you for your kind reply!

xReniar commented 1 year ago

Hello, I'm new here and I tried to add a new character, but I'm getting this error when starting recognition training:

```
Traceback (most recent call last):
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/references/recognition/train_pytorch.py", line 469, in <module>
    main(args)
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/references/recognition/train_pytorch.py", line 380, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, mb, amp=args.amp)
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/references/recognition/train_pytorch.py", line 118, in fit_one_epoch
    train_loss = model(images, targets)["loss"]
  File "/home/rainer/Code/Tirocinio/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/doctr/models/recognition/crnn/pytorch.py", line 233, in forward
    out["loss"] = self.compute_loss(logits, target)
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/doctr/models/recognition/crnn/pytorch.py", line 191, in compute_loss
    ctc_loss = F.ctc_loss(
  File "/home/rainer/Code/Tirocinio/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2632, in ctc_loss
    return torch.ctc_loss(
RuntimeError: Expected tensor to have size at least 45 at dimension 1, but got size 32 for argument #2 'targets' (while checking arguments for ctc_loss_cpu)
```

@DrRSatzteil is it possible to know how you managed to add a new character to the vocab?

DrRSatzteil commented 1 year ago

Hi,

I added it like this:

```python
from doctr.datasets import vocabs
from doctr.models import crnn_vgg16_bn

ADDITIONAL_VOCAB = '§ñéç'

model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)
```

You can check out my project here: https://github.com/DrRSatzteil/metadatamagic/blob/main/metadatamagic/analysis/documentanalyser.py where I played around a bit. It's by no means finished software (and it most likely never will be), but you can have a look there to see how I did it.

xReniar commented 1 year ago

Thank you very much for your answer, but I thought you were adding a new vocab when training a model.

DrRSatzteil commented 1 year ago

Oh sorry, yes of course...

I don't have the code right here, I'll post an update when I'm on the machine on which I performed the training.

xReniar commented 1 year ago

Thank you, it would help me a lot!!

DrRSatzteil commented 1 year ago

As far as I can see, the only change I made was to add a new vocab to the datasets/vocabs.py file:

```python
VOCABS["germancustom"] = VOCABS["german"] + "§ñéç"
```

And then reference it when starting the training:

```
python references/recognition/train_pytorch.py crnn_vgg16_bn --epochs 15 --vocab germancustom --pretrained --train_path /some/path/to/trainset --val_path /some/path/to/valset
```

Talhaz commented 1 year ago

I'm also dealing with German letter recognition. Can you explain how to use the vocab, or do you have any trained models?

xReniar commented 1 year ago

I tried to add my custom lang like this (the added character is a space):

```python
VOCABS["lang_with_space"] = VOCABS["latin"] + "°" + "àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ" + VOCABS["currency"] + " "
```

I start the training:

```
python3 train_pytorch.py crnn_vgg16_bn --train_path=/home/rainer/Code/Tirocinio/fineTuner/sroie/docTR/TextRecognition_train/ --val_path=/home/rainer/Code/Tirocinio/fineTuner/sroie/docTR/TextRecognition_val/ --vocab=lang_with_space --pretrained
```

And I get this error:

```
RuntimeError: Error(s) in loading state_dict for CRNN:
    size mismatch for linear.weight: copying a param with shape torch.Size([124, 256]) from checkpoint, the shape in current model is torch.Size([125, 256]).
    size mismatch for linear.bias: copying a param with shape torch.Size([124]) from checkpoint, the shape in current model is torch.Size([125]).
```

Did you have the same problem? I saw another person with the same issue, but in his case he wasn't training, he was just running inference.
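(For context, the size mismatch comes from the model's final linear layer, whose output size depends on the vocab length, so a checkpoint trained on the original vocab cannot be loaded 1:1 into a model built with a larger vocab. A rough sketch of one possible workaround, loading only the compatible layers; the checkpoint path here is hypothetical and this is not the fix referenced later via #1133:)

```python
import torch
from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn

# One extra character in the vocab -> the classification head grows by one class.
custom_vocab = VOCABS["latin"] + "°" + "àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ" + VOCABS["currency"] + " "

# Build the model untrained so its head matches the new vocab size ...
model = crnn_vgg16_bn(pretrained=False, vocab=custom_vocab)

# ... then load only the layers whose shapes still match from a checkpoint trained on the
# original vocab; the head ("linear.weight" / "linear.bias") stays freshly initialized.
state_dict = torch.load("crnn_vgg16_bn_old_vocab.pt", map_location="cpu")
compatible = {k: v for k, v in state_dict.items() if k not in ("linear.weight", "linear.bias")}
model.load_state_dict(compatible, strict=False)
```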

DrRSatzteil commented 1 year ago

First of all: I don't think it makes sense to add spaces to the vocab; I don't see how they could be identified as such in a document (there are spaces everywhere on a page).

Second: I'm afraid I don't remember whether I ran into exactly this error; I just remember that the error messages never really gave me a hint as to what could be wrong, which is a bit frustrating.

In fact, I still have a trained model which worked reasonably well. I guess there's still room for improvement, but I can share it with you if you let me know where I can drop it.

xReniar commented 1 year ago

Yes, you are right, adding a space makes no sense. I'm trying to train docTR with the ICDAR2019 dataset, and the ground truth contains a lot of spaces, so I thought I could just add the space manually instead of modifying the ground truth.

It could be helpful to use your trained model; @Talhaz needs it too. Is it possible to contact you through LinkedIn?

DrRSatzteil commented 1 year ago

Oh hold on it's right there:

https://github.com/DrRSatzteil/metadatamagic/tree/main/metadatamagic/dist/models

Go for it 😊

xReniar commented 1 year ago

Thank you very much for your help!!!

xReniar commented 1 year ago

@DrRSatzteil I have managed to train crnn_vgg16_bn by adding a custom dictionary and modifying another file, as explained in #1133, so my problem is solved. Now I have a question about the code you sent me:

```python
model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)
```

I was thinking of using ocr_predictor() (but only modifying the recognition predictor, so that I can set pretrained=False there), and my question is: how do you use crnn_vgg16_bn()?

DrRSatzteil commented 1 year ago

You can find the code that I finally used here: https://github.com/DrRSatzteil/metadatamagic/blob/main/metadatamagic/analysis/documentanalyser.py

```python
import torch

from doctr.datasets import vocabs
from doctr.io import DocumentFile
from doctr.models import crnn_vgg16_bn, ocr_predictor

# Build the recognition model with the extended vocab.
model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)

# find_recognition_model() is a helper from my project that locates the trained weights;
# document.pdf also comes from my project's own classes.
model_file = find_recognition_model()
if model_file:
    model.load_state_dict(torch.load(model_file))

predictor = ocr_predictor(
    reco_arch=model, pretrained=True, detect_language=True)

doc = DocumentFile.from_pdf(document.pdf, scale=4)
result = predictor(doc)
```
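(With this setup only the recognition part is swapped out: ocr_predictor still loads the default pretrained detection model, while the custom CRNN with its extended vocab handles text recognition.)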
xReniar commented 1 year ago

Thank you, I have solved my problems!!

ffalkenberg commented 1 year ago

@xReniar Can you upload your model and share it?

xReniar commented 1 year ago

@ffalkenberg sure, I'm going to send it to your Outlook email.

felixdittrich92 commented 1 year ago

@xReniar Would you maybe like to share it so more people can benefit? :) https://mindee.github.io/doctr/using_doctr/sharing_models.html#pushing-to-the-huggingface-hub

ffalkenberg commented 1 year ago

For illustration, consider the following image showing a comparison of the default crnn_vgg16_bn vs. DrRSatzteil's model:

[image]

xReniar commented 1 year ago

@felixdittrich92, you're right. This is the link to my model: https://huggingface.co/Reniar/doctr-crnn_vgg16_bn-space
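Following the sharing docs linked above, the model should be loadable straight from the hub, roughly like this (a sketch, untested with this particular repo):

```python
from doctr.models import ocr_predictor, from_hub

# Load the shared recognition model from the Hugging Face hub ...
reco_model = from_hub("Reniar/doctr-crnn_vgg16_bn-space")

# ... and plug it into the standard pipeline (detection stays the default pretrained model).
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)
```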

felixdittrich92 commented 1 year ago

@DrRSatzteil @ffalkenberg @xReniar

I've had a dataset in mind for some time now that covers the complete European vocabulary (optimally also partly handwritten), but I don't have the raw images / PDFs at the moment. If you would be interested in helping out, please let me know. I could probably do the labeling via Azure Document AI (~150k-200k pages, no synthetically generated data).

ffalkenberg commented 1 year ago

@felixdittrich92 Absolutely great idea! I'd love to help. Where do you envision sourcing the data from?

felixdittrich92 commented 1 year ago

@ffalkenberg That's the question :) Maybe searching for open document archives (newspapers, magazines, ...), HF datasets, scraping some websites (https://github.com/simonw/shot-scraper), ...

DrRSatzteil commented 1 year ago

I just used a few hundred private documents (mostly invoices and the like) to train my model. Unfortunately (but also obviously) I cannot share these publicly, so I don't have a good source for training documents either.

Back then I searched for a freely accessible German dataset but wasn't able to find anything suitable. You can find huge collections of old books from museums, but the typography of these old documents doesn't have much in common with modern texts, so it doesn't really make sense to use them.

felixdittrich92 commented 1 year ago

@DrRSatzteil @ffalkenberg Some freely available magazines in different languages seem to be a good source (lots of different fonts, backgrounds, ...).

Yeah, I think we can start to experiment with much less data (~5k pages, excluding handwritten).

ffalkenberg commented 1 year ago

Great idea to incorporate newspapers! Quick question – by 'labeling using Azure,' do you mean using Azure's auto-labeling capabilities?

felixT2K commented 1 year ago

Correct, everything I need would be raw PDFs or images which I can feed through Azure's Document AI. Afterwards I can convert the output into doctr-conformant labels to create a dataset. :)

DrRSatzteil commented 1 year ago

Still, I think it might be hard to get free PDF files of newspapers. Even the news providers I pay for don't offer this service. I actually found some sites that offer a large selection of magazines, but they all seem semi-legal at best, so I would not want to use those.

felixT2K commented 1 year ago

https://pdf-giant.top/german/ for example, but yes, you are right @DrRSatzteil, I need to check this carefully.

DrRSatzteil commented 1 year ago

Hey, they at least claim to have the right to offer these magazines. However, since I don't understand the business model (buy commercial licenses for magazines just to offer them for free?), I doubt that even half of this is true...

felixT2K commented 1 year ago

Haha, ok, if you try to download you are redirected to a payment page ^^

DrRSatzteil commented 1 year ago

I found this for example:

https://pdf-magazines-download.com

But it's definitely not a valid option 😅

felixT2K commented 1 year ago

> I found this for example:
>
> https://pdf-magazines-download.com
>
> But it's definitely not a valid option 😅

Same here, when I click download it opens the same payment page 😅

DrRSatzteil commented 1 year ago

You can also download for free, but it seems like it's one of the usual dubious file-hoster business models.