DrRSatzteil closed this issue 1 year ago.
Hi @DrRSatzteil :wave:,
thanks for opening this issue. But I think we should keep the original language-specific vocabularies :smiley:
What you can do: edit the vocab locally, then install doctr in editable mode so the change takes effect:
```
cd /doctr
pip install -e .  # -e (editable install)
```
Then train with `--vocab=german`.
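To make that concrete, a minimal sketch of the local edit (assuming the VOCABS dict in doctr/datasets/vocabs.py and the 'é' character from this issue):

```python
# In doctr/datasets/vocabs.py (local, editable install): extend the existing
# entry in place so that --vocab=german picks up the extra character.
VOCABS["german"] = VOCABS["german"] + "é"
```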
(If you push the model to the HF hub, it will keep your custom vocab after loading.)

Hi @felixdittrich92,
that's fine for me and I can understand this decision. I had already implemented the change you proposed, so I could continue with my training set.
It's a special case, and some might argue that it is plain wrong to use this letter in a German text. However, there are of course also names that contain this letter, so in practice it is unavoidable that it will occur.
But then again, this is an open source project and I can just implement this change for myself 😅
Thank you for your kind reply!
Hello, I'm new here. I tried to add a new character, but I'm getting this error when starting recognition training:
```
Traceback (most recent call last):
  File "/home/rainer/Code/Tirocinio/OCRScript/Doctr/doctr/references/recognition/train_pytorch.py", line 469, in
```
@DrRSatzteil is it possible to know how you managed to add a new character to the vocab?
Hi,
I added it like this:
```python
ADDITIONAL_VOCAB = '§ñéç'
model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)
```
You can check out my project here, where I played around a bit: https://github.com/DrRSatzteil/metadatamagic/blob/main/metadatamagic/analysis/documentanalyser.py. It's by no means finished software (and it most likely never will be), but you can have a look there to see how I did it.
Thank you very much for your answer, but I thought you were adding a new vocab when training a model.
Oh sorry, yes of course...
I don't have the code right here, I'll post an update when I'm on the machine on which I performed the training.
Thank you, it would help me a lot!!
As far as I can see, the only change I made was to add a new vocab entry to the datasets/vocabs.py file:
```python
VOCABS["germancustom"] = VOCABS["german"] + "§ñéç"
```
And then reference it when starting the training:
```
python references/recognition/train_pytorch.py crnn_vgg16_bn --epochs 15 --vocab germancustom --pretrained --train_path /some/path/to/trainset --val_path /some/path/to/valset
```
I'm also dealing with German letter recognition. Can you help me with how to use the vocab, or do you have any trained models?
I tried to add my custom language like this (the added character is a space):
```python
VOCABS["lang_with_space"] = VOCABS["latin"] + "°" + "àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ" + VOCABS["currency"] + " "
```
I start the training:
```
python3 train_pytorch.py crnn_vgg16_bn --train_path=/home/rainer/Code/Tirocinio/fineTuner/sroie/docTR/TextRecognition_train/ --val_path=/home/rainer/Code/Tirocinio/fineTuner/sroie/docTR/TextRecognition_val/ --vocab=lang_with_space --pretrained
```
And I get this error:
```
RuntimeError: Error(s) in loading state_dict for CRNN:
    size mismatch for linear.weight: copying a param with shape torch.Size([124, 256]) from checkpoint, the shape in current model is torch.Size([125, 256]).
    size mismatch for linear.bias: copying a param with shape torch.Size([124]) from checkpoint, the shape in current model is torch.Size([125]).
```
Did you have the same problem? I saw another person who had the same problem, but in his case he wasn't training, he was just doing inference.
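(For context: the size mismatch comes from the final classification layer, whose output size depends on the vocab length. Adding one character, here the space, grows the head from 124 to 125 outputs, so the stock checkpoint loaded by --pretrained no longer fits. A minimal workaround sketch, not the fix used later in this thread (see #1133), assuming a local PyTorch checkpoint file:)

```python
import torch

from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn

# Build the model with the extended vocab (125 output classes)
model = crnn_vgg16_bn(pretrained=False, vocab=VOCABS["lang_with_space"])

# "checkpoint.pt" is a hypothetical path to weights trained with the stock vocab
state_dict = torch.load("checkpoint.pt", map_location="cpu")
# Drop the mismatched head parameters (124 vs. 125 classes) and load the rest
for key in ("linear.weight", "linear.bias"):
    state_dict.pop(key, None)
model.load_state_dict(state_dict, strict=False)
```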
First of all: I don't think it makes sense to add spaces to the vocab. I don't see how these could be identified as such in a document (there are lots of spaces everywhere on a page).
Second: I'm afraid I don't remember whether I stumbled across exactly that error. I just remember that the error messages never really gave me a hint about what could be wrong, which is a bit frustrating.
In fact, I still have a trained model which worked reasonably well. I guess there's still room for improvement, but I can share it with you if you let me know where I can drop it?!
Yes, you are right, adding space makes no sense. I'm trying to train docTR with the ICDAR2019 dataset, and the ground truth contains a lot of spaces, so I thought I could just add it manually instead of modifying the ground truth.
It would be helpful to use your trained model; @Talhaz needs it too. Is it possible to contact you through LinkedIn?
Oh hold on, it's right there:
https://github.com/DrRSatzteil/metadatamagic/tree/main/metadatamagic/dist/models
Go for it 😊
Thank you very much for your help!!!
@DrRSatzteil I have managed to train crnn_vgg16_bn by adding a custom dictionary and modifying another file, as explained in #1133. So my problem is solved; now I have a question about the code you sent me.
This code:
```python
model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)
```
I was thinking of using ocr_predictor() (but only modifying the reco predictor so that I can set pretrained=False), and my question is: how do you use crnn_vgg16_bn()?
You can find the code that I finally used here: https://github.com/DrRSatzteil/metadatamagic/blob/main/metadatamagic/analysis/documentanalyser.py
```python
import torch

from doctr.datasets import vocabs
from doctr.io import DocumentFile
from doctr.models import crnn_vgg16_bn, ocr_predictor

ADDITIONAL_VOCAB = '§ñéç'

model = crnn_vgg16_bn(pretrained=False, vocab=vocabs.VOCABS['german'] + ADDITIONAL_VOCAB)
model_file = find_recognition_model()  # project helper that locates the trained weights file
if model_file:
    model.load_state_dict(torch.load(model_file))
predictor = ocr_predictor(reco_arch=model, pretrained=True, detect_language=True)
doc = DocumentFile.from_pdf(document.pdf, scale=4)  # `document` comes from the project code
result = predictor(doc)
```
Thank you, I have solved my problems!!
@xReniar Can you upload your model and share it?
@ffalkenberg sure, I'm going to send it to your Outlook email.
@xReniar Maybe you want to share it so more people can benefit? :) https://mindee.github.io/doctr/using_doctr/sharing_models.html#pushing-to-the-huggingface-hub
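Following the linked sharing guide, pushing roughly looks like this (a sketch based on the docs; the model name is just an example and the exact signature is worth checking against your installed version):

```python
from doctr.datasets import VOCABS
from doctr.models import crnn_vgg16_bn
from doctr.models.factory import push_to_hf_hub

# Rebuild the model with the same custom vocab used for training,
# then load your trained weights into `model` before pushing
model = crnn_vgg16_bn(pretrained=False, vocab=VOCABS["lang_with_space"])

# Requires being logged in via `huggingface-cli login`
push_to_hf_hub(model, model_name="doctr-crnn_vgg16_bn-space", task="recognition", arch="crnn_vgg16_bn")
```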
For illustration, consider the following image: a comparison of the default crnn_vgg16_bn vs. DrRSatzteil's model. [comparison image]
@felixdittrich92, you're right. This is the link of my model: https://huggingface.co/Reniar/doctr-crnn_vgg16_bn-space
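For anyone who wants to try it, a shared model like this can be loaded back through the hub integration described in the sharing docs (a sketch; the repo name is taken from the link above):

```python
from doctr.io import DocumentFile
from doctr.models import from_hub, ocr_predictor

# Load the shared recognition model by its hub repo name
reco_model = from_hub("Reniar/doctr-crnn_vgg16_bn-space")
# Combine it with a default detection model
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)

doc = DocumentFile.from_images(["sample.jpg"])  # hypothetical input image
result = predictor(doc)
```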
@DrRSatzteil @ffalkenberg @xReniar
I've had a dataset in my head for some time now that covers the complete European vocabulary (optimally also partly handwritten), but I don't have the raw images / PDFs atm. If you would be interested in helping out, please let me know. I could probably do the labeling via Azure Document AI (~150k-200k pages) (no synthetically generated data).
@felixdittrich92 Absolutely great idea! I'd love to help. Where do you envision sourcing the data from?
@ffalkenberg That's the question :) Maybe searching open document archives (newspapers, magazines, ...), HF datasets, scraping some websites (https://github.com/simonw/shot-scraper), ...
I just used a few hundred private documents (mostly invoices and the like) for the training of my model. Unfortunately (but also obviously) I cannot share these publicly, so I don't have a good source for training documents either.
Back then I searched for a freely accessible German dataset but wasn't able to find anything suitable. You can find huge collections of old books from museums, but the typography of these old documents doesn't have much in common with modern texts, so it doesn't really make sense to use them.
@DrRSatzteil @ffalkenberg Some freely available magazines in different languages seem to be a good source (lots of different fonts, backgrounds, ...).
Yeah, I think we can start experimenting with much less data (~5k pages, excluding handwritten).
Great idea to incorporate newspapers! Quick question: by 'labeling using Azure', do you mean using Azure's auto-labeling capabilities?
Correct. Everything I need would be raw PDFs or images which I can feed through Azure's Document AI. Afterwards I can convert the output to doctr-conformant labels to create a dataset. :)
Still, I think it might be hard to get free PDF files of newspapers. Even the news providers I pay for don't offer this service. I actually found some sites that offer a large selection of magazines, but they all seem to be semi-legal at best, so I would not want to use those.
https://pdf-giant.top/german/ for example, but yes, you are right @DrRSatzteil, I need to check this carefully.
Hey, at least they claim to have the right to offer these magazines. However, since I don't understand the business model (buy commercial licenses for magazines just to offer them for free?), I doubt that even half of this is true...
Haha, OK, if you try to download you are redirected to a payment page ^^
I found this for example:
https://pdf-magazines-download.com
But it's definitely not a valid option 😅
Same here, when you click download it opens the same payment page 😅
You can also download for free, but it seems like it's one of the usual dubious file-hoster business models.
🚀 The feature
Actually, the character 'é' is not officially part of the German alphabet. However, it is used in some German words nevertheless (usually of French origin). According to the German Wikipedia article, it is the most common letter in German that is not part of the German alphabet (relative frequency 0.01%).
Examples: Coupé, Soufflés, Varieté
Motivation, pitch
I have a couple of words with this letter in my training set. Training fails unless I add this letter to the German vocab manually.