mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Recognition Training with Tensorflow #1006

Closed: zahidetastan closed this issue 2 years ago

zahidetastan commented 2 years ago

Bug description

I want to use docTR to train a recognition model on Turkish-language data. I created a dataset as specified in the documentation, and I also added a vocab for Turkish in vocabs.py: the Turkish vocab consists of VOCABS['english'] plus some Turkish characters, e.g. i, ö, ü, İ. As a result, some Turkish special characters appear in the training images and in the dataset's JSON labels file. When I start training on this dataset, I get the error below. As can be seen in the target section of the traceback, the Turkish characters are not read as they appear in the JSON file, for example the capital İ in "TARİH: 24.08.2021" and "NAKİT".
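
For reference, the line I added to vocabs.py is essentially the following (a sketch reconstructed from the vocab debug print in the traceback below):

VOCABS['turkish'] = VOCABS['english'] + 'iİöÖÇçÜü'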

And this is not specific to Turkish. When I start training for Portuguese, a dataset containing characters that match the entries in the VOCAB file is similarly rejected.

(screenshot of the error)

Can someone explain why I am getting this error and what needs to change? How can I train on a dataset containing these special characters?

Code snippet to reproduce the bug

Here is my training script

python references/recognition/train_tensorflow.py crnn_vgg16_bn --train_path references/doctr-train --val_path references/doctr-valid --epochs 5 --vocab turkish

Error traceback

Namespace(arch='crnn_vgg16_bn', train_path='references/doctr-train', val_path='references/doctr-valid', train_samples=1000, val_samples=20, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', min_chars=1, max_chars=12, name=None, epochs=5, batch_size=64, input_size=32, lr=0.001, workers=None, resume=None, vocab='turkish', test_only=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, amp=False, find_lr=False)
b'{\r\n    "1.png": "KDV",\r\n    "2.png": "*400,00",\r\n    "3.png": "TAR\xc4\xb0H:24.08.2021",\r\n    "4.png": "NAK\xc4\xb0T",\r\n    "5.png": "TOPLAM",\r\n    "6.png": "Bakiye:"\r\n}'
Validation set loaded in 0.001001s (6 samples in 1 batches)
Train set loaded in 0.0009604s (6 samples in 0 batches)
TARÄ°H:24.08.2021 //for debug, it prints file content
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~iİöÖÇçÜü // for debug it prints my turkish vocab
Traceback (most recent call last):
  File "C:\Users\x\Desktop\training-DOCTR\doctr\doctr\references\recognition\train_tensorflow.py", line 404, in <module>
    main(args)
  File "C:\Users\x\Desktop\training-DOCTR\doctr\doctr\references\recognition\train_tensorflow.py", line 330, in main
    val_loss, exact_match, partial_match = evaluate(model, val_loader, batch_transforms, val_metric)
  File "C:\Users\x\Desktop\training-DOCTR\doctr\doctr\references\recognition\train_tensorflow.py", line 115, in evaluate
    out = model(images, targets, return_preds=True, training=False)
  File "C:\Users\x\Desktop\training-DOCTR\doctr\doctr\trainvenv\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "c:\users\x\desktop\training-doctr\doctr\doctr\doctr\models\recognition\crnn\tensorflow.py", line 229, in call
    out['loss'] = self.compute_loss(logits, target)
  File "c:\users\x\desktop\training-doctr\doctr\doctr\doctr\models\recognition\crnn\tensorflow.py", line 184, in compute_loss
    gt, seq_len = self.build_target(target)
  File "c:\users\x\desktop\training-doctr\doctr\doctr\doctr\models\recognition\core.py", line 35, in build_target
    encoded = encode_sequences(
  File "c:\users\x\desktop\training-doctr\doctr\doctr\doctr\datasets\utils.py", line 153, in encode_sequences
    for idx, seq in enumerate(map(partial(encode_string, vocab=vocab), sequences)):
  File "c:\users\x\desktop\training-doctr\doctr\doctr\doctr\datasets\utils.py", line 80, in encode_string
    raise ValueError("some characters cannot be found in 'vocab'")
ValueError: Exception encountered when calling layer "crnn" (type CRNN).

some characters cannot be found in 'vocab'

Call arguments received by layer "crnn" (type CRNN):
  • x=tf.Tensor(shape=(6, 32, 128, 3), dtype=float32)
  • target=["'KDV'", "'*400,00'", "'TARÄ°H:24.08.2021'", "'NAKÄ°T'", "'TOPLAM'", "'Bakiye:'"]
  • return_model_output=False
  • return_preds=True
  • beam_width=1
  • top_paths=1
  • kwargs={'training': 'False'}

Environment

--

Deep Learning backend

is_tf_available: True
is_torch_available: False

frgfm commented 2 years ago

Hello @zahidetastan :wave:

Sorry about this, let's try to help you! Do you have a fork with a branch that has your modifications by any chance?

My best guess is that the string encoding doesn't work with some Turkish characters (or simply that the font doesn't support these characters) :thinking: So let's check this in 3 steps:

  1. Check whether the string encoding is a problem. To do so, let's see what this piece of code outputs:

    from doctr.datasets.utils import encode_sequences
    encode_sequences(["NAKİT"], vocab=<YOUR_TURKISH_VOCAB>)
  2. Check whether the text image generation (or the font) is a problem. For this, could you try to run this piece of code and post the resulting image, please?

    import matplotlib.pyplot as plt
    from doctr.datasets.generator.base import synthesize_text_img
    plt.imshow(synthesize_text_img("NAKİT")); plt.show()
  3. Check whether the target generation is a problem:

    from doctr.datasets import CharacterGenerator, WordGenerator
    char_ds = CharacterGenerator("<YOUR_TURKISH_VOCAB>")
    print(list(map(char_ds.vocab.__getitem__, [sample[1] for sample in char_ds._data])))
    word_ds = WordGenerator("<YOUR_TURKISH_VOCAB>")
    print(word_ds._generate_string(100, 200))

Let me know if you run into any trouble! Cheers :v:

zahidetastan commented 2 years ago

Hello @frgfm 👋, First of all, thank you very much for your interest.

Here is my output of the first check, for the string encoding:

(screenshot: option1)

And here is my output of the second check, for the text image generation / font problem. I can't get any resulting image due to the error:

(screenshot: option2)

So our problem is LATIN CAPITAL LETTER I WITH DOT ABOVE (İ): UnicodeEncodeError: 'latin-1' codec can't encode character '\u0130' in position 3: ordinal not in range(256)
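
The same failure can be reproduced outside docTR (my own illustration; İ sits at position 3 of "NAKİT", and U+0130 is outside the latin-1 range):

"NAKİT".encode("latin-1")
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u0130' in position 3: ordinal not in range(256)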

frgfm commented 2 years ago

Alright, this has narrowed the problem down to the image generation! This snippet should then produce the same error:

from doctr.utils.fonts import get_font

font = get_font(None, 32)
font.getsize("İ")

If so, I can see two options:

The second aspect is quite a natural one: you can't render a string if it cannot be mapped to a font family installed in your environment. Logically, neither solution should be implemented on the docTR side, so I suggest you find a font family that allows the rendering :+1:

Hint: you haven't specified a font family, so the default one is loaded, which is quite limited. But if a common font family like Arial supports this character, you could do:

font = get_font("Arial.ttf", 32)

which should fix the problem :)
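
Once you have a font family that renders these characters, you can also point the training script at it via the --font flag (the font parameter is visible in your Namespace output; Arial here is just an example):

python references/recognition/train_tensorflow.py crnn_vgg16_bn --vocab turkish --font "Arial.ttf" --train_path references/doctr-train --val_path references/doctr-valid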

Let me know how it goes!

frgfm commented 2 years ago

Any update @zahidetastan ? :)

zahidetastan commented 2 years ago

Hello @frgfm Finding a font family that includes the Turkish characters solved the error! But: I first trained the model with the push-to-hub part commented out, and I couldn't figure out where to get the output or how to use the trained model. So I then used the push_to_hub argument to see the trained model on the hub, and in this part I get an error about libraries.

(screenshot of the import error)

felixdittrich92 commented 2 years ago

Hi @zahidetastan :wave:, i can help with this. Are you currently on the main branch, i.e. have you forked the repo to train your model?

docs: https://mindee.github.io/doctr/latest/using_doctr/sharing_models.html

Note: the hf hub integration is currently in an early state; fixes are planned for the 0.6.0 release

zahidetastan commented 2 years ago

Hi @felixdittrich92 👋, Yes, I am currently on the main branch. I fixed the login_to_hub import error by adding sys.path.append('.') before the line importing login_to_hub (sketched below). There was also a problem with doctr being imported from the venv instead of the repo folder, so I deleted the python-doctr library from my environment. After completing the training, I have a tf_model folder + a config.json file. Inside the tf_model folder there are a checkpoint, a weights.data-00000-of-00001 and a weights.index file. How should I proceed after this stage?
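
For anyone hitting the same thing, my fix at the top of the training script looked roughly like this (a sketch; the import line matches the one shown in the next comment):

import sys
sys.path.append('.')  # resolve the local doctr checkout instead of the installed package

from doctr.models import login_to_hub, push_to_hf_hub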

Thanks for all your effort and help :)

felixdittrich92 commented 2 years ago

Hi @zahidetastan :wave: ,

Pushing to hub with your own trained model:

from doctr.models import recognition, login_to_hub, push_to_hf_hub

login_to_hub()
my_awesome_model = recognition.crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)
my_awesome_model.load_weights("path-to-your-trained-model-folder/weights")

# task and arch need to match the model: in your case it is a recognition model and the arch is crnn_vgg16_bn; the model name can be freely chosen
push_to_hf_hub(my_awesome_model, model_name='doctr-crnn-vgg16-bn-turkish-v1', task='recognition', arch='crnn_vgg16_bn')

Loading example:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor, from_hub
image = DocumentFile.from_images(['data/example.jpg'])
# Load a custom detection model from huggingface hub
det_model = from_hub('Felix92/doctr-tf-db-resnet50')
# Load a custom recognition model from huggingface hub
reco_model = from_hub('Felix92/doctr-tf-crnn-vgg16-bn-french')
# You can easily plug these models into the OCR predictor
predictor = ocr_predictor(det_arch=det_model, reco_arch=reco_model)

# in your case:
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)  # load the pretrained text detection model and use the custom reco model from the hub
result = predictor(image)
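
As a side note, if you want to inspect the prediction, the Document object returned by the predictor can be exported to a nested dict:

json_output = result.export()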

PS: if your model works well on Turkish, I would be really happy to add it as the #1 community model to the list https://mindee.github.io/doctr/latest/using_doctr/sharing_models.html#pretrained-community-models

zahidetastan commented 2 years ago

Hi again @felixdittrich92 👋, First of all, I am very curious about the final performance of this model for Turkish. Of course, if the model works well, I can add it.

When completing the recognition model training with TensorFlow, I got an error like the one below, so there seems to be something missing in the config.json file.

(screenshot of the error)

It was also uploaded to the hub. Let me share the link with you.

https://huggingface.co/logo-data-science/crnn_vgg16_bn_20220831-104738/tree/main

When I ran the first code snippet you shared with me (pushing to hub with your own trained model), I got the error below.

(screenshot of the error)

felixdittrich92 commented 2 years ago

Hi @zahidetastan, i will check this tomorrow 👍 How did you pass the Turkish vocab while training? Did you add the vocab in vocabs.py? (I saw we currently have no Turkish vocab 😅) For the second image: is that an absolute path? (If not, please try one.)

zahidetastan commented 2 years ago

Hi @felixdittrich92, Yes, I added the Turkish vocab in vocabs.py locally 😅 There is no problem at that point. I will check the path again. Thank you for all the help ✋🏻

felixdittrich92 commented 2 years ago

Hi @zahidetastan :wave:,

  1. Add the turkish vocab in vocabs.py (you could open a PR if you want to add it permanently :+1:):

    VOCABS['turkish'] = VOCABS['english'] + 'şŞıİĞÜçÇ'

(let me know if this is wrong :sweat_smile:)

  2. Trigger a dummy run:

    python3 /home/felix/Desktop/doctr/references/recognition/train_tensorflow.py crnn_vgg16_bn --name turkishdummy --epochs 1 --vocab turkish --push-to-hub

console log:

Namespace(amp=False, arch='crnn_vgg16_bn', batch_size=64, epochs=1, find_lr=False, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', input_size=32, lr=0.001, max_chars=12, min_chars=1, name='turkishdummy', pretrained=False, push_to_hub=True, resume=None, show_samples=False, test_only=False, train_path=None, train_samples=1000, val_path=None, val_samples=20, vocab='turkish', wb=False, workers=None)
git-lfs/3.1.2 (GitHub; linux amd64; go 1.17.6)
Validation set loaded in 0.01196s (2160 samples in 34 batches)
DEBUG:tensorflow:Layer lstm will use cuDNN kernels when running on GPU.
DEBUG:tensorflow:Layer lstm_1 will use cuDNN kernels when running on GPU.
Train set loaded in 0.003183s (108000 samples in 1687 batches)
WARNING:tensorflow:From /home/felix/.conda/envs/doctr-dev-tf/lib/python3.8/site-packages/tensorflow/python/ops/ctc_ops.py:1442: alias_inplace_add (from tensorflow.python.ops.inplace_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer tf.tensor_scatter_nd_add, which offers the same functionality with well-defined read-write semantics.
WARNING:tensorflow:From /home/felix/.conda/envs/doctr-dev-tf/lib/python3.8/site-packages/tensorflow/python/ops/ctc_ops.py:1425: alias_inplace_update (from tensorflow.python.ops.inplace_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.
Validation loss decreased inf --> 17.5608: saving state...                                                                                                      
Epoch 1/1 - Validation loss: 17.5608 (Exact: 5.32% | Partial: 5.56%)
/home/felix/.cache/huggingface/hub/turkishdummy is already a clone of https://huggingface.co/Felix92/turkishdummy. Make sure you pull the latest changes with `repo.git_pull()`.
Pulling changes ...
Adding files tracked by Git LFS: ['tf_model/weights.data-00000-of-00001', 'tf_model/weights.index']. This may take a bit of time if the files are large.
Upload file tf_model/weights.data-00000-of-00001: 100% | 60.3M/60.3M [01:44<00:00, 606kB/s]
Upload file tf_model/weights.index: 100% | 5.25k/5.25k [01:44<?, ?B/s]
To https://huggingface.co/Felix92/turkishdummy
   dfc9f52..2109c3b  main -> main
Everything up-to-date

WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.bias

https://huggingface.co/Felix92/turkishdummy/tree/main

config.json:

{
  "mean": [0.694, 0.695, 0.693],
  "std": [0.299, 0.296, 0.301],
  "input_shape": [32, 128, 3],
  "vocab": "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~°£€¥¢฿şŞıİĞÜçÇ",
  "url": "https://doctr-static.mindee.com/models?id=v0.3.0/crnn_vgg16_bn-76b7f2c6.zip&src=0",
  "arch": "crnn_vgg16_bn",
  "task": "recognition"
}
  3. Loading from hub:

    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor, from_hub

    image = DocumentFile.from_images(['/home/felix/Desktop/1.jpg'])
    reco_model = from_hub('Felix92/turkishdummy')

    # You can easily plug these models into the OCR predictor
    predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)
    result = predictor(image)

console log:

USE_TF=1 python3 /home/felix/Desktop/doctr/test2.py
Downloading: 100% | 433/433 [00:00<00:00, 240kB/s]
DEBUG:tensorflow:Layer lstm will use cuDNN kernels when running on GPU.
DEBUG:tensorflow:Layer lstm_1 will use cuDNN kernels when running on GPU.
Downloading: 100% | 1.52k/1.52k [00:00<00:00, 791kB/s]
Downloading: 100% | 1.57k/1.57k [00:00<00:00, 823kB/s]
Downloading: 100% | 433/433 [00:00<00:00, 232kB/s]
Downloading: 100% | 71.0/71.0 [00:00<00:00, 38.8kB/s]
Downloading: 100% | 63.2M/63.2M [00:10<00:00, 6.24MB/s]
Downloading: 100% | 5.38k/5.38k [00:00<00:00, 2.61MB/s]
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.bias


my env:

Collecting environment information...

DocTR version: 0.5.2a0
TensorFlow version: 2.8.1
PyTorch version: N/A (torchvision N/A)
OpenCV version: 4.5.5
OS: Ubuntu 22.04.1 LTS
Python version: 3.8.13
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): N/A
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
Nvidia driver version: 470.141.03
cuDNN version: Could not collect



So what you can try is: replace your config.json on the hub with mine and try to load again.
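
If it helps, one way to replace the file programmatically (a sketch using huggingface_hub; you can also edit config.json directly in the hub's web UI):

from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="config.json",  # your corrected local copy
    path_in_repo="config.json",
    repo_id="logo-data-science/crnn_vgg16_bn_20220831-104738",
)
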
felixdittrich92 commented 2 years ago

@zahidetastan let me know once you have everything you need, so I can delete the dummy model from my hub :+1:

One side note: i think this should make no difference, but docTR is currently tested on Python 3.6 up to 3.8.

felixdittrich92 commented 2 years ago

Hi @zahidetastan, any updates? :)

zahidetastan commented 2 years ago

Hi @felixdittrich92, The dataset I had was very small: it contained 10 samples that I created by manually clipping from images. So when I used the resulting recognition model, I got some random characters as output. That's why I looked for a font that includes Turkish characters, CODE2000.TTF, and made a dummy training run with this font. Then I ran inference on an image containing some Turkish characters rendered in that font. You can see the result below. What should I expect from a dummy training run? Is there something I misunderstood here? The dummy run generated 100,000 training samples. Even if I create this much data and then train the model on real data, will the training still perform poorly?

(screenshot of the recognition result)

Here is the dummy model link, using CODE2000.TTF: https://huggingface.co/logo-data-science/crnn_vgg16_bn_20220909-164010

Thanks for your interest 💯

felixdittrich92 commented 2 years ago

Hi @zahidetastan 👋 nice that the hf part now works correctly 👍 About the dataset: i would suggest fine-tuning the already pretrained model (it was trained on French, and the vocabs differ only in a few characters; you only need to pass --pretrained) on your real dataset, and try to get as much data as possible (a sketch of such a command follows below). About training with our word generator: it still needs some improvements, it is more useful for pretraining or debugging models, and it is currently not a good fit for training a prod-ready model 😅

If you want to train vgg16 from scratch, we are talking about millions of data samples needed, so the pretrained model should be the best match in your scenario :)
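
A minimal fine-tuning command along those lines (paths are placeholders; the flags match the training script used earlier in this thread):

python references/recognition/train_tensorflow.py crnn_vgg16_bn --pretrained --vocab turkish --train_path path/to/your/train --val_path path/to/your/val --epochs 10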

felixdittrich92 commented 2 years ago

Will convert this to a discussion, as it looks like the main issue is solved :)