Hello @zahidetastan :wave:
Sorry about this, let's try to help you! Do you have a fork with a branch that has your modifications by any chance?
My best guess is that the string encoding doesn't work with some Turkish characters (or simply that the font doesn't support this character) :thinking: So let's check this in three steps:
1. Check whether the string encoding is a problem. To do so, let's see what this piece of code outputs:
from doctr.datasets.utils import encode_sequences
encode_sequences(["NAKİT"], vocab=<YOUR_TURKISH_VOCAB>)
2. Check if the text image generation (or the font) is a problem. For this, could you try to run this piece of code and post the resulting image, please?
import matplotlib.pyplot as plt
from doctr.datasets.generator.base import synthesize_text_img
plt.imshow(synthesize_text_img("NAKİT")); plt.show()
3. Check whether the target generation is a problem:
from doctr.datasets import CharacterGenerator, WordGenerator
char_ds = CharacterGenerator("<YOUR_TURKISH_VOCAB>")
print(list(map(char_ds.vocab.__getitem__, [sample[1] for sample in char_ds._data])))
word_ds = WordGenerator("<YOUR_TURKISH_VOCAB>")
print(word_ds._generate_string(100, 200))
Let me know if you run into any trouble! Cheers :v:
Hello @frgfm 👋, first of all, thank you very much for your interest.
Here is my output for the first check (string encoding).
And here is my output for the second check (text image generation / font). I can't get a resulting image because of the error.
So our problem is the CAPITAL LETTER I WITH DOT ABOVE (İ): UnicodeEncodeError: 'latin-1' codec can't encode character '\u0130' in position 3: ordinal not in range(256)
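For reference, this can be reproduced without docTR at all, since U+0130 simply has no latin-1 representation:

# plain Python, no docTR involved
"NAKİT".encode("latin-1")
# UnicodeEncodeError: 'latin-1' codec can't encode character '\u0130' in position 3

PIL's built-in default bitmap font only covers latin-1 text, which is why the rendering step fails on this character.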
Alright, this has narrowed the problem down to the image generation! This snippet should produce the same error then:
from doctr.utils.fonts import get_font
font = get_font(None, 32)
font.getsize("İ")
If so, I can see two options:
The second aspect is quite a natural one: you can't render the string if it cannot be mapped to a font family that has been installed in your environment. Neither solution should logically be implemented on the docTR side, so I suggest you find a font family that allows the rendering :+1:
Hint: you haven't specified a font family, so it loads the default, which is quite limited. But if a common font family like Arial supports this character, you could do:
font = get_font("Arial.ttf", 32)
which should fix the problem :)
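A quick way to verify that a candidate font actually covers the character is to call Pillow directly (a sketch; DejaVuSans is just an example of a widely available font with broad Unicode coverage, and getsize matches the call used above, though newer Pillow versions replace it with getbbox):

from PIL import ImageFont

# the font path is an assumption; any installed TTF with Turkish glyphs works
font = ImageFont.truetype("DejaVuSans.ttf", 32)
print(font.getsize("İ"))  # returns (width, height) instead of raising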
Let me know how it goes!
Any update, @zahidetastan? :)
Hello @frgfm Finding a font family that includes the Turkish characters solved the error. But... I trained the model with the push-to-hub part commented out, and I couldn't figure out where to get an output and how to use this trained model. Then I wanted to use the push_to_hub argument to see the trained model on the hub. In this part, I get an error about libraries.
Hi @zahidetastan :wave:, I can help with this. Are you currently on the main branch, i.e. have you forked the repo to train your model?
Docs: https://mindee.github.io/doctr/latest/using_doctr/sharing_models.html
Note: the HF hub integration is currently in an early state; fixes are planned for the 0.6.0 release.
Hi @felixdittrich92 👋, Yes, I am currently on the main branch. I fixed the login_to_hub error by adding sys.path.append('.') before the line importing login_to_hub. There was also a problem with doctr being imported from the venv instead of the repo folder, so I deleted the python-doctr library from my environment. After completing the training, I have a tf_model folder plus a config.json file. Inside the tf_model folder there are the checkpoint, weights.data-00000-of-00001 and weights.index files. How should I proceed after this stage?
Thanks for all your effort and help :)
Hi @zahidetastan :wave: ,
Pushing to hub with your own trained model:
from doctr.models import recognition, login_to_hub, push_to_hf_hub
login_to_hub()
my_awesome_model = recognition.crnn_vgg16_bn(pretrained=False, pretrained_backbone=False)
my_awesome_model.load_weights("path-to-your-trained-model-folder/weights")
push_to_hf_hub(my_awesome_model, model_name='doctr-crnn-vgg16-bn-turkish-v1', task='recognition', arch='crnn_vgg16_bn')
# task and arch need to match the model: yours is a recognition model with arch crnn_vgg16_bn; the model name can be chosen freely
Loading example:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, from_hub
image = DocumentFile.from_images(['data/example.jpg'])
# Load a custom detection model from huggingface hub
det_model = from_hub('Felix92/doctr-tf-db-resnet50')
# Load a custom recognition model from huggingface hub
reco_model = from_hub('Felix92/doctr-tf-crnn-vgg16-bn-french')
# You can easily plug these models into the OCR predictor
predictor = ocr_predictor(det_arch=det_model, reco_arch=reco_model)
# in your case:
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)  # load the pretrained text detection model and use your custom reco model from the hub
result = predictor(image)
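If helpful, the returned Document object can be inspected directly (docTR's result API exposes render() and export()):

# plain-text reconstruction of what was recognized
print(result.render())
# nested dict with pages, blocks, lines, words and their geometries
json_output = result.export()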
PS: if your model works well on Turkish, I would be really happy to add it as the #1 community model to the list https://mindee.github.io/doctr/latest/using_doctr/sharing_models.html#pretrained-community-models
Hi again @felixdittrich92 👋, first of all, I am very curious about the final performance of this model for Turkish. Of course, if the model works well, I can add it.
When completing the recognition model training with TensorFlow, I got an error like the one below, so there seems to be something missing in the config.json file.
It was also uploaded to the hub. Let me share the link with you.
https://huggingface.co/logo-data-science/crnn_vgg16_bn_20220831-104738/tree/main
When I ran the first code snippet you shared with me (pushing to hub with your own trained model), I got an error like the one below.
Hi @zahidetastan, I will check this tomorrow 👍 How did you pass the Turkish vocab while training? Have you added the vocab in vocabs.py? (I saw we have no Turkish vocab currently 😅) For the second image: is this an absolute path? (If not, please try that.)
Hi @felixdittrich92, Yes, I added the Turkish vocab in vocabs.py locally 😅 There is no problem at that point. I will check the path again. Thank you for all the help ✋🏻
Hi @zahidetastan :wave:,
Add the Turkish vocab in vocabs.py (you could open a PR if you want to add it permanently :+1:):
VOCABS['turkish'] = VOCABS['english'] + 'şŞıİĞÜçÇ'
(let me know if this is wrong :sweat_smile: )
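For completeness, the full Turkish alphabet adds twelve non-ASCII characters (ç, ğ, ı, ö, ş, ü and their uppercase forms), so a fuller variant might be:

VOCABS['turkish'] = VOCABS['english'] + 'çÇğĞıİöÖşŞüÜ'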
Trigger a dummy run:
python3 /home/felix/Desktop/doctr/references/recognition/train_tensorflow.py crnn_vgg16_bn --name turkishdummy --epochs 1 --vocab turkish --push-to-hub
console log:
Namespace(amp=False, arch='crnn_vgg16_bn', batch_size=64, epochs=1, find_lr=False, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', input_size=32, lr=0.001, max_chars=12, min_chars=1, name='turkishdummy', pretrained=False, push_to_hub=True, resume=None, show_samples=False, test_only=False, train_path=None, train_samples=1000, val_path=None, val_samples=20, vocab='turkish', wb=False, workers=None)
git-lfs/3.1.2 (GitHub; linux amd64; go 1.17.6)
Validation set loaded in 0.01196s (2160 samples in 34 batches)
DEBUG:tensorflow:Layer lstm will use cuDNN kernels when running on GPU.
DEBUG:tensorflow:Layer lstm_1 will use cuDNN kernels when running on GPU.
Train set loaded in 0.003183s (108000 samples in 1687 batches)
WARNING:tensorflow:From /home/felix/.conda/envs/doctr-dev-tf/lib/python3.8/site-packages/tensorflow/python/ops/ctc_ops.py:1442: alias_inplace_add (from tensorflow.python.ops.inplace_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer tf.tensor_scatter_nd_add, which offers the same functionality with well-defined read-write semantics.
WARNING:tensorflow:From /home/felix/.conda/envs/doctr-dev-tf/lib/python3.8/site-packages/tensorflow/python/ops/ctc_ops.py:1425: alias_inplace_update (from tensorflow.python.ops.inplace_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer tf.tensor_scatter_nd_update, which offers the same functionality with well-defined read-write semantics.
Validation loss decreased inf --> 17.5608: saving state...
Epoch 1/1 - Validation loss: 17.5608 (Exact: 5.32% | Partial: 5.56%)
/home/felix/.cache/huggingface/hub/turkishdummy is already a clone of https://huggingface.co/Felix92/turkishdummy. Make sure you pull the latest changes with `repo.git_pull()`.
Pulling changes ...
Adding files tracked by Git LFS: ['tf_model/weights.data-00000-of-00001', 'tf_model/weights.index']. This may take a bit of time if the files are large.
Upload file tf_model/weights.data-00000-of-00001: 100%|██████████| 60.3M/60.3M [01:44<00:00, 606kB/s]
Upload file tf_model/weights.index: 100%|██████████| 5.25k/5.25k [01:44<?, ?B/s]
To https://huggingface.co/Felix92/turkishdummy
   dfc9f52..2109c3b  main -> main
Everything up-to-date
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.bias
https://huggingface.co/Felix92/turkishdummy/tree/main
config.json:
{
  "mean": [0.694, 0.695, 0.693],
  "std": [0.299, 0.296, 0.301],
  "input_shape": [32, 128, 3],
  "vocab": "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~°£€¥¢฿şŞıİĞÜçÇ",
  "url": "https://doctr-static.mindee.com/models?id=v0.3.0/crnn_vgg16_bn-76b7f2c6.zip&src=0",
  "arch": "crnn_vgg16_bn",
  "task": "recognition"
}
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, from_hub

image = DocumentFile.from_images(['/home/felix/Desktop/1.jpg'])
reco_model = from_hub('Felix92/turkishdummy')
predictor = ocr_predictor(reco_arch=reco_model, pretrained=True)
result = predictor(image)
console log:
USE_TF=1 python3 /home/felix/Desktop/doctr/test2.py
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 433/433 [00:00<00:00, 240kB/s]
DEBUG:tensorflow:Layer lstm will use cuDNN kernels when running on GPU.
DEBUG:tensorflow:Layer lstm_1 will use cuDNN kernels when running on GPU.
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 791kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.57k/1.57k [00:00<00:00, 823kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 433/433 [00:00<00:00, 232kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71.0/71.0 [00:00<00:00, 38.8kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.2M/63.2M [00:10<00:00, 6.24MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.38k/5.38k [00:00<00:00, 2.61MB/s]
WARNING:tensorflow:Detecting that an object or model or tf.train.Checkpoint is being deleted with unrestored values. See the following logs for the specific values in question. To silence these warnings, use `status.expect_partial()`. See https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint#restore for details about the status object returned by the restore function.
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).layer_with_weights-26.bias
my env:
Collecting environment information...
DocTR version: 0.5.2a0
TensorFlow version: 2.8.1
PyTorch version: N/A (torchvision N/A)
OpenCV version: 4.5.5
OS: Ubuntu 22.04.1 LTS
Python version: 3.8.13
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): N/A
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
Nvidia driver version: 470.141.03
cuDNN version: Could not collect
So what you can try is:
replace your config.json on the hub with mine and try to load it again
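That can be done via the hub web UI, or programmatically; a sketch with huggingface_hub's upload_file (repo id taken from the link above, local file name assumed):

from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="config.json",  # the corrected config, saved locally
    path_in_repo="config.json",
    repo_id="logo-data-science/crnn_vgg16_bn_20220831-104738",
)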
@zahidetastan let me know once you have everything, so I can delete the dummy model from my hub :+1:
One side note: I think this should make no difference, but docTR is currently tested from Python 3.6 up to 3.8.
Hi @zahidetastan, any updates? :)
Hi @felixdittrich92, The dataset I had was very small: it contained 10 samples that I created by manually clipping from images. So when I used it for the recognition model, I saw some random characters as a result. That's why I found a font with Turkish characters, CODE2000.TTF, and made a dummy training run with this font. Then I tested with an image that contains some Turkish characters in that font; the result is shown below. What should I expect from a dummy training run? Is there something I misunderstood here? The dummy run generated 100,000 training samples. Even if I generate this much data and then train the model with real data, will the training still be poor?
Here is dummy model link with using CODE2000.TTF: https://huggingface.co/logo-data-science/crnn_vgg16_bn_20220909-164010
Thanks for your interest 💯
Hi @zahidetastan 👋 Nice that the HF part now works correctly 👍 About the dataset: I would suggest fine-tuning the already pretrained model (trained on French, so you only need to pass --pretrained) on your real dataset (try to get as much data as possible), because the vocabs differ in only a few characters. About training with our word generator: it needs some improvements and is more suited to pretraining or debugging models; it is currently not a good fit for training a production-ready model 😅
If you want to train vgg16 from scratch, we are talking about millions of data samples, so starting from the pretrained model should be the best match in your scenario :)
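As a rough sketch, a fine-tuning run could look like this (paths are placeholders, and the exact flag names should be double-checked against the script's --help):

python3 references/recognition/train_tensorflow.py crnn_vgg16_bn --pretrained --vocab turkish --train_path /path/to/your/train --val_path /path/to/your/val --epochs 50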
I will convert this to a discussion; it looks like the main issue is solved :)
Bug description
I want to use docTR to train text recognition on Turkish-language data. I have created a dataset as specified in the documentation. I have also added the vocab for Turkish in vocabs.py; Turkish consists of VOCABS['english'] plus some Turkish characters, for example i, ö, ü, İ. Therefore, there are some Turkish special characters in the training images and in the dataset's JSON file. When I start training with this dataset, I get an error like the one below. As can be seen in the Target section, the Turkish characters are not read as they appear in the JSON file, for example the capital İ in "TARİH: 24.08.2021" and "NAKİT".
And this is not a situation specific to Turkish: when I start a training run for Portuguese, it similarly does not accept the dataset, even though its values match the characters in the vocab.
Can someone explain why I am getting this error and what has to be changed? How can I train on a dataset containing special characters?
Code snippet to reproduce the bug
Here is my training script
Error traceback
Environment
Deep Learning backend
is_tf_available: True
is_torch_available: False