I am trying to build a Devanagari model using kraken. When I use default values for training it works, but when I specify training and eval data separately I get a codec error.
I think you simply have to change the order of arguments. The training and testing should be at the end.
Your issue is that -e/-v expects a single file containing a list of image/xml files, while you hand it a bunch of images. Unfortunately, there is a fundamental limitation in the CLI library: an option can't take an arbitrary number of values.
@mittagessen Thanks!
The following worked:
ls -1 deva/*.png > deva.manifest
ls -1 devaeval/*.png > devaeval.manifest
nohup ketos -v train --resize add -i devatrain_model_best.mlmodel -t deva.manifest -e devaeval.manifest -o sa-add > sa-add.log &
WARNING: Logging before flag parsing goes to stderr.
W0218 11:13:55.503840 126403667904128 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
[6.1101] Building ground truth set from 13100 line images
I0218 11:13:55.709644 126403667904128 ketos.py:155] Building ground truth set from 13100 line images
[6.1103] Loading existing model from devatrain_model_best.mlmodel
I0218 11:13:55.709892 126403667904128 ketos.py:164] Loading existing model from devatrain_model_best.mlmodel
[7.5724] Disabling preloading for large (>2500) training data set. Enable by setting --preload parameter
I0218 11:13:57.172013 126403667904128 ketos.py:208] Disabling preloading for large (>2500) training data set. Enable by setting --preload parameter
Question: Should I set --preload if I have a large number of training images?
OK, so there are a few mechanisms interacting and the impact differs a bit depending on if you use the old (box) line-wise training or the new page-wise baseline one.
With the old format there's a dewarping step performed on the line before it is fed into the model. This is fairly slow (~0.5-1s/line, CPU-bound).
When you enable preloading this dewarping is performed once at the beginning and everything is kept in memory, otherwise each line is loaded and preprocessed ad-hoc over and over again. For large data sets you might not have sufficient memory to do preloading. Another reason to disable preloading is if you have a lot of cores, just throwing more workers at the preprocessing with --threads will be faster than preloading everything in a single thread in the beginning. This is doubly true if you've got a GPU.
For the new segmenter, the preprocessing for the recognizer is faster but I/O-bound as lines are sampled randomly across all pages but each page image has to be loaded in its entirety for each line, so which strategy is faster depends on I/O speed/latency.
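To make that concrete, here is a minimal sketch of the two strategies using only the flags mentioned above (--preload from the log output, --threads from the explanation; where exactly --threads attaches and the preload default may vary between kraken versions):

# force preloading: lines are dewarped once up front and kept in memory (needs enough RAM)
ketos train --preload -t deva.manifest -e devaeval.manifest -o deva-preload

# leave preloading off (the default above 2500 lines) and spread the ad-hoc
# preprocessing over several worker threads instead
ketos train --threads 8 -t deva.manifest -e devaeval.manifest -o deva-workers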
W0218 04:40:21.871934 127598883522176 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
This warning seems to come from coremltools. Maybe there is a way to tell coremltools not to search for TensorFlow.
@mittagessen Thank you for the detailed explanation.
Currently I am training with a subset of the existing training data that I used for tesseract, converted to single-line .png and .gt.txt files. I have done one training run with synthetic data of multiple fonts and want to augment it with additional synthetic data and scanned pages to build a generic Devanagari model.
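As a quick sanity check before building the manifests, a small shell loop (assuming the image.png / image.gt.txt pairing described above) can flag line images that are missing a transcript:

# report any line image without a matching .gt.txt transcript
for img in deva/*.png; do
    [ -f "${img%.png}.gt.txt" ] || echo "missing transcript for $img"
done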
Is there any option which can be given with ketos transcribe that will load the existing gt.txt along with png in the html file - mainly for reviewing the transcription?
Current model test with similar images, not seen during training -
WARNING: Logging before flag parsing goes to stderr.
W0219 03:01:58.175485 128668646590080 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading model devatrain_model_best.mlmodel ✓
Evaluating devatrain_model_best.mlmodel
=== report ===
392232 Characters
29388 Errors
92.51% Accuracy
16610 Insertions
3409 Deletions
9369 Substitutions
Count Missed %Right
299156 17626 94.11% Devanagari
93074 8351 91.03% Common
2 2 0.00% Inherited
Errors Correct-Generated
5260 { 0xa } - { }
2660 { DEVANAGARI SIGN VIRAMA } - { }
974 { र } - { }
759 { } - { DEVANAGARI SIGN VIRAMA }
724 { DEVANAGARI SIGN ANUSVARA } - { }
529 { SPACE } - { }
476 { त } - { }
413 { } - { SPACE }
306 { } - { र }
288 { व } - { }
280 { DEVANAGARI VOWEL SIGN AA } - { }
275 { DEVANAGARI SIGN NUKTA } - { }
231 { DEVANAGARI VOWEL SIGN U } - { }
214 { } - { DEVANAGARI SIGN ANUSVARA }
207 { DEVANAGARI VOWEL SIGN I } - { }
204 { प } - { }
196 { ष } - { }
195 { य } - { }
173 { : } - { DEVANAGARI SIGN VISARGA }
172 { DEVANAGARI VOWEL SIGN E } - { }
170 { न } - { र }
164 { . } - { }
162 { } - { DEVANAGARI VOWEL SIGN AA }
161 { न } - { }
158 { द } - { }
151 { DEVANAGARI SIGN VISARGA } - { }
142 { च } - { }
141 { च } - { व }
141 { ध } - { थ }
137 { क } - { }
131 { } - { व }
126 { ध } - { }
118 { म } - { }
116 { श } - { }
116 { ञ } - { }
115 { ज } - { }
115 { स } - { म }
108 { व } - { य }
106 { ब } - { व }
106 { DEVANAGARI VOWEL SIGN VOCALIC R } - { }
...
Average accuracy: 92.51%, (stddev: 0.00)
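For context, a report of this shape is what kraken's evaluation command prints. The exact invocation isn't shown above, but it would look roughly like the following (an assumption, so adjust the model name and flags to your kraken version):

# evaluate the trained model on held-out line images
# (assumes the matching .gt.txt transcripts sit next to the .png files)
ketos test -m devatrain_model_best.mlmodel devaeval/*.png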
Not really. The easiest way would be to write a segmentation file for each of them, but then you've got half a million HTML files. On the other hand you can run the normal kraken ... ocr command with the -s/--no-segmentation mode to treat each input image as a single line.
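A minimal sketch of that single-line mode for one line strip (file names are placeholders; the -i input/output pairing and the -m model option follow the usual kraken CLI, but the exact spelling and placement of the no-segmentation switch may differ between versions):

# treat the whole input image as a single line instead of running the segmenter
kraken -i line_0001.png line_0001.txt ocr --no-segmentation -m devatrain_model_best.mlmodel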
Current model test with similar images, not seen during training -
You didn't strip out trailing newlines in your test set. That's roughly 5000 errors right there.
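A minimal sketch for fixing that in bulk, assuming the .png/.gt.txt layout used above (the command substitution drops trailing newlines and printf writes the text back without adding one):

# strip trailing newlines from every ground-truth transcript, in place
for f in deva/*.gt.txt; do
    printf '%s' "$(cat "$f")" > "$f"
done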
Is there any option which can be given with ketos transcribe that will load the existing gt.txt along with png in the html file - mainly for reviewing the transcription?
If I understand correctly, currently 'correction.html' generated by ketos transcribe has the images and blank space for entering the transcription. I wanted to know if it was possible to prefill them with gt.txt (in the case where single line images and their single line transcription is available).
Unfortunately not. There's a mode to prefill using an existing model (--prefill) but that's it. Frankly, that whole thing is legacy code anyway since the new segmenter is up and running.
You didn't strip out trailing newlines in your test set. That's roughly 5000 errors right there.
Thanks for pointing that out. I have made the change to the gt.txt files.
I ran ketos transcribe on a few test images, using the --prefill option with my test model.
While transcription.html shows all 7 line images, only 5 boxes are visible for entering the ground truth.
Also, ketos extract is not creating any files from the html.
A zip file with the test files is attached.
While transcription.html shows all 7 line images, only 5 boxes are visible for entering the ground truth.
Because ketos transcribe expects whole page images and you're feeding in single line strips. The old segmenter is not designed for that (like a lot of other material) and just fails to find any lines for those 2 line images. As mentioned, you can provide a manual segmentation (in this case one bounding box of the image size) to the transcribe command, but that only works for a one-page-per-HTML setup.
Also, ketos extract is not creating any files from the html.
Weird, it works for me on your HTML file. Which version of kraken are you running?
ubuntu@tesseract-ocr:~$ conda activate py36
(py36) ubuntu@tesseract-ocr:~$ kraken --version
kraken, version 2.0.8
@mittagessen I installed the beta version just now.
kraken --version
kraken, version 3.0.0.0b3
ketos transcribe with prefill is getting an error now.
ketos transcribe *.png --prefill /home/ubuntu/kraken/deva/devanew_8.mlmodel
WARNING: Logging before flag parsing goes to stderr.
W0220 14:00:50.879532 124946201921312 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading ANN
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/py36/bin/ketos", line 8, in <module>
sys.exit(cli())
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/ketos.py", line 912, in transcription
prefill = models.load_any(prefill)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/models.py", line 157, in load_any
seq = TorchSeqRecognizer(nn, train=train, device=device)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/models.py", line 52, in __init__
if nn.model_type not in [None, 'recognition']:
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/vgsl.py", line 471, in model_type
return self.user_metadata['model_type']
KeyError: 'model_type'
Also, ketos transcribe is not finding all lines even when using page images as input.
Please see the attached html file, which has samples of different errors:
- Incorrect line segmentation for indented lines in verse.
- Verse numbers at the end of a line after white space being ignored.
- No line recognized in one image.
- One line recognized in one image.
- Page range in a table-of-contents format page being treated as a separate column.
Your model error has been fixed in one of the subsequent commits. There was a bug where one of the fields wasn't default-filled for models not trained on that branch. You won't get these metadata fields magically; they have to be added during training, and they're chiefly important for the new segmenter, as they ensure that model types and image modes aren't mixed up. What kind of metadata are you expecting?
Yes, that's the old segmenter for you. It's less than ideal for anything but Latin, Greek, Hebrew, and to a lesser extent Arabic. The new one fixes all that by being trainable, but it works fundamentally differently and the ketos transcribe/ketos extract workflow won't be adapted for it.
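As a rough pointer only (none of this is confirmed in the thread, so treat the command names and flags as assumptions about the 3.x/blla line of kraken): the trainable baseline segmenter has its own training entry point and a baseline switch at inference time, roughly:

# train a baseline segmentation model from page images with XML ground truth
ketos segtrain -o deva_seg training_pages/*.xml

# run layout analysis with the trained baseline segmenter instead of the legacy one
kraken -i page.tif page.json segment -bl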
Is there any documentation on how to use the new segmenter?
Your model error has been fixed in one of the subsequent commits.
Which branch should I use to get it?
Built kraken from the blla branch.
(py36) ubuntu@tesseract-ocr:~/kraken/deva/pages$ kraken --version
kraken, version 3.0.0.0b4.dev9
A different error now:
(py36) ubuntu@tesseract-ocr:~/kraken/deva/pages$ ketos transcribe *.tif --prefill /home/ubuntu/kraken/deva/devanew_8.mlmodel
WARNING: Logging before flag parsing goes to stderr.
W0220 15:58:40.596257 134746554844800 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading ANN✓
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/py36/bin/ketos", line 8, in <module>
sys.exit(cli())
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/ketos.py", line 976, in transcription
ti.add_page(im, res, records=preds)
File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/transcribe.py", line 74, in add_page
'left': 100*int(bbox[0]) / im.size[0],
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'
I have uploaded my training and test data and resulting model at https://github.com/Shreeshrii/kraken_devanagari .
@mittagessen Thank you for your prompt response and guidance.