mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Training for Devanagari #173

Closed by Shreeshrii 4 years ago

Shreeshrii commented 4 years ago

I am trying to build a Devanagari model using kraken. Training with the default values works, but when I specify the training and evaluation data separately I get a codec error.

Both commands below use the same set of training data.

This worked:

ketos train devatrain/*.png > devatrain.log

WARNING: Logging before flag parsing goes to stderr.
W0218 04:40:21.871934 127598883522176 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Initializing model ✓

This gets the error:

ketos -v  train -t devatrain/*.png -e devatrain/*.png -o devatraintest > devatraintest.log

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py36/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1135, in invoke
    sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 641, in make_context
    self.parse_args(ctx, args)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 940, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1477, in handle_parse_result
    self.callback, ctx, self, value)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 96, in invoke_param_callback
    return callback(ctx, param, value)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/ketos.py", line 63, in _validate_manifests
    for entry in manifest.readlines():
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

dstoekl commented 4 years ago

I think you simply have to change the order of arguments. The training and testing should be at the end.

mittagessen commented 4 years ago

Your issue is that -t/-e each expect a single file containing a list of image/XML files, while you are handing them a bunch of images. Unfortunately, there is a fundamental limitation in the CLI library: an option can't take an arbitrary number of values.
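For context, 0x89 is the first byte of the PNG file signature, so the traceback is what one would expect when a PNG image is read as a UTF-8 text manifest. A minimal Python sketch, assuming a manifest is simply a UTF-8 text file with one path per line (as described above); `looks_like_manifest` and `write_manifest` are hypothetical helpers, not part of kraken:

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def looks_like_manifest(path):
    """Return True if the file decodes as UTF-8 text, False otherwise."""
    try:
        Path(path).read_text(encoding="utf-8")
        return True
    except UnicodeDecodeError:
        # A PNG fails here at position 0 because 0x89 cannot start a UTF-8 sequence.
        return False


def write_manifest(image_dir, manifest_path, pattern="*.png"):
    """Write one image path per line and return the number of entries."""
    images = sorted(str(p) for p in Path(image_dir).glob(pattern))
    Path(manifest_path).write_text("".join(f"{p}\n" for p in images), encoding="utf-8")
    return len(images)
```

The resulting file can then be passed as the single value of an option that expects a manifest.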

Shreeshrii commented 4 years ago

@mittagessen Thanks!

The following worked:

ls -1 deva/*.png > deva.manifest
ls -1 devaeval/*.png > devaeval.manifest
nohup ketos -v train --resize add -i devatrain_model_best.mlmodel -t deva.manifest -e devaeval.manifest -o sa-add > sa-add.log &

WARNING: Logging before flag parsing goes to stderr.
W0218 11:13:55.503840 126403667904128 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
[6.1101] Building ground truth set from 13100 line images 
I0218 11:13:55.709644 126403667904128 ketos.py:155] Building ground truth set from 13100 line images
[6.1103] Loading existing model from devatrain_model_best.mlmodel  
I0218 11:13:55.709892 126403667904128 ketos.py:164] Loading existing model from devatrain_model_best.mlmodel 
[7.5724] Disabling preloading for large (>2500) training data set. Enable by setting --preload parameter 
I0218 11:13:57.172013 126403667904128 ketos.py:208] Disabling preloading for large (>2500) training data set. Enable by setting --preload parameter

Question: Should I set --preload if I have a large number of training images?

mittagessen commented 4 years ago

OK, so there are a few mechanisms interacting, and the impact differs a bit depending on whether you use the old (box) line-wise training or the new page-wise baseline one.

With the old format there's a dewarping step performed on the line before it is fed into the model. This is fairly slow (~0.5-1s/line, CPU-bound).

When you enable preloading, this dewarping is performed once at the beginning and everything is kept in memory; otherwise each line is loaded and preprocessed ad hoc, over and over again. For large data sets you might not have sufficient memory for preloading. Another reason to disable preloading is having a lot of cores: throwing more workers at the preprocessing with --threads will be faster than preloading everything in a single thread at the beginning. This is doubly true if you've got a GPU.

For the new segmenter, the preprocessing for the recognizer is faster but I/O-bound: lines are sampled randomly across all pages, but each page image has to be loaded in its entirety for each line. Which strategy is faster therefore depends on I/O speed and latency.
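The trade-off between the two strategies can be sketched in a few lines of Python. This is an illustration only, not kraken's implementation: `preprocess` is a stand-in for the slow per-line step, and both dataset classes are hypothetical. A thread pool keeps the sketch short; for CPU-bound work in Python a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(line):
    """Stand-in for the slow per-line step (e.g. dewarping)."""
    return line.upper()


class PreloadedDataset:
    """Preprocess every line once up front and keep the results in memory."""

    def __init__(self, lines):
        # Cost: one pass in a single thread, plus memory for all samples.
        self.samples = [preprocess(line) for line in lines]

    def epoch(self):
        return list(self.samples)


class OnDemandDataset:
    """Keep only the raw lines and redo preprocessing every epoch with workers."""

    def __init__(self, lines, workers=4):
        self.lines = list(lines)
        self.workers = workers

    def epoch(self):
        # Cost: preprocessing is repeated each epoch, but spread over workers.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(preprocess, self.lines))
```

Both classes return the same samples per epoch; they differ only in when the preprocessing cost is paid and how much memory is held.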

amitdo commented 4 years ago

W0218 04:40:21.871934 127598883522176 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .

This warning seems to come from coremltools. Maybe there is a way to tell coremltools not to search for TensorFlow.

Shreeshrii commented 4 years ago

@mittagessen Thank you for the detailed explanation.

Currently I am training with a subset of the existing training data that I used for tesseract, converted to single-line .png and .gt.txt files. I have done one training run with synthetic data in multiple fonts and want to augment it with additional synthetic data and scanned pages to build a generic Devanagari model.

Is there any option which can be given with ketos transcribe that will load the existing gt.txt along with png in the html file - mainly for reviewing the transcription?

Shreeshrii commented 4 years ago

Test of the current model on similar images not seen during training:

WARNING: Logging before flag parsing goes to stderr.
W0219 03:01:58.175485 128668646590080 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading model devatrain_model_best.mlmodel  ✓
Evaluating devatrain_model_best.mlmodel
=== report  ===

392232  Characters
29388   Errors
92.51%  Accuracy

16610   Insertions
3409    Deletions
9369    Substitutions

Count   Missed  %Right
299156  17626   94.11%  Devanagari
93074   8351    91.03%  Common
2       2       0.00%   Inherited

Errors  Correct-Generated
5260    { 0xa } - {  }
2660    { DEVANAGARI SIGN VIRAMA } - {  }
974 { र } - {  }
759 {  } - { DEVANAGARI SIGN VIRAMA }
724 { DEVANAGARI SIGN ANUSVARA } - {  }
529 { SPACE } - {  }
476 { त } - {  }
413 {  } - { SPACE }
306 {  } - { र }
288 { व } - {  }
280 { DEVANAGARI VOWEL SIGN AA } - {  }
275 { DEVANAGARI SIGN NUKTA } - {  }
231 { DEVANAGARI VOWEL SIGN U } - {  }
214 {  } - { DEVANAGARI SIGN ANUSVARA }
207 { DEVANAGARI VOWEL SIGN I } - {  }
204 { प } - {  }
196 { ष } - {  }
195 { य } - {  }
173 { : } - { DEVANAGARI SIGN VISARGA }
172 { DEVANAGARI VOWEL SIGN E } - {  }
170 { न } - { र }
164 { . } - {  }
162 {  } - { DEVANAGARI VOWEL SIGN AA }
161 { न } - {  }
158 { द } - {  }
151 { DEVANAGARI SIGN VISARGA } - {  }
142 { च } - {  }
141 { च } - { व }
141 { ध } - { थ }
137 { क } - {  }
131 {  } - { व }
126 { ध } - {  }
118 { म } - {  }
116 { श } - {  }
116 { ञ } - {  }
115 { ज } - {  }
115 { स } - { म }
108 { व } - { य }
106 { ब } - { व }
106 { DEVANAGARI VOWEL SIGN VOCALIC R } - {  }

...

Average accuracy: 92.51%, (stddev: 0.00)
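The headline figures in this report are consistent with errors being the sum of insertions, deletions, and substitutions, and accuracy being the share of characters without an error. A short Python check, assuming that is how the report derives them:

```python
characters = 392232
insertions, deletions, substitutions = 16610, 3409, 9369

# Assumption: total errors = insertions + deletions + substitutions.
errors = insertions + deletions + substitutions

# Assumption: accuracy = (characters - errors) / characters, as a percentage.
accuracy = 100 * (characters - errors) / characters

print(errors)              # 29388
print(f"{accuracy:.2f}%")  # 92.51%
```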
mittagessen commented 4 years ago

Not really. The easiest way would be to write a segmentation file for each of them but then you've got half a million HTML files. On the other hand you can run the normal kraken .... ocr command with the -s/--no-segmentation mode to treat each input image as a single line.

mittagessen commented 4 years ago

You didn't strip out trailing newlines in your test set. That's roughly 5000 errors right there.
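The `{ 0xa } - {  }` row in the report (5260 errors) is the line feed character at the end of each ground-truth line. A minimal Python sketch for removing it, assuming the ground truth lives in UTF-8 `.gt.txt` files in one directory; `strip_trailing_newlines` is a hypothetical helper, not a kraken command:

```python
from pathlib import Path


def strip_trailing_newlines(gt_dir, pattern="*.gt.txt"):
    """Rewrite each ground-truth file without trailing line breaks.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(Path(gt_dir).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        cleaned = text.rstrip("\r\n")
        if cleaned != text:
            # Write bytes so no newline translation is applied on output.
            path.write_bytes(cleaned.encode("utf-8"))
            changed += 1
    return changed
```

Running it a second time on the same directory should report zero changed files, which is a quick way to confirm the test set is clean.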

Shreeshrii commented 4 years ago

Is there any option which can be given with ketos transcribe that will load the existing gt.txt along with png in the html file - mainly for reviewing the transcription?

If I understand correctly, currently 'correction.html' generated by ketos transcribe has the images and blank space for entering the transcription. I wanted to know if it was possible to prefill them with gt.txt (in the case where single line images and their single line transcription is available).

mittagessen commented 4 years ago

If I understand correctly, currently 'correction.html' generated by ketos transcribe has the images and blank space for entering the transcription. I wanted to know if it was possible to prefill them with gt.txt (in the case where single line images and their single line transcription is available).

Unfortunately not. There's a mode to prefill using an existing model (--prefill) but that's it. Frankly, that whole thing is legacy code anyway since the new segmenter is up and running.

Shreeshrii commented 4 years ago

You didn't strip out trailing newlines in your test set. That's roughly 5000 errors right there.

Thanks for pointing that out. I have made the change to the gt.txt files.

Shreeshrii commented 4 years ago

I ran ketos transcribe on a few test images. I used the --prefill option using my test model.

While transcription.html shows all 7 line images, only 5 boxes are visible for entering the groundtruth.

Also, ketos extract is not creating any files from the html.

zip file with the test files is attached.

archive.zip

mittagessen commented 4 years ago

While transcription.html shows all 7 line images, only 5 boxes are visible for entering the groundtruth.

Because ketos transcribe expects whole page images and you're feeding in single line strips. The old segmenter is not designed for that (like a lot of other material) and just fails to find any lines for those 2 line images. As mentioned you can provide a manual segmentation (in this case one bounding box of the image size) to the transcribe command but that only works for a 1 page per HTML setup.
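A manual segmentation with one bounding box covering the whole image could be generated along these lines. This is a sketch under stated assumptions: the JSON key names (`boxes`, `text_direction`, `script_detection`) and the box order (x0, y0, x1, y1) are assumptions about the legacy box segmenter's output and should be checked against a file produced by kraken itself. Both functions are hypothetical helpers.

```python
import json


def single_line_segmentation(width, height, text_direction="horizontal-lr"):
    """Describe one bounding box (x0, y0, x1, y1) spanning the whole image."""
    return {
        # ASSUMPTION: key names and value formats mirror the legacy segmenter.
        "text_direction": text_direction,
        "boxes": [[0, 0, width, height]],
        "script_detection": False,
    }


def write_segmentation(path, width, height):
    """Write the single-box segmentation for one line image to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(single_line_segmentation(width, height), fp)
```

The image width and height are passed in explicitly so the sketch has no dependency on an imaging library.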

Also, ketos extract is not creating any files from the html.

Weird, it works for me on your HTML file. Which version of kraken are you running?

Shreeshrii commented 4 years ago

ubuntu@tesseract-ocr:~$ conda activate py36
(py36) ubuntu@tesseract-ocr:~$ kraken --version
kraken, version 2.0.8

Shreeshrii commented 4 years ago

@mittagessen I installed the beta version just now.

 kraken --version
kraken, version 3.0.0.0b3

ketos transcribe with prefill is getting an error now.

ketos transcribe *.png --prefill /home/ubuntu/kraken/deva/devanew_8.mlmodel
WARNING: Logging before flag parsing goes to stderr.
W0220 14:00:50.879532 124946201921312 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading ANNTraceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py36/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/ketos.py", line 912, in transcription
    prefill = models.load_any(prefill)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/models.py", line 157, in load_any
    seq = TorchSeqRecognizer(nn, train=train, device=device)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/models.py", line 52, in __init__
    if nn.model_type not in [None, 'recognition']:
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/lib/vgsl.py", line 471, in model_type
    return self.user_metadata['model_type']
KeyError: 'model_type'

Shreeshrii commented 4 years ago

Also, ketos transcribe is not finding all lines, even when using page images as input.

Please see attached html file which has samples of different errors:

transcription.zip

mittagessen commented 4 years ago

Your model error has been fixed in one of the subsequent commits. There was a bug where one of the fields wasn't default-filled for models not trained on that branch. You won't get these metadata fields magically; they have to be added during training, and they're chiefly important for the new segmenter, as they ensure that model types and image modes aren't mixed up. What kind of metadata are you expecting?
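The difference between a strict lookup and a default-filled one can be illustrated in Python. This is not the actual kraken fix: `ModelMetadata` is a hypothetical stand-in, and only the `model_type` key and the `None`/`'recognition'` check are taken from the traceback above.

```python
class ModelMetadata:
    """Hypothetical stand-in for the metadata wrapper of a loaded model."""

    def __init__(self, user_metadata):
        self.user_metadata = user_metadata

    @property
    def model_type_strict(self):
        # Same style of lookup as in the traceback: KeyError if the field is absent.
        return self.user_metadata["model_type"]

    @property
    def model_type(self):
        # Default-filled lookup: models saved without the field yield None.
        return self.user_metadata.get("model_type")


def is_recognition_model(meta):
    # Same membership test as in the traceback: None counts as a recognizer.
    return meta.model_type in (None, "recognition")
```

With the default-filled property, a model trained before the field existed is treated as a recognition model instead of raising an error.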

  • Incorrect line segmentation for indented lines for verse.
  • verse numbers at end of line after white space being ignored.
  • No line recognized in one image.
  • One line recognized in one image.
  • Page range in a Table of contents format page being treated as a separate column.

Yes, that's the old segmenter for you. It's less than ideal for anything but Latin, Greek, Hebrew, and to a lesser extent Arabic. The new one fixes all that by being trainable, but it works fundamentally differently and the ketos transcribe/ketos extract workflow won't be adapted for it.

Shreeshrii commented 4 years ago

Is there any documentation on how to use the new segmenter?

Your model error has been fixed in one of the subsequent commits.

Which branch should I use to get it?

Shreeshrii commented 4 years ago

I built kraken from the blla branch.

(py36) ubuntu@tesseract-ocr:~/kraken/deva/pages$ kraken --version
kraken, version 3.0.0.0b4.dev9

A different error now:

(py36) ubuntu@tesseract-ocr:~/kraken/deva/pages$  ketos transcribe *.tif --prefill /home/ubuntu/kraken/deva/devanew_8.mlmodel
WARNING: Logging before flag parsing goes to stderr.
W0220 15:58:40.596257 134746554844800 __init__.py:74] TensorFlow version 1.15.0 detected. Last version known to be fully compatible is 1.14.0 .
Loading ANN✓

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py36/bin/ketos", line 8, in <module>
    sys.exit(cli())
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/ketos.py", line 976, in transcription
    ti.add_page(im, res, records=preds)
  File "/home/ubuntu/anaconda3/envs/py36/lib/python3.6/site-packages/kraken/transcribe.py", line 74, in add_page
    'left': 100*int(bbox[0]) / im.size[0],
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'
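The message indicates that `bbox[0]` was a tuple rather than a number, which would happen if a line is described by a list of (x, y) points instead of a flat (x0, y0, x1, y1) box. Whether that is the actual cause on this branch is an assumption. A Python sketch of a conversion that accepts both shapes, with the hypothetical helper `to_percent_box` following the percentage calculation visible in the traceback:

```python
def to_percent_box(bbox, image_size):
    """Return left/top/width/height in percent of the image size.

    Accepts either a flat (x0, y0, x1, y1) box or a list of (x, y) points.
    """
    if bbox and isinstance(bbox[0], (tuple, list)):
        # Point list: reduce it to the enclosing axis-aligned rectangle.
        xs = [int(x) for x, _ in bbox]
        ys = [int(y) for _, y in bbox]
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    else:
        x0, y0, x1, y1 = (int(v) for v in bbox)
    width, height = image_size
    return {
        "left": 100 * x0 / width,
        "top": 100 * y0 / height,
        "width": 100 * (x1 - x0) / width,
        "height": 100 * (y1 - y0) / height,
    }
```

Calling `int()` directly on the first element of a point list reproduces the TypeError shown above, which is why the shape check comes first.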
Shreeshrii commented 4 years ago

I have uploaded my training and test data and resulting model at https://github.com/Shreeshrii/kraken_devanagari .

Shreeshrii commented 4 years ago

@mittagessen Thank you for your prompt response and guidance.