KrakenInvalidModelException when using clstm file from kraken-models

tidoe commented 9 months ago

Hi,

I want to process some Syriac texts using your syriac-monotype model. I receive the following errors:

Trying via command line

kraken -i image.jpeg lines.json binarize segment ocr -m kraken-models/clstm/toy/test.clstm

Output:

/xxx/env/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: dlopen(/xxx/env/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c106detail19maybe_wrap_dim_slowExxb
  Referenced from: <8E58E83E-9235-3324-9B6B-260614F85F69> /xxx/env/lib/python3.8/site-packages/torchvision/image.so
  Expected in:     <3F9923D2-81A5-3EC8-9739-EC0C1C816132> /xxx/env/lib/python3.8/site-packages/torch/lib/libc10.dylib
  warn(f"Failed to load image Python extension: {e}")
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN kraken-models/clstm/toy/test.clstm  ✗

Trying via Python

import os
from kraken.lib import vgsl

model_path = os.path.join("kraken-models", "clstm", "syriac-monotype", "syriac.clstm")
model = vgsl.TorchVGSLModel.load_model(model_path)

Output:

scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Traceback (most recent call last):
  File "/xxx/env/lib/python3.8/site-packages/kraken/lib/vgsl.py", line 283, in load_model
    mlmodel = MLModel(path)
  File "/xxx/env/lib/python3.8/site-packages/coremltools/models/model.py", line 340, in __init__
    self.__proxy__, self._spec, self._framework_error = _get_proxy_and_spec(
  File "/xxx/env/lib/python3.8/site-packages/coremltools/models/model.py", line 132, in _get_proxy_and_spec
    specification = _load_spec(filename)
  File "/xxx/env/lib/python3.8/site-packages/coremltools/models/utils.py", line 226, in load_spec
    spec.ParseFromString(f.read())
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/message.py", line 202, in ParseFromString
    return self.MergeFromString(serialized)
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1128, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    pos = field_decoder(buffer, new_pos, end, self, field_dict)
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 726, in DecodeField
    if value._InternalParse(buffer, pos, new_pos) != new_pos:
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 1188, in InternalParse
    new_pos = local_SkipField(buffer, old_pos, end, tag_bytes)
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 1025, in SkipField
    return WIRETYPE_TO_SKIPPER[wire_type](buffer, pos, end)
  File "/xxx/env/lib/python3.8/site-packages/google/protobuf/internal/decoder.py", line 899, in _SkipFixed64
    raise _DecodeError('Truncated message.')
google.protobuf.message.DecodeError: Truncated message.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "kraken_test.py", line 8, in <module>
    model = vgsl.TorchVGSLModel.load_model(model_path)
  File "/xxx/env/lib/python3.8/site-packages/kraken/lib/vgsl.py", line 287, in load_model
    raise KrakenInvalidModelException('Failure parsing model protobuf: {}'.format(str(e))) from e
kraken.lib.exceptions.KrakenInvalidModelException: Failure parsing model protobuf: Truncated message.

What can I do to get the model running? I am currently using the latest version of kraken, but I could also switch to another one if necessary.

Best tidoe

mittagessen commented 9 months ago

On 23/10/09 06:51AM, Tillmann Dönicke wrote:

I want to process some Syriac texts using your syriac-monotype model. I receive the following errors:

The CLSTM models were deprecated a long time ago because the library isn't maintained anymore and isn't that flexible (and, if I remember correctly, those Syriac models were only trained on artificial training data anyway). There's one model here [0] that should work, although I can't vouch for its quality. Otherwise, I vaguely remember there being some training data and models somewhere, but I need to find the person responsible for those (probably George Kiraz).

[0] https://zenodo.org/record/4699756

mittagessen commented 9 months ago

PS: The model I linked is old enough that there have been some changes in the default network architecture. That doesn't mean it won't work but retraining on the same data with a current kraken version would probably reduce its error quite a bit.

mittagessen commented 9 months ago

I've found the training data (~17k lines) but unfortunately don't have the right to share it. I'll train a new model and upload it to the repository.

tidoe commented 9 months ago

I've found the training data (~17k lines) but unfortunately don't have the right to share it. I'll train a new model and upload it to the repository.

That'd be great! I was just about to ask some follow-up questions regarding the https://zenodo.org/record/4699756 model, but now I'll wait for the new model.

mittagessen commented 9 months ago

It finished training with fairly decent validation metrics (<2% CER, <8% WER). You can get the model from the repository with:

$ kraken get 10.5281/zenodo.8425684

I'd appreciate feedback on its quality, as I don't have any information on the representativeness of the dataset or its typographic properties.

tidoe commented 9 months ago

Thank you! ... But it still doesn't work for me. :-(

I try to run it in Python:

from kraken import binarization, blla, serialization
from kraken.lib import vgsl
from PIL import Image

model = vgsl.TorchVGSLModel.load_model("omnisyr_best.mlmodel")
baseline_seg = blla.segment(Image.open("image.jpeg"), model=model)

When I downloaded the model via kraken get ..., I got a FileNotFoundError because Python was looking for the model in the project directory. So I downloaded omnisyr_best.mlmodel manually from Zenodo and placed it in the project directory. When running the Python code, it says:

RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "compiler error:  Error reading protobuf spec. validator error: Input MLMultiArray to neural networks must have dimension 1 (vector) or 3 (image-like arrays).".
  _warnings.warn(
Traceback (most recent call last):
  File "kraken_test.py", line 9, in <module>
    baseline_seg = blla.segment(Image.open("image.jpeg"), model=model)
  File "/xxx/env/lib/python3.8/site-packages/kraken/blla.py", line 318, in segment
    raise KrakenInvalidModelException(f'Invalid model type {nn.model_type} for {nn}')
kraken.lib.exceptions.KrakenInvalidModelException: Invalid model type recognition for <kraken.lib.vgsl.TorchVGSLModel object at 0x2905fc460>

I also tried it via the command line:

kraken -i image.jpeg segmentation.json segment -bl -i omnisyr_best.mlmodel

Output:

Loading ANN omnisyr_best.mlmodel    ✓
Segmenting  ✗
[10/10/23 14:21:04] ERROR    Failed processing image.jpeg: 1                                                                                  kraken.py:418

Could you provide me with a working example of how to properly use the model in Python?

mittagessen commented 9 months ago

The model is a text recognition model and you're trying to run the segmenter with it. Do you want to only get a segmentation or the actual text? Is the default segmentation model insufficient for your material?
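
A small guard can catch this mix-up before the segmenter is called. The sketch below is hypothetical: `require_model_type` is not part of kraken, and it assumes only that a loaded model exposes a `model_type` attribute, as the traceback above suggests (the value reported there is `recognition`). The stand-in object exists only so the example runs without kraken.

```python
from types import SimpleNamespace


def require_model_type(nn, expected):
    """Raise a readable error if `nn.model_type` is not `expected`.

    Hypothetical helper, not part of kraken.
    """
    actual = getattr(nn, "model_type", None)
    if actual != expected:
        raise ValueError(
            f"expected a {expected} model, got model_type={actual!r}"
        )
    return nn


# Stand-in for a loaded recognition model (assumption: real models
# expose the same `model_type` attribute).
fake_recognizer = SimpleNamespace(model_type="recognition")
require_model_type(fake_recognizer, "recognition")  # passes silently
```

Calling the same helper with a different expected type on this object would raise a `ValueError` that names the mismatch.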

To run segmentation AND recognition, you should do something like:

kraken -i image.jpeg out.txt segment -bl ocr -m omnisyr_best.mlmodel

or roughly:

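from PIL import Image

from kraken import blla, rpred
from kraken.lib import models

# Load the recognition model (not a segmentation model).
model = models.load_any("omnisyr_best.mlmodel")
im = Image.open("image.jpeg")
# Segment with the default baseline segmentation model.
baseline_seg = blla.segment(im)
# Run recognition on each segmented line.
preds = list(rpred.rpred(model, im, baseline_seg))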
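
To turn the results into plain text, a sketch like the following may help. It assumes each record yielded by `rpred.rpred` carries its recognized line in a `prediction` attribute; `records_to_text` is a hypothetical helper, and the stand-in records only imitate that attribute so the example runs without kraken.

```python
from types import SimpleNamespace


def records_to_text(records):
    """Join the recognized lines into one string, one line per record."""
    return "\n".join(record.prediction for record in records)


# Stand-in records; with kraken these would be the items in `preds`.
demo = [
    SimpleNamespace(prediction="first line"),
    SimpleNamespace(prediction="second line"),
]
print(records_to_text(demo))
```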

tidoe commented 9 months ago

Okay, thank you! The model is running now. I performed an evaluation on my collection of Syriac manuscripts and got an average CER of 64% (61% when removing Syriac diacritics).

mittagessen commented 9 months ago

On 23/10/17 02:50AM, Tillmann Dönicke wrote:

Okay, thank you! The model is running now. I performed an evaluation on my collection of Syriac manuscripts and got an average CER of 64% (61% when removing Syriac diacritics).

Arrrgh, you should have mentioned that you're working on manuscripts. You can find models for some common styles and a combined model here [0]. They should perform a lot better. 60% character accuracy (not CER) is around random output for a fairly small script such as Syriac.

[0] https://github.com/dstoekl/kraken_models/
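
For reference, the distinction between CER and character accuracy can be made concrete. The sketch below assumes CER is the Levenshtein edit distance divided by the reference length, with accuracy taken as 1 - CER. The diacritic stripping (dropping Unicode combining marks) is only one possible reading of "removing Syriac diacritics", not necessarily what was done in the evaluation above, and all function names are illustrative.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def strip_combining(text):
    """Drop combining marks (Unicode category Mn) after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


rate = cer("kitten", "sitting")
print(f"CER {rate:.2f}, accuracy {1 - rate:.2f}")
```

Under these definitions, a CER of 64% corresponds to a character accuracy of about 36%.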

dstoekl commented 9 months ago

@tidoe If you have further ground truth and can share it I will be happy to train a larger model.

mittagessen / kraken

KrakenInvalidModelException when using clstm file from kraken-models #546
