mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Ability to `--lines` with a suffix for `-I` #105

Closed PonteIneptique closed 4 years ago

PonteIneptique commented 5 years ago

The idea would be to run segmentation once and then run different models on a glob-like input, using a suffix to locate the segmentation file for each image.

Right now, I deal with it like this:

    if lines and lines.startswith('suffix:'):
        lines = base_image+lines.split("suffix:")[-1]
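
For readers who want to try the idea outside of kraken, the snippet above can be wrapped into a small self-contained function. This is only a sketch: the name `resolve_lines` and the example file names are hypothetical and not part of kraken's API, and it slices off the `suffix:` prefix instead of splitting on it.

```python
from typing import Optional


def resolve_lines(base_image: str, lines: Optional[str]) -> Optional[str]:
    """Return the segmentation path to use for `base_image`.

    If `lines` has the form 'suffix:<ext>', the segmentation file is assumed
    to sit next to the image as '<image><ext>'. Any other value is returned
    unchanged, so an explicit path keeps working.
    """
    prefix = "suffix:"
    if lines and lines.startswith(prefix):
        # Drop only the leading 'suffix:' marker and append the rest.
        return base_image + lines[len(prefix):]
    return lines


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for image in ["page_01.png", "page_02.png"]:
        print(image, "->", resolve_lines(image, "suffix:.json"))
```

With this, one segmentation file per image can be reused across several recognition runs.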

mittagessen commented 5 years ago

Sorry, I don't really understand what you're trying to achieve. Are you providing multiple external segmentations to the ocr subcommand when defining more than one input file? Or are you running multiple models on the same data? The first is definitely something we should implement, although I'd prefer a format string similar to how ketos extract output can be defined.

PonteIneptique commented 5 years ago

The first, so that I can do the second later :)

mittagessen commented 5 years ago

Ah, to do the second use the ketos test command if you want to just compare the accuracy of multiple models. It will be fairly slow though because it re-encodes/re-normalizes the complete dataset for each model.

If you want to create a pull request with your code, I can merge it. Ideally it would follow a format similar to extract; we've got way too much slightly different behavior anyway.
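
A format-string variant of the same lookup could look roughly like the sketch below. The placeholder names `src` and `stem` and the function `segmentation_path` are assumptions made for illustration; they are not taken from kraken or from the actual format used by ketos extract.

```python
from pathlib import PurePosixPath


def segmentation_path(image_path: str, template: str = "{src}.json") -> str:
    """Build the segmentation file name for an image from a format string.

    Hypothetical placeholders:
      {src}  - the image path exactly as given
      {stem} - the image path without its final extension
    """
    stem = str(PurePosixPath(image_path).with_suffix(""))
    return template.format(src=image_path, stem=stem)


if __name__ == "__main__":
    print(segmentation_path("scans/page_01.png"))
    print(segmentation_path("scans/page_01.png", "{stem}.seg.json"))
```

Compared with a fixed `suffix:` marker, a template lets the user decide whether the image extension is kept or replaced.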

PonteIneptique commented 5 years ago

I'll PR ASAP. For my precise use case, it's not testing per se: I actually have a manuscript with a free interpretation of spaces, which makes it difficult even for a human reader to decide. Sometimes a space was inserted in the transcription data where it should not have been, sometimes it was left out where it should have been. For further quantitative research, we created two models: one with the space character, one without; hence the need to run both models later (and not go through segmentation twice).

mittagessen commented 5 years ago

I assume it's something in scripta continua or similar?

> For further quantitative research, we created two models: one with the space character, one without; hence the need to run both models later (and not go through segmentation twice).

The segmentation is deterministic, so running it twice doesn't really hurt except for being an additional expenditure of CPU cycles.

PonteIneptique commented 5 years ago

> Being an additional expenditure of CPU cycles.

That's what I am trying to avoid. :) Better not run it every time I update my model. :)

> I assume it's something in scripta continua or similar?

More or less, yes. There are spaces, but not always.

mrocr commented 4 years ago

@mittagessen Kindly close the topic

mittagessen commented 4 years ago

With the region code of the new segmenter we've got the ability to input multiple ALTO/PageXML files to supply existing segmentations to the ocr submodule.