mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

is it possible to train Kraken from XML ALTO #190

Closed alix-tz closed 4 years ago

alix-tz commented 4 years ago

Hello,

Is it possible to use XML ALTO files (the one exported from eScriptorium) to train Kraken from the command line? Is it possible for both segmentation AND transcription?

We are currently in a situation where we need to be able to perform training somewhere else than via eScriptorium due to the absence of GPU on our server.

dstoekl commented 4 years ago

yes 🙂


Please see my Dead Sea Scrolls textbook published at UTB/Mohr-Siebeck: http://www.utb-shop.de/9783825246815



alix-tz commented 4 years ago

Awesome! With which version of Kraken: master, blla or blla_regions?

Is there any documentation?

mittagessen commented 4 years ago

I'm still writing documentation for it, but yes. The basic syntax is:

ketos [train|segtrain] -f page *.xml

Enable augmentation (needs the albumentations package) with --augment; it improves generalization considerably, especially for the segmenter.

Regions are picked up automatically from the PageXML for segmenter training. If you don't want to train these, add the --suppress-regions switch (there's also a --suppress-baselines switch for the opposite). One can also filter and merge baselines and regions with the --[valid|merge]-[regions|baselines] options, i.e. --valid-regions, --merge-regions, --valid-baselines and --merge-baselines.
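As a rough sketch of the segmenter options described above: the helper below only prints the ketos segtrain command line, so it can be tried without kraken installed. The function name segtrain_cmd and its MODE argument are hypothetical; the flags are the ones named in this comment. The -f page value is copied from the basic syntax above; the value to use for ALTO input is not shown in this thread, so check the --help output of your installed version.

```shell
# Hypothetical helper: prints (does not run) a segmenter training command.
# Usage: segtrain_cmd MODE FILE...
#   MODE = both       train baselines and regions
#          baselines  add --suppress-regions (no region training)
#          regions    add --suppress-baselines (no baseline training)
segtrain_cmd() {
    mode=$1
    shift
    case "$mode" in
        both)      extra="" ;;
        baselines) extra=" --suppress-regions" ;;
        regions)   extra=" --suppress-baselines" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # --augment needs the albumentations package to be installed.
    echo "ketos segtrain -f page --augment${extra} $*"
}
```

For example, segtrain_cmd baselines *.xml prints a command that trains on baselines only; run the printed line yourself once it looks right.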

You have to take some care about which kind of image is fed to the networks. The default segmentation network takes RGB images, so any input image (binary, grayscale, color) is fine. The default recognition network has 1-channel inputs, so RGB images will be auto-converted to grayscale. If you want to train on binary data, set the --force-binarization switch. The ocr subcommand will print a scary warning if you feed in incompatible images.
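A similar sketch for the recognizer, again only printing the command. The function name train_cmd and its INPUT_KIND argument are hypothetical; --force-binarization is the switch mentioned above, and -f page is taken from the basic syntax.

```shell
# Hypothetical helper: prints (does not run) a recognizer training command.
# Usage: train_cmd INPUT_KIND FILE...
#   INPUT_KIND = as-is   pass images unchanged (RGB is auto-converted to grayscale)
#                binary  add --force-binarization to train on binary data
train_cmd() {
    kind=$1
    shift
    case "$kind" in
        as-is)  flags="" ;;
        binary) flags=" --force-binarization" ;;
        *) echo "unknown input kind: $kind" >&2; return 1 ;;
    esac
    echo "ketos train -f page${flags} $*"
}
```

Whichever variant is used for training should match the images later fed to the ocr subcommand, since that is where the incompatibility warning appears.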

EDIT: I'm going to merge blla_regions into blla this week. For segmenter training I recommend using blla_regions as the models are not compatible between those two branches (not enough people have trained models on blla to make going through the work to have a legacy code path worth it). For the recognizer it doesn't matter.

alix-tz commented 4 years ago

Hi Ben,

> EDIT: I'm going to merge blla_regions into blla this week. For segmenter training I recommend using blla_regions as the models are not compatible between those two branches (not enough people have trained models on blla to make going through the work to have a legacy code path worth it). For the recognizer it doesn't matter.

Since you now merged blla_regions in blla, we're good right?

mittagessen commented 4 years ago

> Since you now merged blla_regions in blla, we're good right?

Yup. If something doesn't work please open a bug report. It was a fairly large merge with quite a bit of manual conflict resolution.

gabays commented 4 years ago

Hello, I cannot find the blla branch, nor the segtrain command anywhere. Is the documentation for segmentation training published? Best, Simon

dstoekl commented 4 years ago

Dear Simon, the branch was merged into the master branch many months ago. segtrain is similar to train, with the resize option for training on top of existing models, e.g. cbad models.

It would be good to agree on a basic ontology. For my literary manuscripts, I use the region types "Main", "Margin" (marginal notes belonging to the main text), "Paratext" (custodes, marginal numbers, running headers), "Commentary" (secondary text) and "Title" (rare), and the line types default (regular lines; you don't specify it in the UI), correction (for interlinear additions) and numbering (e.g. for tiny letters/numbers above words indicating a reordering of words).

The --help output gives you all options. You can e.g. turn off training for regions and/or line types, or merge classes.

The most recent kraken version has much improved postprocessing for line segmentation.

Kindly Daniel



gabays commented 4 years ago

Dear Daniel,

Thank you for your quick answer, it is just that when I pip install kraken I get that:

(env) gabays@gabays-computer:~/app/kraken$ ketos segtrain -f Data_OCRcat_typo/*/ALTO4eScriptorium/*xml
Usage: ketos [OPTIONS] COMMAND [ARGS]...
Try 'ketos --help' for help.

Error: No such command 'segtrain'.
(env) gabays@gabays-computer:~/app/kraken$ ketos --help
Usage: ketos [OPTIONS] COMMAND [ARGS]...

Options:
  --version           Show the version and exit.
  -v, --verbose
  -s, --seed INTEGER  Seed for numpy's and torch's RNG. Set to a fixed value
                      to ensure reproducable random splits of data

  --help              Show this message and exit.

Commands:
  extract     Extracts image-text pairs from a transcription environment...
  linegen     Generates artificial text line training data.
  publish     Publishes a model on the zenodo model repository.
  test        Evaluate on a test set.
  train       Trains a model from image-text pairs.
  transcribe  Creates transcription environments for ground truth...

Regarding zones, if you point me to your training data I would be more than happy to adapt mine to your recommendations.

Finally, I have training data prepared with Transkribus I would like to use.

Can these two types of data be mixed for training?

Thank you so much for the help!

Simon

jjarosch commented 3 years ago

> when I pip install kraken I get that:

I also installed kraken via pip, and ketos --version says 2.0.8 – which is over a year old. The discussion above seems to refer to much more recent versions.

I’m now trying to install the most recent release from the cloned repo (pip install . within the repo directory).
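One way to check whether an installed ketos already offers segtrain, sketched under the assumption that ketos --help lists subcommands under a "Commands:" heading, as in the output quoted earlier in this thread. The function name has_ketos_command is hypothetical.

```shell
# Hypothetical helper: reads `ketos --help` output on stdin and succeeds
# (exit status 0) if the given subcommand is listed under "Commands:".
has_ketos_command() {
    awk -v cmd="$1" '
        /^Commands:/ { in_cmds = 1; next }   # start of the subcommand list
        in_cmds && $1 == cmd { found = 1 }   # first word of the line is the name
        END { exit (found ? 0 : 1) }
    '
}

# Usage (requires kraken to be installed):
#   ketos --version
#   ketos --help | has_ketos_command segtrain \
#       || echo "no segtrain: this kraken install predates the merge"
```

If the check fails on a pip-installed 2.0.8, installing from the cloned repository with pip install . as described above is the route this thread suggests.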