mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

[Suggestion] Reorganize Kraken repo #57

Closed ghost closed 5 years ago

ghost commented 7 years ago

Peace be upon you, here are some suggestions for you @mittagessen

  1. Create repository kraken-ocr

  2. For kraken-ocr, include:

    • kraken: Kraken Open Source OCR Engine (main repository)
    • kraken-models: Kraken recognition models for various languages (Beta)
    • kraken-scripts: Scripts to automate various aspects of Kraken
    • kraken-clstm: A small C++ implementation of LSTM networks, focused on OCR
    • kraken-research: Research and documents on Kraken

  3. For kraken, create a wiki that includes:

    • Kraken Ocr Part 1: Building CLSTM https://youtu.be/ST_XrfcCpKE
    • Kraken Ocr Part 3: Creating and transcribing the HTML file https://youtu.be/No87TADb9zQ
    • Kraken Ocr Part 4: Training a new CLSTM model https://youtu.be/Ec9Qi7S8cvA

    Also mention that it uses a modified version of the clstm separate-derivs

  4. For kraken, add the tags kraken, kraken-ocr, ocr-engine, machine-learning

  5. For kraken-scripts include: Training

  6. For Training include:

    • pretrain.sh

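      #!/bin/bash

      # Echo each command (-x) and export every variable that is set (-a).
      set -x
      set -a
      # Shuffle the manifest, then keep the first 100 lines for testing
      # and everything after them for training.
      sort -R manifest.txt > /tmp/manifest2.txt
      sed 1,100d /tmp/manifest2.txt > train.txt
      sed 100q /tmp/manifest2.txt > test.txt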

  7. For kraken-clstm, fork the clstm separate-derivs and modify clstm.h & extras.h by changing isnan to std::isnan

  8. For kraken-research, include the PDF of "Important New Developments in Arabographic Optical Character Recognition". Future research and recognition tests could also be posted there.
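
The isnan change in the kraken-clstm item could be scripted instead of edited by hand. This is only a rough sketch, assuming the headers call isnan unqualified; patch_isnan is a hypothetical helper name, and a call nested directly inside another (isnan(isnan(...))) would need a second pass:

```shell
# Hypothetical helper: rewrite bare isnan( calls to std::isnan( in the given files.
patch_isnan() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp)
        # Match isnan( only at line start or after a character that is not
        # ':' or part of an identifier, so std::isnan( and names such as
        # myisnan( are left untouched.
        sed -E 's/(^|[^:_[:alnum:]])isnan[[:space:]]*\(/\1std::isnan(/g' "$f" > "$tmp" \
            && cat "$tmp" > "$f"   # write back in place, keeping file permissions
        rm -f "$tmp"
    done
}
```

Run from the clstm checkout, it would be called as: patch_isnan clstm.h extras.h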

amitdo commented 7 years ago

Create repository kraken-ocr

@mittagessen, if you want, you can ask GitHub to transform a forked repo into an independent repo. I think the issues and PRs will be kept.

ghost commented 7 years ago

@mittagessen Thanks to @amitdo for his help

  1. For ketos linegen, set --disable-degradation as the default.

ghost commented 7 years ago

  1. For kraken-scripts, include eval.py

iShinJini commented 6 years ago

Hi! I'm very new to this and trying to learn. I need help with CLSTM for training purposes. I don't even know how to execute the code or train the models. I tried using "ketos train" and it tells me it is unable to start from scratch.

I did the ketos transcription and ketos extract steps, but I am stuck at training. Is there any way to train without using Vagrant? Can anyone provide a guide on using Kraken CLSTM or ketos for training?

mittagessen commented 6 years ago

Sorry for the delayed answer, I've been on vacation for a few weeks. The clstmocrtrain binary and train.sh script can also be found here and here. You will have to change the last line in train.sh to point to the clstmocrtrain binary location.

Mind, the pytorch branch is nearing completion, so training with ketos train will work in a while (in fact it does already, but it is largely untested and there are probably lots of bugs).

iShinJini commented 6 years ago

@mittagessen Thanks for the reply and sorry for the trouble. May I know what this problem is? I googled it and could not find any fix for it. [image]

mittagessen commented 6 years ago

The script splits off 100 lines as a test set and you've got fewer than 101 lines of training data. Just change the 100 in both commands to something lower than your number of lines:

sed 100q manifest.txt > test.txt
sed 1,100d manifest.txt > train.txt
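
To avoid editing the numbers by hand, the test size could be capped from the manifest's line count. A minimal sketch, assuming the same manifest.txt layout as pretrain.sh; split_manifest, its arguments, and the one-tenth fallback are hypothetical choices, not part of kraken or the original scripts:

```shell
# Hypothetical helper: split a manifest into test.txt and train.txt.
# usage: split_manifest MANIFEST [TEST_SIZE]   (TEST_SIZE defaults to 100)
split_manifest() {
    local manifest=$1 test_size=${2:-100}
    local total shuffled
    # Arithmetic expansion strips the padding some wc implementations add.
    total=$(( $(wc -l < "$manifest") ))
    if [ "$total" -lt 2 ]; then
        echo "need at least 2 lines in $manifest" >&2
        return 1
    fi
    # If the requested test set would swallow every line, fall back to
    # one tenth of the data (at least one line) so training data remains.
    if [ "$test_size" -ge "$total" ]; then
        test_size=$(( total / 10 ))
        if [ "$test_size" -lt 1 ]; then
            test_size=1
        fi
    fi
    shuffled=$(mktemp)
    sort -R "$manifest" > "$shuffled"                # shuffle as in pretrain.sh
    sed "${test_size}q" "$shuffled" > test.txt       # first N lines for testing
    sed "1,${test_size}d" "$shuffled" > train.txt    # the rest for training
    rm -f "$shuffled"
}
```

With a 30-line manifest, split_manifest manifest.txt 100 would put 3 lines in test.txt and 27 in train.txt instead of leaving train.txt empty.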

iShinJini commented 6 years ago

@mittagessen Thank you !!

ghost commented 6 years ago

@mittagessen tmbdev just released a new repository for "Ocropy 2.0" at https://github.com/tmbdev/ocropy2 which, as he stated earlier, will include improvements to the layout & text line analysis and GPU integration, along with various other improvements. What do you think?

mittagessen commented 6 years ago

The layout analysis is certainly interesting, although it doesn't really change issues with complex semantic layout (newspaper, manuscripts, ...), subpar performance especially on Arabic, and reading order. His implementation replaces the signal processing line seed generation in the segmenter by a pixel classification network; spreading those seeds based on boxmaps and distance as in the original segmenter. Nevertheless it is a rather simple method to solve the line separation issue I've encountered using both object detection and pixel classification (of whole lines) networks.

The new ML backend using pytorch is somewhat similar to the kraken pytorch branch that's going to be the 1.0 release. The layers are presumably somewhat different, as I based mine on VGSL, but there shouldn't be major differences apart from quality-of-life ones (serialization, backward compatibility, ...).

PS: There is also a similar trainable layout analysis at https://github.com/dhlab-epfl/dhSegment

ghost commented 6 years ago

@mittagessen what are your thoughts on:

By the way, I think tmbdev has moved on to ocropy3

mittagessen commented 6 years ago

I haven't used calamari but I know the Wuerzburg people and it should do what it says on the tin. Although I have a personal aversion to ensemble methods the one they've implemented doesn't "feel" as arbitrary as many others.

Seam carving works for certain kinds of texts the current segmenter fails on and can be combined with something like https://github.com/mittagessen/seg to handle even fairly convoluted layouts with marginalia, decoration and interlinear notes. Unfortunately, it fails at Arabic script with vocalization and another system extracting columns, ordering lines, etc. is still needed.

ghost commented 6 years ago

@mittagessen NVIDIA released a new paper called Noise2Noise. It shows a new method to clean/denoise images without needing clean ground-truth images; the network is trained on noisy images only. Have a look: https://www.youtube.com/watch?v=P0fMwA3X5KI https://arxiv.org/pdf/1803.04189.pdf https://news.developer.nvidia.com/ai-can-now-fix-your-grainy-photos-by-only-looking-at-grainy-photos/

ghost commented 5 years ago

@mittagessen please close this topic, I opened it a long time ago.