[Suggestion] Reorganize Kraken repo

ghost commented 7 years ago

Peace be upon you, here are some suggestions for you @mittagessen

Create repository kraken-ocr
For kraken-ocr include:
- kraken: Kraken Open Source OCR Engine (main repository)
- kraken-models: Kraken recognition models for various languages (Beta)
- kraken-scripts: Scripts to automate various aspects of Kraken
- kraken-clstm: A small C++ implementation of LSTM networks, focused on OCR
- kraken-research: Research and documents on Kraken
For kraken create a wiki, include:
- Kraken Ocr Part 1: Building CLSTM https://youtu.be/ST_XrfcCpKE
- Kraken Ocr Part 3: Creating and transcribing the HTML file https://youtu.be/No87TADb9zQ
- Kraken Ocr Part 4: Training a new CLSTM model https://youtu.be/Ec9Qi7S8cvA
Also mention that it uses a modified version of the clstm separate-derivs
For kraken add the tags: kraken, kraken-ocr, ocr-engine, machine-learning
For kraken-scripts include: Training
For Training include:
- pretrain.sh
  #!/bin/bash

  set -x
  set -a
  sort -R manifest.txt > /tmp/manifest2.txt
  sed 1,100d /tmp/manifest2.txt > train.txt
  sed 100q /tmp/manifest2.txt > test.txt

- train.sh
  #!/bin/bash

  set -x
  set -a
  report_every=1000
  save_every=1000
  maxtrain=50000
  target_height=48
  dewarp=center
  display_every=1000
  test_every=1000
  nhidden=100
  lrate=1e-4
  save_name=arabic
  clstmocrtrain train.txt test.txt

For kraken-clstm fork the clstm separate-derivs and modify clstm.h & extras.h by changing isnan to std::isnan
For kraken-research include the PDF of "Important New Developments in Arabographic Optical Character Recognition". Future research and recognition tests could also be posted there.
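The isnan change suggested above for kraken-clstm could be scripted roughly as follows. This is only a sketch: `patch_isnan` is a hypothetical helper name, it assumes GNU sed (for `-i` and `\b`), and it has not been checked against the actual contents of clstm.h and extras.h.

```shell
#!/bin/bash
# Hypothetical helper (not part of clstm or kraken): qualify bare isnan( calls
# with std:: in the given files. Assumes GNU sed.
patch_isnan() {
    # Step 1: prefix every isnan( that starts at a word boundary with std::.
    # Step 2: collapse the std::std:: produced where the call was already qualified.
    sed -i \
        -e 's/\bisnan(/std::isnan(/g' \
        -e 's/std::std::isnan(/std::isnan(/g' \
        "$@"
}
```

Usage would be something like `patch_isnan clstm.h extras.h` from the root of the fork. Reviewing the result with `git diff` before committing is advisable, since the substitution is purely textual.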

amitdo commented 7 years ago

Create repository kraken-ocr

@mittagessen, if you want, you can ask GitHub to transform a forked repo into an independent repo. I think the issues and PRs will be kept.

ghost commented 7 years ago

@mittagessen Thanks to @amitdo for his help

For ketos linegen set --disable-degradation as default.

ghost commented 7 years ago

For kraken-scripts include eval.py

iShinJini commented 6 years ago

Hi! I'm very new to this and trying to learn. I need help with using CLSTM for training. I don't know how to run the code or train models. When I try "ketos train", it tells me it is unable to start from scratch.

I've done the ketos transcription and ketos extract steps, but training is still left. Is there a way to train without using Vagrant? Can anyone provide a guide on using Kraken with CLSTM or ketos for training?

mittagessen commented 6 years ago

Sorry for the delayed answer, I've been on vacation for a few weeks. The clstmocrtrain binary and train.sh script can also be found here and here. You will have to change the last line in train.sh to point to the clstmocrtrain binary location.

Mind, the pytorch branch is nearing completion, so training with ketos train will work in a while (in fact it does already, but it is largely untested and there are probably lots of bugs).

iShinJini commented 6 years ago

@mittagessen Thanks for the reply and sorry for the trouble. May I know what this problem is? I googled it and could not find a fix for it.

mittagessen commented 6 years ago

The script splits off 100 lines as a test set and you've got fewer than 101 lines of training data. Just adjust the line counts in both commands to something lower than your number of lines:

sed 100q manifest.txt > test.txt
sed 1,100d manifest.txt > train.txt
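As a rough sketch of that adjustment, the hold-out size can be derived from the manifest length instead of being hard-coded. The function name `split_manifest` and the 10% / at most 100 lines policy are assumptions for illustration, not something the original scripts do.

```shell
#!/bin/bash
# Hypothetical helper (not part of kraken or clstm): shuffle a manifest and
# split it into test.txt and train.txt, scaling the test set to the data size.
split_manifest() {
    local manifest=${1:-manifest.txt}
    local total ntest shuffled
    total=$(wc -l < "$manifest")
    # Assumed policy: hold out 10% of the lines, at least 1 and at most 100.
    ntest=$(( total / 10 ))
    if [ "$ntest" -lt 1 ]; then ntest=1; fi
    if [ "$ntest" -gt 100 ]; then ntest=100; fi

    shuffled=$(mktemp)
    sort -R "$manifest" > "$shuffled"
    sed "${ntest}q" "$shuffled" > test.txt      # first ntest lines
    sed "1,${ntest}d" "$shuffled" > train.txt   # everything after them
    rm -f "$shuffled"
}
```

Calling `split_manifest manifest.txt` in the training directory writes test.txt and train.txt there. The manifest needs at least two lines, otherwise train.txt ends up empty.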

iShinJini commented 6 years ago

@mittagessen Thank you !!

ghost commented 6 years ago

@mittagessen tmbdev just released a new repository for "Ocropy 2.0" at https://github.com/tmbdev/ocropy2. He stated earlier that it would include improvements to the layout and text line analysis and GPU integration, along with various other improvements. What do you think?

mittagessen commented 6 years ago

The layout analysis is certainly interesting, although it doesn't really change issues with complex semantic layout (newspaper, manuscripts, ...), subpar performance especially on Arabic, and reading order. His implementation replaces the signal processing line seed generation in the segmenter by a pixel classification network; spreading those seeds based on boxmaps and distance as in the original segmenter. Nevertheless it is a rather simple method to solve the line separation issue I've encountered using both object detection and pixel classification (of whole lines) networks.

The new ML backend using pytorch is somewhat similar to the kraken pytorch branch that's going to be the 1.0 release. The layers are presumably somewhat different as I oriented myself on VGSL but there shouldn't be major non-quality-of-life (serialization, backward compatibility, ...) differences

PS: There is also a similar trainable layout analysis at https://github.com/dhlab-epfl/dhSegment

ghost commented 6 years ago

@mittagessen what are your thoughts on:

By the way, I think tmbdev has moved on to ocropy3

mittagessen commented 6 years ago

I haven't used calamari but I know the Wuerzburg people and it should do what it says on the tin. Although I have a personal aversion to ensemble methods the one they've implemented doesn't "feel" as arbitrary as many others.

Seam carving works for certain kinds of texts the current segmenter fails on and can be combined with something like https://github.com/mittagessen/seg to handle even fairly convoluted layouts with marginalia, decoration and interlinear notes. Unfortunately, it fails at Arabic script with vocalization and another system extracting columns, ordering lines, etc. is still needed.

ghost commented 6 years ago

@mittagessen There is a new paper released by NVIDIA called Noise2Noise. It shows a new method to clean/de-noise images without needing clean ground-truth images; they train using only noisy images. Have a look: https://www.youtube.com/watch?v=P0fMwA3X5KI https://arxiv.org/pdf/1803.04189.pdf https://news.developer.nvidia.com/ai-can-now-fix-your-grainy-photos-by-only-looking-at-grainy-photos/

ghost commented 5 years ago

@mittagessen please close this topic, I opened it a long time ago.

mittagessen / kraken

[Suggestion] Reorganize Kraken repo #57
