solivr / tf-crnn

TensorFlow convolutional recurrent neural network (CRNN) for text recognition
GNU General Public License v3.0

Small script for converting ocropus data #39

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 6 years ago

Hi @solivr ! The idea of evaluating your tool is really interesting to us. We have worked with Ocropus on medieval literary manuscripts ( https://graal.hypotheses.org/786 ) and I am really looking forward to comparing the results on handwriting that is close to typescript. If you are in any way interested, I'll post the small script I put together for converting our data to your input format ( https://github.com/PonteIneptique/ocropus-to-tf-crnn ) and keep you updated with the statistics :)

solivr commented 6 years ago

Hi @PonteIneptique Thanks for taking this initiative and sharing it! I am certainly interested in the statistics, so please keep me updated. :)

PonteIneptique commented 6 years ago

Hey :) Here are my first results, without touching the original configuration, though.

INFO:tensorflow:Finished evaluation at 2018-06-29-14:33:43
INFO:tensorflow:Saving dict for global step 2750: eval/CER = 0.035666704, eval/accuracy = 0.5859507, global_step = 2750, loss = inf

The loss=inf seems weird to me. I sometimes run into this log message

2018-06-29 14:29:34.027754: W tensorflow/core/util/ctc/ctc_loss_calculator.cc:144] No valid path found.

which apparently is tied to the input: https://stackoverflow.com/questions/45130184/ctc-loss-error-no-valid-path-found
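The usual cause mentioned in that thread is a label sequence longer than the number of CTC time frames the network produces for the line image, which makes the loss infinite. A minimal sketch of a pre-filter for such samples (the `downsample_factor`, the helper names, and the sample tuples are all assumptions for illustration, not part of tf-crnn):

```python
# Sketch: CTC loss becomes infinite when a transcription is longer than the
# number of time frames the network outputs for that image. Filtering those
# samples out of the training data avoids "No valid path found" warnings.
# The width-to-frames ratio is an assumption (it depends on the CNN strides).

def n_ctc_frames(image_width: int, downsample_factor: int = 8) -> int:
    """Rough number of time steps the recurrent layer sees (hypothetical factor)."""
    return max(1, image_width // downsample_factor)

def is_trainable(image_width: int, label: str, downsample_factor: int = 8) -> bool:
    """Keep only samples whose transcription fits in the available frames."""
    return len(label) <= n_ctc_frames(image_width, downsample_factor)

samples = [
    (400, "bin leseuesqes"),                            # 50 frames, 14 chars: kept
    (64, "a very long transcription for a tiny crop"),  # 8 frames: dropped
]
kept = [s for s in samples if is_trainable(*s)]
```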

For your information, my training data looks like

010013 bin leseuesqes⁊leshautespersonesdedui

and performs at around 95% with Ocropus 1 (I don't remember the exact number).
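For anyone curious about what the conversion script does, here is a sketch, assuming the usual ocropus layout of paired line files (`0001.png` with its transcription in `0001.gt.txt`) and a simple tab-separated `path<TAB>label` file on the tf-crnn side. The function name and the exact CSV dialect are assumptions; see the linked ocropus-to-tf-crnn repo for the real script.

```python
# Hypothetical ocropus -> tf-crnn conversion: pair each *.gt.txt transcription
# with its *.png line image and write one "image_path\tlabel" row per line.
import csv
import pathlib
import tempfile

def convert(ocropus_dir: str, csv_out: str) -> int:
    """Write one tab-separated (image_path, label) row per ocropus line."""
    rows = []
    for gt in sorted(pathlib.Path(ocropus_dir).glob("*.gt.txt")):
        png = gt.with_name(gt.name.replace(".gt.txt", ".png"))
        if png.exists():
            rows.append((str(png), gt.read_text(encoding="utf-8").strip()))
    with open(csv_out, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    return len(rows)

# Tiny self-contained demo with fake data
with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "010013.gt.txt").write_text("bin leseuesqes", encoding="utf-8")
    (root / "010013.png").write_bytes(b"")  # placeholder for the line image
    n = convert(d, str(root / "out.csv"))
```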

solivr commented 6 years ago

Great that you made it work! Concerning the infinite loss, it is indeed due to the tf.nn.ctc_loss in this line. Concerning the 95% performance, does this mean that the CER is 5% or that the WER is 5%? The printed info you get for the evaluation corresponds to CER and 1 - WER (accuracy is not a very explicit name, I may change this...)
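To make the CER vs. 1 - WER distinction concrete, both are edit distances normalized by the reference length, computed over characters for CER and over whitespace-split words for WER. A pure-Python sketch (not tf-crnn's actual evaluation code):

```python
# Minimal CER/WER computation via Levenshtein edit distance.

def levenshtein(a, b) -> int:
    """Classic edit distance over two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: char edits divided by reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

cer("abcd", "abcf")      # one substitution over four chars -> 0.25
wer("a b c", "a x c")    # one substituted word out of three
```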

PonteIneptique commented 6 years ago

Ocropy does not evaluate WER, unfortunately ;)

PonteIneptique commented 6 years ago

One of the things that interests me in this algorithm, compared to Ocropus, is the speed of training plus the elasticity on unseen data. I am going to evaluate this later though :)

SeguinBe commented 6 years ago

BTW I notice you seem to work on binarized data? You would probably get better results by using the images directly, without binarization.

PonteIneptique commented 6 years ago

Yeah. Unfortunately, our training data were definitely produced from these binarized images, so... it might be complicated to go back to the original ones.

PonteIneptique commented 6 years ago

Maybe you know of a script which would allow me to easily match my binarized data with the original ones?

solivr commented 6 years ago

Would you mind sharing some numbers, like the size of the training and eval/test sets and the training time? I would be curious to know.

SeguinBe commented 6 years ago

I don't know the ocropy format at all, but if you have a file with the bounding-box coordinates of the line extraction, you could go back to the original document to extract the unaltered lines.
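When the coordinates are lost, locating each binarized line crop inside the original page by template matching is one fallback. A numpy-only sketch of normalized cross-correlation (a stand-in for OpenCV's `cv2.matchTemplate` with `TM_CCOEFF_NORMED`; the arrays here are synthetic placeholders, not real page images):

```python
# Sketch: find the (row, col) where a line crop best matches inside a page
# image, using normalized cross-correlation. Brute-force and slow on real
# pages; cv2.matchTemplate is the practical choice there.
import numpy as np

def match_template(page: np.ndarray, line: np.ndarray) -> tuple:
    """Return (row, col) of the window most correlated with the line crop."""
    lh, lw = line.shape
    tpl = line - line.mean()
    tstd = line.std()
    best, best_pos = -np.inf, (0, 0)
    for r in range(page.shape[0] - lh + 1):
        for c in range(page.shape[1] - lw + 1):
            win = page[r:r + lh, c:c + lw]
            # normalized cross-correlation score in [-1, 1]
            score = ((win - win.mean()) * tpl).sum() / (win.std() * tstd * win.size + 1e-12)
            if score > best:
                best, best_pos = float(score), (r, c)
    return best_pos

rng = np.random.default_rng(0)
page = rng.random((20, 30))          # stand-in for the original page image
line = page[5:9, 10:18].copy()       # stand-in for an extracted line crop
pos = match_template(page, line)     # recovers the crop's top-left corner
```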

PonteIneptique commented 6 years ago

Unfortunately we lost this information for some lines while cleaning them, hence the question :) I have the numbers on my Linux partition. It's 21:40 here, I'll get back to you :)

PonteIneptique commented 6 years ago

I am currently trying to match the images back with OpenCV to get non-binarized, grayscale data.

Here is the information:

PonteIneptique commented 6 years ago

So, I have tried to rematch my binarized training data to my current binarized data, but I do not have enough confidence in the result (my try-outs are available here: https://github.com/PonteIneptique/template_matching/blob/master/Ocropus%20Rematch%20.ipynb ): the process behind these training data spanned over 3 years, the original data are in a different shape than today's, and looking at the files I am pretty sure I lost the original columns...

So, getting back to non-binarized data would be a long process that cannot be fully automated with the limited knowledge I have in CV :/