solivr / tf-crnn

TensorFlow convolutional recurrent neural network (CRNN) for text recognition
GNU General Public License v3.0

How should I build a character dictionary config file #40

Closed PonteIneptique closed 6 years ago

PonteIneptique commented 6 years ago

I am about to run a first test on my own data, but there is no documentation on how to create a file similar to https://github.com/solivr/tf-crnn/blob/master/tf_crnn/data/lookup_letters_digits_symbols.json for my own alphabet, and of course I am already running into an error :) :

Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 79, in run
    discarded_chars=discarded_chars)
  File "/script/tf_crnn/config.py", line 69, in check_input_file_alphabet
    filename, extra_chars)
AssertionError: There are 30 unknown chars in /sources/output/train/groundtruth.csv :
 {'\u0367', '\ua76f', '\u036c', '\u014d', '\u2022', '\u1e8f', '\u1e6b', '\u1de4',
 '\u0364', '\ua759', '\xf1', '\u0129', '\u1d49', '\u0291', '\u204a', '\ue5dc', 
'\u01ba', '\u0363', '\ua751', '\u0365', '\u0113', '\u1d48', '\u0167', '\ufeff', 
'\xe9', '\u016b', '\u0305', '\ue681', '\u0153', '\u0101'}

Any hint on how it should be built would be greatly appreciated. I'll probably add it to my ocropus script :)
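
For readers hitting the same assertion, it can help to see what the escaped code points in the error actually are before building a lookup file. The snippet below is a minimal sketch that uses only Python's standard `unicodedata` module; the `describe` helper is a hypothetical name introduced here and is not part of tf-crnn. The list holds a handful of the characters copied from the error message above.

```python
import unicodedata

# A few of the characters reported in the AssertionError above.
unknown_chars = ['\ufeff', '\xe9', '\u0153', '\u0367', '\ue5dc']


def describe(char):
    """Return (code point, Unicode category, name) for one character.

    Hypothetical helper for illustration only, not part of tf-crnn.
    """
    code_point = "U+{:04X}".format(ord(char))
    category = unicodedata.category(char)
    # Private-use and unassigned code points have no name, hence the default.
    name = unicodedata.name(char, "<no name>")
    return code_point, category, name


for char in unknown_chars:
    print(*describe(char))
```

Note that `'\ufeff'` is the byte order mark, which usually comes from a file saved as UTF-8 with BOM rather than from the transcriptions themselves. Reading the file with `encoding="utf-8-sig"` strips it, so it probably does not belong in the alphabet.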

PonteIneptique commented 6 years ago

It took me a bit of time but I did it.

  1. Build a file with one character per line.
  2. Run, in Python:
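from json import dump
from tf_crnn.hlp.alphabet_helpers import get_alphabet_units_form_csv, make_json_lookup_alphabet

# Read the alphabet units from the one-character-per-line file
alphabet_units = get_alphabet_units_form_csv("/your/file/with/one/char/a/line.csv")

# Build the lookup table and save it as the JSON config file
with open("/path/to/chars.json", "w") as f:
    dump(make_json_lookup_alphabet(alphabet_units), f)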
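
Step 1 can be automated instead of typing the characters by hand. The following is a rough sketch using only the standard library: it collects every distinct character from the transcriptions and writes them one per line. The function names are hypothetical, and the CSV layout (`;` as delimiter, transcription in the second column) is an assumption that should be adjusted to match the actual ground truth file.

```python
import csv


def collect_characters(labels):
    """Return the sorted list of distinct characters found in the label strings."""
    chars = set()
    for label in labels:
        chars.update(label)
    # Drop the byte order mark and line breaks, which are not real symbols.
    chars -= {'\ufeff', '\n', '\r'}
    return sorted(chars)


def read_labels(csv_path, delimiter=';', label_column=1):
    """Yield the transcription of each row.

    The delimiter and column index are assumptions about the CSV layout.
    """
    # "utf-8-sig" silently strips a leading byte order mark if there is one.
    with open(csv_path, newline='', encoding='utf-8-sig') as f:
        for row in csv.reader(f, delimiter=delimiter):
            if len(row) > label_column:
                yield row[label_column]


def write_one_char_per_line(chars, out_path):
    """Write each character on its own line, as described in step 1."""
    with open(out_path, 'w', encoding='utf-8') as f:
        for char in chars:
            f.write(char + '\n')


# Example usage (paths are placeholders):
# chars = collect_characters(read_labels("/path/to/groundtruth.csv"))
# write_one_char_per_line(chars, "/your/file/with/one/char/a/line.csv")
```

The resulting file can then be passed to the snippet in step 2. It is worth reviewing the collected characters first, since stray symbols in the transcriptions would otherwise end up in the alphabet.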