Feature Request: Character Frequency in Training Text

tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Apache License 2.0


Feature Request: Character Frequency in Training Text #221

Open Shreeshrii opened 3 years ago

Shreeshrii commented 3 years ago

@stweil How can I get a report like the "Analyse-Report Version 0.1" shown on the NZZ wiki page?

kba commented 3 years ago

It looks a bit like the output of https://github.com/eddieantonio/ocreval or https://github.com/impactcentre/ocrevalUAtion, but the PosixPath in the report implies that a Python tool produced it.

stweil commented 3 years ago

That report was generated by @JKamlah.

Shreeshrii commented 3 years ago

Thanks. I have seen such reports as part of the accuracy output from ocreval. Kraken generates them for both the training and testing sets. I think it would be useful to add this as part of tesstrain.

I found a Python script that generates similar information. It is in the tools directory of https://github.com/cmroughan/kraken_generated-data. https://github.com/wincentbalin/pytesstrain also has some useful tools, which generate a wordlist as well as unigram and bigram frequencies.
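
To illustrate the kind of report being requested, here is a minimal sketch of a character frequency count over training text. It is not taken from any of the tools linked above and is not part of tesstrain. The function names (`char_frequencies`, `read_ground_truth`, `format_report`), the `*.gt.txt` file pattern, and the NFC normalization are all assumptions made for the example.

```python
from collections import Counter
from pathlib import Path
import unicodedata


def char_frequencies(texts):
    """Count single characters (unigrams) and adjacent pairs (bigrams).

    `texts` is any iterable of strings, e.g. the lines of ground-truth files.
    Lines are normalized to NFC first (an assumption of this sketch) so that
    precomposed and decomposed forms of a character are counted together.
    """
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        line = unicodedata.normalize("NFC", text.rstrip("\n"))
        unigrams.update(line)
        # Bigrams are counted within a line only, never across line breaks.
        bigrams.update(a + b for a, b in zip(line, line[1:]))
    return unigrams, bigrams


def read_ground_truth(directory, pattern="*.gt.txt"):
    """Yield every line of every file matching `pattern` in `directory`.

    The default pattern is an assumption about how the ground-truth
    transcriptions are named; adjust it to the actual layout.
    """
    for path in sorted(Path(directory).glob(pattern)):
        yield from path.read_text(encoding="utf-8").splitlines()


def format_report(unigrams, limit=None):
    """Return a plain-text table: count, share, code point, character, name."""
    total = sum(unigrams.values())
    rows = []
    for char, count in unigrams.most_common(limit):
        name = unicodedata.name(char, "UNKNOWN")  # control characters have no name
        share = 100 * count / total
        rows.append(f"{count:8d}  {share:6.2f}%  U+{ord(char):04X}  {char!r}  {name}")
    return "\n".join(rows)
```

With a hypothetical ground-truth directory, `unigrams, bigrams = char_frequencies(read_ground_truth("data/foo-ground-truth"))` followed by `print(format_report(unigrams))` would print one row per character, most frequent first. Rare characters at the bottom of the list are the ones most likely to be underrepresented in training.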

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.