tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Create Character Count from training text #235

Closed Shreeshrii closed 5 months ago

Shreeshrii commented 3 years ago

Character frequency report Source: https://github.com/cmroughan/kraken_generated-data/blob/master/tools/count_chars.py

USAGE: count_chars.py | sort -n -r > .charcount
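At its core the linked script just tallies characters; the idea can be sketched in a few lines (illustrative only, not the GPL-licensed script itself; the sample text is made up):

```python
# Minimal character-frequency sketch (illustrative, not the linked script).
from collections import Counter

def count_chars(text):
    """Return a Counter mapping each character to its frequency."""
    return Counter(text)

# Emit "count<TAB>char" lines, ready for the external `sort -n -r`.
for char, count in count_chars("hello world").items():
    print(f"{count}\t{char}")
```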

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging cec80b73976e0e589d1b1a32491c0f060621342f into 0d972f86f4aaf88fde77e3445ff607e68866c882 - view on LGTM.com

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bertsky commented 3 years ago

Fixes #221, if I'm not mistaken.

Shreeshrii commented 3 years ago

@bertsky I copied the script from another repo and added an external sort as a quick hack to get the char count. Please feel free to modify as needed.

zdenop commented 2 years ago

Please provide a description of the input data and the needed output, so that those of us who did not have a look at the file can create a new script under the Apache license.

bertsky commented 2 years ago

We cannot use GPL code in tesstrain with Apache license.

I don't think this short script meets the threshold of originality, though. After all, it just counts characters of files in Python. My above suggestions already changed most of the script's lines to make it more useful and makefile-workable. Just apply them and strip the Kraken reference. (I am not contesting Kraken's originality, only that single file's.)

bertsky commented 2 years ago

Also, it would help to offer a rule for the makefile already.
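Such a rule might be sketched as follows (hypothetical: the charfreq target name and the *.gt.txt glob are assumptions, though tesstrain's Makefile does define GROUND_TRUTH_DIR):

```makefile
# Hypothetical make rule: character frequencies over all
# ground-truth transcriptions (target name is an assumption).
charfreq:
	find $(GROUND_TRUTH_DIR) -name '*.gt.txt' -print0 | \
		xargs -0 cat | grep -o . | sort | uniq -c | sort -rn
```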

Besides, in 314e799 I proposed similar functionality (only using shell means, i.e. grep -o . | sort | uniq -c | sort -rn) – it does not show codepoint names via unicodedata.name, but apart from that it should be the same.

Or abandon this PR and just merge #260.
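That shell pipeline can be demonstrated end to end (the sample file and its contents are made up):

```shell
# Sketch of the shell-only character count (sample text is illustrative).
printf 'abca\nbc\n' > /tmp/sample.gt.txt

# One character per line, then count duplicates, then sort by
# frequency, descending.
grep -o . /tmp/sample.gt.txt | sort | uniq -c | sort -rn
```

Each output line has the count followed by the character, the same shape as the Python script's report minus the codepoint names.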

zdenop commented 1 year ago

Here is my Python solution without the need for extra tools. I did not implement reading from stdin as I do not see its usage in "make training".

Usage: python3 count_chars.py data/foo/all-gt

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys
import unicodedata
from collections import Counter

def show_char_frequency(string):
    # Ignore spaces; report the remaining characters, most frequent first.
    string = string.replace(" ", "")
    for char, count in Counter(string).most_common():
        # Some codepoints (controls etc.) have no Unicode name, so pass a
        # default to avoid a ValueError from unicodedata.name().
        print(f"{char}\t{count}\t{unicodedata.name(char, 'UNKNOWN')}")

def read_file(filename):
    with open(filename, encoding="utf-8", mode="rt") as fd:
        text_lines = fd.read().strip().split("\n")
    return " ".join(text_lines)

def main():
    if len(sys.argv) < 2:
        print(f"USAGE: {sys.argv[0]} <txt_file>")
        return 1

    filename = sys.argv[1]
    string = read_file(filename)
    show_char_frequency(string)
    return 0

if __name__ == "__main__":
    sys.exit(main())
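For reference, the stdin-reading variant that was left out could be sketched like this (the UNKNOWN fallback for unnamed codepoints is an assumption):

```python
#!/usr/bin/env python3
# Hypothetical sketch: the same frequency report as a stdin filter,
# so it can sit at the end of a shell pipeline.
import sys
import unicodedata
from collections import Counter

def report(text):
    # Ignore spaces; print char, count, and Unicode name, most frequent first.
    for char, count in Counter(text.replace(" ", "")).most_common():
        # unicodedata.name() raises ValueError for unnamed codepoints,
        # so supply a default.
        print(f"{char}\t{count}\t{unicodedata.name(char, 'UNKNOWN')}")

if __name__ == "__main__":
    report(sys.stdin.read().strip())
```

Usage would then be e.g. cat data/foo/all-gt | python3 count_chars.py.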
stweil commented 5 months ago

Commit 29d394b7253c9f933a7fdf57f553d305576f9a5d merged the modifications (based on the original code) which were made in this pull request.