tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

charfreq: use PCRE to operate on grapheme clusters instead of codepoints #375

Closed: bertsky closed this 4 months ago

stweil commented 4 months ago

This pull request was merged too fast, because the new code does not work at all on macOS:

% LC_ALL=C.UTF-8 grep -P -o /tmp/test2        
grep: invalid option -- P
usage: grep [-abcdDEFGHhIiJLlMmnOopqRSsUVvwXxZz] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]
% grep --version
grep (BSD grep, GNU compatible) 2.6.0-FreeBSD

I'll fix that by using Python code, which is more portable.

bertsky commented 4 months ago

@stweil then please just make it conditional on the OS (as with dos2unix), and have macOS fall back to plain regexes on codepoints instead of grapheme clusters.

(Of course, this is doable in Python, but the point of that target is to be as simple and fast as possible, without dependencies.)
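
For illustration only, the codepoint versus grapheme-cluster distinction discussed above could be sketched in dependency-free Python roughly as follows. This is not the code from the pull request or from any later fix. The names `approx_graphemes` and `charfreq` are hypothetical, and the clustering is a deliberate approximation: a base character plus any following combining marks, rather than full Unicode (UAX #29) segmentation, so it will not handle cases such as ZWJ emoji sequences.

```python
import unicodedata
from collections import Counter


def approx_graphemes(text):
    """Yield approximate grapheme clusters (hypothetical helper).

    A cluster here is one character plus any directly following
    combining marks (Unicode general category M*). This is an
    assumption-laden simplification, not full UAX #29 segmentation.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            # Combining mark: attach it to the current cluster.
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


def charfreq(text, graphemes=True):
    """Count non-whitespace units: clusters by default, else codepoints."""
    units = approx_graphemes(text) if graphemes else iter(text)
    return Counter(u for u in units if not u.isspace())


if __name__ == "__main__":
    sample = "e\u0301e"  # 'e' + COMBINING ACUTE ACCENT, then a plain 'e'
    print(charfreq(sample))                   # cluster-based counts
    print(charfreq(sample, graphemes=False))  # codepoint-based counts
```

With `graphemes=False` the combining accent is counted as its own entry, which is the codepoint behaviour proposed above as the macOS fallback. If an extra dependency were acceptable, the third-party `regex` module supports `\X` for proper grapheme-cluster matching.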