Concatenate with newline for all-gt , remove sort uniq

Shreeshrii commented 3 years ago

See https://github.com/tesseract-ocr/tesstrain/issues/172#issuecomment-655246018

With a large amount of training data, the current implementation sometimes hangs.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Shreeshrii commented 3 years ago

Please merge this.

Shreeshrii commented 3 years ago

@stweil Thanks for pointing out the performance issue.

The problem I was trying to solve involves ground-truth text files that lack a trailing \n. With such files, cat produced one really huge single line, and sort | uniq would hang.

Would using paste work?


(base) ubuntu@tesseract-ocr-1:~/tesstrain-San$ time (find data/San-ground-truth/test -name "*.gt.txt"|xargs cat | sort | uniq >/tmp/old)

real    0m0.780s
user    0m0.688s
sys     0m0.261s
(base) ubuntu@tesseract-ocr-1:~/tesstrain-San$ time (find data/San-ground-truth/test -name "*.gt.txt"|xargs -I{} sh -c "cat {}; echo ''" >/tmp/new)

real    0m15.562s
user    0m11.121s
sys     0m5.300s
(base) ubuntu@tesseract-ocr-1:~/tesstrain-San$ time (find data/San-ground-truth/test -name "*.gt.txt"| xargs paste -s -d \n > /tmp/paste)

real    0m0.224s
user    0m0.071s
sys     0m0.189s
(base) ubuntu@tesseract-ocr-1:~/tesstrain-San$ find data/San-ground-truth/test -name "*.gt.txt"|wc -l
5982
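
To make the difference between cat and paste -s concrete, here is a minimal sketch. It assumes a POSIX shell and uses two throwaway files (a.gt.txt and b.gt.txt are hypothetical names, not part of the dataset above), each holding a single line without a trailing newline.

```shell
#!/bin/sh
# Hypothetical sample files: one line each, no trailing newline.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf 'first line' > "$tmpdir/a.gt.txt"
printf 'second line' > "$tmpdir/b.gt.txt"

# cat glues both files into one unterminated line.
cat "$tmpdir/a.gt.txt" "$tmpdir/b.gt.txt" > "$tmpdir/cat.out"

# paste -s serializes each file and ends each file's output with a newline.
paste -s "$tmpdir/a.gt.txt" "$tmpdir/b.gt.txt" > "$tmpdir/paste.out"

# Count newline characters in each result.
cat_lines=$(wc -l < "$tmpdir/cat.out" | tr -d ' ')
paste_lines=$(wc -l < "$tmpdir/paste.out" | tr -d ' ')
echo "cat: $cat_lines newline(s), paste -s: $paste_lines newline(s)"
```

With these two files, the cat output contains no newline at all, while the paste -s output has one line per file. Note that paste -s also joins the lines inside a multi-line file with its delimiter (a tab by default), which should not matter for single-line ground truth.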

Shreeshrii commented 3 years ago

awk may be a better option.


find data/San-ground-truth/train -name "*.gt.txt"|wc -l
177605

time (find data/San-ground-truth/train -name "*.gt.txt"| xargs awk 'FNR==1{print ""}1' > /tmp/awk)

real    0m6.053s
user    0m2.374s
sys     0m6.745s

time (find data/San-ground-truth/train -name "*.gt.txt"|xargs cat | sort | uniq >/tmp/old)

real    0m20.928s
user    0m19.759s
sys     0m7.948s
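
For readers unfamiliar with the awk program used above: FNR==1{print ""} prints an empty line at the start of every input file, and the trailing 1 prints every input line. Because awk ends each printed record with a newline, files without a trailing newline no longer run together. A small sketch with hypothetical sample files follows; the plain awk 1 variant is not from this thread and is shown only for comparison.

```shell
#!/bin/sh
# Hypothetical sample files: the first lacks a trailing newline, the second has one.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf 'first line' > "$tmpdir/a.gt.txt"
printf 'second line\n' > "$tmpdir/b.gt.txt"

# Variant from the thread: an empty line before each file, then every line.
with_blank=$(awk 'FNR==1{print ""}1' "$tmpdir/a.gt.txt" "$tmpdir/b.gt.txt" | wc -l | tr -d ' ')

# Comparison variant (assumption, not from the thread): only terminate every line.
plain=$(awk '1' "$tmpdir/a.gt.txt" "$tmpdir/b.gt.txt" | wc -l | tr -d ' ')

echo "with blank lines: $with_blank, plain: $plain"
```

The first variant yields four lines for two single-line files (two of them empty), the second yields two.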

stweil commented 3 years ago

Would a simple find data/San-ground-truth/train -name "*.gt.txt"|xargs cat >all-gt work for you, too? Then I'd prefer that as an intermediate solution.

It should not be necessary to avoid long lines, as they don't matter for unicharset_extractor. I had added sort and uniq to reduce the size of the resulting all-gt file.

My ideal solution would be a Python script which processes the ground truth text lines, extracts all characters and either passes a sorted list of characters to unicharset_extractor or directly creates the unicharset file. That script could optionally also count the frequency of the different characters.
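
The character inventory described above could look roughly like the following. This is a shell stand-in for the proposed Python script, not that script itself, and it stops at counting; it does not call unicharset_extractor. It assumes a grep that supports -o (GNU and BSD grep do) and a locale matching the file encoding, so that . matches whole characters rather than bytes. The sample files are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample ground truth; the second file lacks a trailing newline.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf 'abba\n' > "$tmpdir/a.gt.txt"
printf 'cab' > "$tmpdir/b.gt.txt"

# Split the text into one character per line, count each character,
# and print "character<TAB>count" sorted by character.
char_counts=$(cat "$tmpdir"/*.gt.txt \
    | grep -o . \
    | awk '{ count[$0]++ } END { for (ch in count) printf "%s\t%d\n", ch, count[ch] }' \
    | sort)

printf '%s\n' "$char_counts"
```

Since only individual characters are counted, it makes no difference here whether the input files end with a newline.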

stweil commented 3 years ago

find data/San-ground-truth/train -name "*.gt.txt"|xargs paste -s >all-gt also works well for me. It creates separate lines without adding unneeded line feeds, so the resulting all-gt is smaller than with awk.

Shreeshrii commented 3 years ago

@stweil OP's report is https://github.com/tesseract-ocr/tesstrain/issues/172

zip file with gt for testing

Log file showed:

Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/san/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 225 from file data/san/san9_test.lstm-unicharset
Loaded unicharset of size 3 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.
unicharset of size 3 from file data/san9_test/my.unicharset

The byte order mark (U+FEFF, BOM) was part of many of the gt lines.

When single-line ground-truth files are concatenated, they become one huge line, so an error in even one file makes the unicharset generation fail.

His dataset has gt.txt files that each contain a single line with no trailing linefeed. When the transcription ends with a newline, the problem does not arise.

So, I would prefer a solution that adds the newlines, because I think that solves the problem. But there might be a better way.

Please test with his dataset and recommend the way forward.
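
To check a dataset for the two conditions mentioned above (a leading BOM and a missing trailing newline), something like the following sketch might help. The function names and sample files are hypothetical; it assumes head -c is available, which is common but not required by POSIX.

```shell
#!/bin/sh
# Succeeds if the file is non-empty and its last byte is not a newline.
lacks_final_newline() {
    [ -s "$1" ] && [ "$(tail -c 1 "$1" | wc -l | tr -d ' ')" -eq 0 ]
}

# Succeeds if the file starts with the UTF-8 byte order mark (EF BB BF).
starts_with_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Hypothetical sample files: one with a BOM and no final newline, one clean.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '\357\273\277text' > "$tmpdir/bom.gt.txt"
printf 'text\n' > "$tmpdir/clean.gt.txt"

# Collect one "name:finding" entry per problem found.
report=""
for f in "$tmpdir"/*.gt.txt; do
    name=$(basename "$f")
    if starts_with_bom "$f"; then report="$report$name:bom "; fi
    if lacks_final_newline "$f"; then report="$report$name:no-newline "; fi
done
echo "$report"
```

For a real dataset, the loop would run over the files found under the ground-truth directory instead of the two samples.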

stweil commented 3 years ago

My favourite solution for now is find data/San-ground-truth/train -name "*.gt.txt"|xargs paste -s >all-gt, see my comment above.

tesseract-ocr / tesstrain

Concatenate with newline for all-gt , remove sort uniq #215