qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Support comparing line GT directories with line OCR directories #64

Closed mikegerber closed 2 years ago

mikegerber commented 2 years ago

In #62, @stweil's original problem was, as I understand it, to compare a directory of line GT text files with a directory of line OCR text files. For now, I've created fake test data (fake-line-gt.zip) to implement this. It looks like this:

% ls *
gt:
line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt

some-ocr:
line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt

A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics and differences over all of the lines. A first idea for the CLI:

dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt

I'm not sure if this will be the final CLI, but it's what seems necessary at first glance.
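The pairing step described above could be sketched roughly as follows. This is only an illustration, not the dinglehopper implementation: the function name find_pairs and its parameters are made up, and it assumes that a GT file and its OCR counterpart share the same name once their respective suffixes are removed.

```python
import os


def find_pairs(gt_dir, gt_suffix, ocr_dir, ocr_suffix):
    """Yield (gt_path, ocr_path) for files that match after removing the suffixes.

    Hypothetical helper for illustration; GT files without an OCR
    counterpart are silently skipped here.
    """
    for name in sorted(os.listdir(gt_dir)):
        if not name.endswith(gt_suffix):
            continue
        # Slice by length so that an empty suffix also works.
        stem = name[: len(name) - len(gt_suffix)]
        ocr_path = os.path.join(ocr_dir, stem + ocr_suffix)
        if os.path.exists(ocr_path):
            yield os.path.join(gt_dir, name), ocr_path
```

With the fake test data, find_pairs("gt", ".gt.txt", "some-ocr", ".some-ocr.txt") would pair line001.gt.txt with line001.some-ocr.txt, and so on.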

stweil commented 2 years ago

What about an even simpler interface:

dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]

The existing dinglehopper could be extended to accept directory names for its GT and OCR argument and then either strip all extensions when matching ground truth and ocr lines by default or use new optional --gt-suffix and --ocr-suffix options.

mikegerber commented 2 years ago

> What about an even simpler interface:
>
> dinglehopper [OPTIONS] GTDIR OCRDIR [REPORT_PREFIX]
>
> The existing dinglehopper could be extended to accept directory names for its GT and OCR argument

For now and until the interface is finalized I'd like to keep the CLI interface separate, it will share the code anyway.

> and then either strip all extensions when matching ground truth and ocr lines by default or use new optional --gt-suffix and --ocr-suffix options.

For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

But I think I'll start implementing this, CLI details can still be refined later.
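To illustrate the concern about dots in the common prefix, here is a small sketch. The file names and both helper functions are hypothetical; they only show how stripping everything after the first dot can collapse distinct lines onto the same key, while an explicit suffix keeps them apart.

```python
def strip_all_extensions(name):
    """Drop everything from the first dot onwards."""
    return name.split(".", 1)[0]


def strip_suffix(name, suffix):
    """Remove an explicit suffix; raise if the name does not end with it."""
    if not name.endswith(suffix):
        raise ValueError(f"{name!r} does not end with {suffix!r}")
    return name[: len(name) - len(suffix)]


# Hypothetical file names whose common prefix contains a dot.
names = ["page1.line001.gt.txt", "page1.line002.gt.txt"]
print([strip_all_extensions(n) for n in names])     # both collapse to the same key
print([strip_suffix(n, ".gt.txt") for n in names])  # keys stay distinct
```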

mikegerber commented 2 years ago

> For the stripping of all extensions to work we would need to assume that the common prefix for a pair does not contain a dot, and the explicit suffix options seemed saner.

They will default to something useful: the longest common suffix, i.e.

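import itertools

def all_equal(iterable):
    # True if all elements are equal (also true for an empty iterable)
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

def common_prefix(its):
    # Longest common prefix of the given iterables, as a list of elements
    return [p[0] for p in itertools.takewhile(all_equal, zip(*its))]

def common_suffix(its):
    # Common suffix = common prefix of the reversed iterables, reversed back
    return reversed(common_prefix(reversed(it) for it in its))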

#print("".join(common_prefix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))
print("".join(common_suffix(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"])))

(gives .gt.txt)
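For plain strings, roughly the same result can be obtained with the standard library, as a sketch: os.path.commonprefix compares its arguments character by character, so applying it to the reversed names and reversing the result gives the common suffix. The function name common_suffix_str is made up for this illustration.

```python
import os


def common_suffix_str(names):
    """Longest common suffix of a list of strings ('' for an empty list)."""
    reversed_names = [name[::-1] for name in names]
    # commonprefix works character-wise, not on path components.
    return os.path.commonprefix(reversed_names)[::-1]


print(common_suffix_str(["line001.gt.txt", "line02.gt.txt", "line3.gt.txt"]))
```

Note that a longest common suffix is not necessarily a dotted extension: for ["line001.gt.txt", "line011.gt.txt"] it is "1.gt.txt". Whether that matters depends on how the remaining stems are matched.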

mikegerber commented 2 years ago

dinglehopper-line-dirs gt some-ocr from the feat/compare-line-texts branch now compares the line texts from the gt and some-ocr directories. It auto-detects the file suffixes. It's still WIP, but only WER and word differences are missing.

@stweil Could you test if this works for you?

image

mikegerber commented 2 years ago

The lines also line up perfectly, because each pair is put into its own <div class="row">!
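A minimal sketch of the one-row-per-pair idea, assuming a simple two-column layout. Only the row class comes from the comment above; the col class and the function name render_rows are assumptions for illustration, and the real report template may look quite different.

```python
from html import escape


def render_rows(pairs):
    """Render each (gt_text, ocr_text) pair as its own row, so lines stay aligned."""
    rows = []
    for gt_text, ocr_text in pairs:
        rows.append(
            '<div class="row">'
            f'<div class="col">{escape(gt_text)}</div>'
            f'<div class="col">{escape(ocr_text)}</div>'
            "</div>"
        )
    return "\n".join(rows)
```

Because each pair lives in its own row, a long line on one side cannot shift the lines that follow it on the other side.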

stweil commented 2 years ago

My first test fails:

dinglehopper-line-dirs gt frak2021_1.069 frak2021_1.069
free(): invalid next size (fast)
Aborted

stweil commented 2 years ago

The crash happens in rapidfuzz-1.9.0-py3.9-linux-x86_64.egg/rapidfuzz/cpp_string_metric.cpython-39-x86_64-linux-gnu.so.

stweil commented 2 years ago

@maxbachmann, I now tried to debug the RapidFuzz code, but pip install . fails:

 src/cpp_common.hpp:4:10: fatal error: rapidfuzz/fuzz.hpp: No such file or directory

mikegerber commented 2 years ago

I can't reproduce with Python 3.9 and rapidfuzz-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl. Hmm. Could we have a look at the data (or a portion of it) that triggers this?

maxbachmann commented 2 years ago

@stweil did you clone the repository including submodules?

git clone --recursive git@github.com:maxbachmann/RapidFuzz.git

As @mikegerber mentioned it would help if you could provide me with some data to reproduce this.

stweil commented 2 years ago

Minimal single line test case (found by bisecting the original large test set):

mkdir a b
echo "Vorjahres.“ (24 % gegenüber 42 %. Daneben auch Anſtiege um 11 %, 22 %, 34 %," >a/demo.txt
echo "PVorſahres.“ (24 0% gegenüber 42 95, Daneben auch Anſtiege um 11 % 22 % 34" >b/demo.txt
dinglehopper-line-dirs a b c

stweil commented 2 years ago

> did you clone the repository including submodules?

No, I did not. The installation works after git submodule update --init. I suggest adding that information to the instructions in the README.

maxbachmann commented 2 years ago

> Minimal single line test case (found by bisecting the original large test set):

Thanks, I could reproduce the crash. I will look into it.

maxbachmann commented 2 years ago

Ouch, I had a typo in the edit distance calculation: https://github.com/maxbachmann/rapidfuzz-cpp/commit/103674db0785f6c1c8e247abc850e48c75c22e1c I am honestly surprised that this never crashed on the input of a fuzz testing tool ...

I released a new version of RapidFuzz with the fix: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.1

mikegerber commented 2 years ago

Great that this bug is fixed. I've bumped the rapidfuzz dependency to >=1.9.1!

@stweil Could you try https://github.com/qurator-spk/dinglehopper/tree/feat/compare-line-texts again, after updating?

mikegerber commented 2 years ago

The feat/compare-line-texts branch now also computes WER and word differences. So, once it's tested, it's ready.
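For readers unfamiliar with the metric, WER can be sketched as the word-level edit distance divided by the number of words in the ground truth. The sketch below is a simplification and not dinglehopper's code: it splits on whitespace instead of using Unicode word segmentation and normalization, uses a plain pure-Python Levenshtein distance, and the handling of an empty ground truth is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def word_error_rate(gt_text, ocr_text):
    """Word-level edit distance divided by the number of GT words."""
    gt_words = gt_text.split()    # simplification: whitespace tokenization
    ocr_words = ocr_text.split()
    if not gt_words:
        # Assumed convention for an empty ground truth.
        return 0.0 if not ocr_words else float("inf")
    return edit_distance(gt_words, ocr_words) / len(gt_words)
```

For example, one substituted word in a four-word line gives a WER of 0.25.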

stweil commented 2 years ago

A new test with the latest code shows that the memory issue is fixed, but with the full test set I get a new error (an endless recursion in word_error_rate.py line 25, test data is available online):

$ dinglehopper-line-dirs a b c
Traceback (most recent call last):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/bin/dinglehopper-line-dirs", line 11, in <module>
    load_entry_point('dinglehopper==0.0.0', 'console_scripts', 'dinglehopper-line-dirs')()
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/click-8.0.3-py3.9.egg/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 138, in main
    process(gt, ocr, report_prefix, metrics=metrics)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/cli_line_dirs.py", line 67, in process
    l_wer, l_n_words = word_error_rate_n(gt_text, ocr_text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 76, in word_error_rate_n
    return word_error_rate_n(reference.text, compared.text)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/multimethod-1.3-py3.9.egg/multimethod.py", line 171, in __call__
    return self[tuple(map(self.get_type, args))](*args, **kwargs)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 68, in word_error_rate_n
    compared_seq = list(words_normalized(compared))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 43, in words
    for word in uniseg.wordbreak.words(s):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/breaking.py", line 59, in break_units
    for j, bk in enumerate(breakables):
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 185, in word_breakables
    primitive_boundaries = list(_preprocess_boundaries(s))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 153, in _preprocess_boundaries
    prop = word_break(c)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/qurator/dinglehopper/word_error_rate.py", line 25, in new_word_break
    return old_word_break(c, index)
  [Previous line repeated 975 more times]
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/wordbreak.py", line 129, in word_break
    return _word_break(code_point(c, index))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/db.py", line 75, in word_break
    (ord(u),))
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 127, in ord
    return ord_impl(c, index)
  File "/home/stweil/src/github/tesseract-ocr/tesstrain/data/gruenderfinetune25-ground-truth/ocr/venv/lib/python3.9/site-packages/uniseg-0.7.1.post2-py3.9.egg/uniseg/codepoint.py", line 75, in ord_impl
    return _ord(c if index is None else c[index])
RecursionError: maximum recursion depth exceeded while calling a Python object

stweil commented 2 years ago

With commits cb2be96179543dba6ac069c92b842c1f56c198ec and 5b394649a7777f95932ab74c1e26743e8e180849 reverted (= no WER), my full data set is processed in 5 seconds (no crash).
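The repeated new_word_break frames in the traceback above are the kind of pattern one would see if a function is monkey-patched again on every call, so that each new wrapper calls the previous wrapper. The following is a hypothetical, self-contained illustration of that general failure mode and of a patch-once guard. It uses a stand-in object (fake_lib) instead of uniseg, and it is not the actual dinglehopper code or the actual fix.

```python
import sys
import types

# Stand-in for a third-party module with a function we want to patch.
# (Hypothetical; this is not uniseg.)
fake_lib = types.SimpleNamespace(word_break=lambda c: "Other")


def naive_words(s):
    """Re-patches fake_lib.word_break on every call (the problematic pattern)."""
    old_word_break = fake_lib.word_break

    def new_word_break(c):
        if 0xE000 <= ord(c) <= 0xF8FF:  # Private Use Area
            return "ALetter"
        return old_word_break(c)  # after N calls, the chain is N wrappers deep

    fake_lib.word_break = new_word_break
    return [fake_lib.word_break(c) for c in s]


_patched = False


def patch_once():
    """Install the wrapper a single time, no matter how often it is called."""
    global _patched
    if _patched:
        return
    old_word_break = fake_lib.word_break

    def new_word_break(c):
        if 0xE000 <= ord(c) <= 0xF8FF:  # Private Use Area
            return "ALetter"
        return old_word_break(c)

    fake_lib.word_break = new_word_break
    _patched = True


def safe_words(s):
    patch_once()
    return [fake_lib.word_break(c) for c in s]


def demo():
    """Return True if the naive variant hits RecursionError and the guarded one does not."""
    original = fake_lib.word_break
    calls = sys.getrecursionlimit() + 100
    try:
        for _ in range(calls):
            naive_words("a")
        naive_failed = False
    except RecursionError:
        naive_failed = True
    fake_lib.word_break = original  # undo the stacked wrappers
    for _ in range(calls):
        safe_words("a")
    return naive_failed
```

This would also be consistent with the error only showing up on the full data set: with few lines, the wrapper chain never gets deep enough to hit the recursion limit.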

mikegerber commented 2 years ago

Great that half of it is working now! Unfortunately I'm on vacation now, so triaging the WER problem will have to wait until January. Thanks for the test data, this will help greatly!

mikegerber commented 2 years ago

I've found the problem and fixed it in 8a3f5e48c2eac3e6d67f84e87409b8c69a1e150b! The feature is now merged.

% /usr/bin/time -f'%e %M' dinglehopper-line-dirs a b
2.19 54028

~ 2 seconds and max. 55MB memory for your example data! 🍾

mikegerber commented 2 years ago

@stweil Let me know if that's working for you! I'll close this issue, feel free to re-open or open another issue if something's still wrong.

mikegerber commented 2 years ago

@stweil Did you run the latest version on your full data? Did it work?