ulb-sachsen-anhalt / digital-eval

Evaluate data from mass digitalization workflows
MIT License

Get error "contains no TextLine/Coords" (but these elements are available) #1

Closed stefanCCS closed 2 years ago

stefanCCS commented 2 years ago

Hi, I have tried out this nice-looking tool, but unfortunately I get an error like this:

digital-eval -v -ref gt/ ocr/
[INFO ] from "5" filtered "0" candidates missing groundtruth
[DEBUG] use 3 executors (4) to create evaluation data
[WARN ] 'gt/domain/subdomain/page01.gt.xml contains no TextLine/Coords!'
[WARN ] 'gt/domain/subdomain/page03.gt.xml contains no TextLine/Coords!'
[WARN ] 'gt/domain/subdomain/page02.gt.xml contains no TextLine/Coords!'
[WARN ] 'gt/domain/subdomain/page04.gt.xml contains no TextLine/Coords!'
[WARN ] 'gt/domain/subdomain/page05.gt.xml contains no TextLine/Coords!'
[DEBUG] processed 5, omitted 5 empty results
Traceback (most recent call last):
  File "/home/testadmin/digital-eval/bin/digital-eval", line 8, in <module>
    sys.exit(main())
  File "/home/testadmin/digital-eval/lib/python3.8/site-packages/digital_eval/cli.py", line 101, in main
    _main(path_candidates, path_ref, verbosity, xtra)
  File "/home/testadmin/digital-eval/lib/python3.8/site-packages/digital_eval/cli.py", line 53, in _main
    evaluator.aggregate(by_type=True)
  File "/home/testadmin/digital-eval/lib/python3.8/site-packages/digital_eval/evaluation.py", line 799, in aggregate
    self._check_aggregate_preconditions()
  File "/home/testadmin/digital-eval/lib/python3.8/site-packages/digital_eval/evaluation.py", line 840, in _check_aggregate_preconditions
    raise RuntimeError("missing evaluation data")
RuntimeError: missing evaluation data

Data I have used: digital-eval-test.zip

Can you please tell me what I have done wrong?

M3ssman commented 2 years ago

Thanks for trying it out!

Your data looks quite reasonable, it's just a format we've never come across before. The geometric frame filter uses rectangle information on word level; if that is missing because the groundtruth contains no word-level data, it runs into trouble.

The PAGE data we have used for evaluation so far came from corrected Transkribus PAGE 2013 XML (and the tool was therefore very strongly coupled to this format). Of course, this limits the usability in other contexts like yours. I do have a solution for this in mind.

Btw, what tool was used to create the groundtruth?

Currently the OCR evaluation relies on word-level groundtruth data, which causes trouble when this information is missing, and the provided test data lacks word-level data. I'm working on this right now. The first step is to correct the error message above, since it is in fact raised because no words were found. The second step is to apply the geometric filter on line level when words are missing, since a lot of groundtruth datasets might only be available at line level.
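
To illustrate the planned line-level fallback, here is a minimal sketch (not the actual digital-eval code) of collecting frame polygons from PAGE 2013 XML: it prefers word-level Coords and falls back to TextLine Coords when no Word elements are annotated. It assumes lxml and the standard PAGE 2013 namespace; names and structure beyond that are illustrative.

```python
# Minimal sketch only, not the digital-eval implementation.
from lxml import etree

PAGE_2013 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_2013}

def frame_polygons(path: str):
    """Collect coordinate polygons, preferring word level, else line level."""
    root = etree.parse(path).getroot()
    # preferred: rectangle/polygon information on word level
    elements = root.findall(".//pc:Word/pc:Coords", NS)
    if not elements:
        # fallback: groundtruth annotated on line level only
        elements = root.findall(".//pc:TextLine/pc:Coords", NS)
    polygons = []
    for coords in elements:
        points = coords.get("points", "")
        # PAGE encodes points as "x1,y1 x2,y2 ..."
        polygons.append([tuple(map(int, p.split(","))) for p in points.split()])
    return polygons
```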

stefanCCS commented 2 years ago

Thanks for analyzing and improving. The data we have was made with a workflow like this:

M3ssman commented 2 years ago

Interesting!

It looks like all files in the test sample data are identical?

I've just made an update to the project's repository to enable line-level evaluation, so please, if you don't mind, update your local clone and make a fresh pip install. Afterwards run digital-eval <root-path-ocr>/domain/ -ref <root-path-gt>/domain/. You can also add the -v flag to get more insights.

Please note that for the information retrieval metrics the language mappings for language-specific stopwords are somewhat inexact. A convention is still missing for where and how this information should be annotated, so I just guess German, English, Arabic and now, from your data, also Russian.
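
For context, one possible shape of such a language-to-stopwords mapping, purely illustrative: NLTK's stopword corpus is an assumption here and not necessarily what digital-eval actually uses.

```python
# Hypothetical sketch: union of stopwords over the guessed languages,
# used to filter tokens before bag-of-words style IR metrics.
from nltk.corpus import stopwords   # requires: nltk.download("stopwords")

GUESSED_LANGUAGES = ["german", "english", "arabic", "russian"]

def build_stopword_set(languages=GUESSED_LANGUAGES) -> set[str]:
    """Combine stopword lists of all guessed languages into one set."""
    words: set[str] = set()
    for lang in languages:
        words.update(stopwords.words(lang))
    return words

def filter_tokens(tokens: list[str], stop: set[str]) -> list[str]:
    """Drop tokens that appear in the combined stopword set."""
    return [t for t in tokens if t.lower() not in stop]
```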

M3ssman commented 2 years ago

Please update your local clone, remove your previous installation with pip uninstall digital-eval and re-install (the version should print out 1.1.0).

Besides the improvement with the stopwords for IR, you can gain enhanced insights by adding the -vv flag, which will show in great detail what data has been used for each metric.

stefanCCS commented 2 years ago

Many thanks for updating. I tried it out again with the same example (and yes, the files page01 to page05 are identical; I just wanted to try out the tool, but there is a difference between GT and OCR).

Now I get this summary:

(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$ digital-eval  -ref gt/ ocr/
[INFO ] from "5" filtered "0" candidates missing groundtruth
[INFO ] Evaluation Summary for "ocr" vs. "gt (2022-08-03)
(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$

Of course I get much more output if I add -v or even -vv. But still, I was under the impression that this summary should tell me something ... --> please clarify.

M3ssman commented 2 years ago

I guess it's because you called it from the very top, but it needs a start domain. Please try digital-eval -ref gt/domain ocr/domain.

The summary does not search from the root level, it needs a starting domain as path. It is structured this way because our data sets all reside, say, under a shared volume like /data/ocr/groundtruth/<domain> (evaluation candidates are usually stored like /data/ocr/media/tiff or whatever input format).
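
For illustration, a hypothetical sketch of how the per-domain aggregation keys in the summary could be derived from each file path relative to the start domain; the helper name and logic are illustrative, not the actual digital-eval code.

```python
# Hypothetical illustration: derive aggregation keys (domain prefixes)
# from a file path below the chosen start domain.
from pathlib import Path

def aggregation_keys(start_domain: str, file_path: str) -> list[str]:
    """Return every domain prefix between the start domain and the file."""
    rel = Path(file_path).relative_to(Path(start_domain).parent)
    parts = rel.parts[:-1]                      # drop the file name itself
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

# aggregation_keys("gt/domain", "gt/domain/subdomain/page01.gt.xml")
# -> ["domain", "domain/subdomain"]
```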

In the case of OCR-D we've structured our data like this:

<root-path>/odem/
├── ger/
└── lat/

and called it with digital-eval /data/ocr/media/jpeg/odem -ref /data/ocr/ocr/groundtruth

Or for newspapers with years like this:

<root-path>/zd1/
└── <PPN-newspaper01>
    └── Year01-newspaper01
        └── sample-01.gt.xml

To make this assumption explicit, do you suggest alerting something like 'evaluation root domains "gt" and "ocr" mismatch'?

stefanCCS commented 2 years ago

Yes, that's it - now I get this result:

(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$ digital-eval -ref gt/domain/ ocr/domain/

[INFO ] from "5" filtered "0" candidates missing groundtruth
[INFO ] Evaluation Summary for "ocr/domain" vs. "gt/domain (2022-08-03)
[INFO ] "CCA@domain"    ∅: 97.72        5 items, 2415 refs, std: 0.00, median: 97.72
[INFO ] "CCA@domain/subdomain"  ∅: 97.72        5 items, 2415 refs, std: 0.00, median: 97.72
[INFO ] "CLA@domain"    ∅: 97.19        5 items, 1960 refs, std: 0.00, median: 97.19
[INFO ] "CLA@domain/subdomain"  ∅: 97.19        5 items, 1960 refs, std: 0.00, median: 97.19
[INFO ] "WBoW@domain"   ∅: 95.59        5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WBoW@domain/subdomain" ∅: 95.59        5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WWA@domain"    ∅: 95.59        5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WWA@domain/subdomain"  ∅: 95.59        5 items, 340 refs, std: 0.00, median: 95.59

-> which leads to the main question: what do CCA, CLA, WBoW, and WWA mean? What are the "refs"? ...

Concerning your proposal about the "gt/ocr domain mismatch": yes, maybe a warning would be nice in this case.

M3ssman commented 2 years ago

These are abbreviations (CCA: character-based Character Accuracy, CLA: character-based Letter Accuracy, WWA: word-based Word Accuracy, ...), which - you pinpoint it exactly - should best be explained in the README(?).

Refs differ by context. For CCA it means the groundtruth reference consists of 2,414 chars; for WWA it refers to 340 words, as it does for WBoW (word-based: Bag of Words), and so on.
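
For readers wondering how the numbers relate: accuracy metrics of this kind are typically an edit distance normalized by the size of the reference, which is why "refs" counts reference characters (for CCA/CLA) or reference words (for WWA/WBoW). A rough, generic sketch follows; digital-eval's exact formulas may differ.

```python
# Generic illustration, not digital-eval's implementation: accuracy as
# edit distance normalized by the number of reference tokens ("refs").
from typing import Sequence

def edit_distance(ref: Sequence, cand: Sequence) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(cand) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, c in enumerate(cand, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != c)))  # substitution
        prev = curr
    return prev[-1]

def accuracy_percent(ref: Sequence, cand: Sequence) -> float:
    """Accuracy in percent, normalized by the reference length."""
    errors = edit_distance(ref, cand)
    return 100.0 * max(0, len(ref) - errors) / len(ref)

# On characters: accuracy_percent(gt_text, ocr_text)   -> CCA-style value
# On word lists: accuracy_percent(gt_words, ocr_words) -> WWA-style value
```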

M3ssman commented 2 years ago

I've updated the README regarding the metrics and statistics. Feel free to re-test and close the issue if we're done!

stefanCCS commented 2 years ago

The explanations provided in the README are pretty good. I will close this issue now. Many thanks for your support.