Thanks for trying it out!
Your data looks quite reasonable, it's just a format we've never come across before. To facilitate the geometric frame filter, the tool uses rectangle information on word level. If this is missing, because the groundtruth contains no word-level data, it runs into trouble.
The PAGE data we used for evaluation so far came from corrected Transkribus PAGE 2013-XML data (and was therefore very strongly coupled to this format). Of course, this limits the usability in other contexts, like yours. I do have a solution for this in mind.
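For context, the idea of such a word-level frame filter could look roughly like the following Python sketch. This is only an illustration under the assumption of PAGE 2013 Word/Coords/TextEquiv elements; it is not digital-eval's actual code, and the function names are made up:

```python
# Minimal sketch (not digital-eval's code): keep only words whose centroid
# falls inside a given frame rectangle, based on PAGE 2013 word coordinates.
from lxml import etree

PAGE_NS = {"pg": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def _centroid(points_attr):
    """Average of the polygon points given as 'x1,y1 x2,y2 ...'."""
    pairs = [tuple(map(int, p.split(","))) for p in points_attr.split()]
    xs, ys = zip(*pairs)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def words_inside_frame(page_xml_path, frame):
    """frame = (x_min, y_min, x_max, y_max); returns the text of matching words."""
    x_min, y_min, x_max, y_max = frame
    root = etree.parse(page_xml_path).getroot()
    kept = []
    for word in root.iterfind(".//pg:Word", PAGE_NS):
        coords = word.find("pg:Coords", PAGE_NS)
        text = word.find("pg:TextEquiv/pg:Unicode", PAGE_NS)
        if coords is None or text is None or text.text is None:
            continue  # no geometry or no transcription: nothing to filter on
        cx, cy = _centroid(coords.get("points"))
        if x_min <= cx <= x_max and y_min <= cy <= y_max:
            kept.append(text.text)
    return kept
```

Without any Word elements in the groundtruth, a filter of this kind has nothing to work with, which is exactly the trouble described above.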
Btw, what tool was used to create the groundtruth?
Currently the OCR evaluation relies on word-level groundtruth data, which causes trouble when this information is missing, and the provided test data lacks word-level data. I'm working on this right now. The first step is to correct the error message above, since it is in fact raised because no words were found. The second step is to be able to apply the geometric filter on line level when words are missing, since a lot of groundtruth datasets might only be available at line level.
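To make the intended fallback concrete, a minimal sketch (again my own illustration, not the actual implementation; element names taken from the PAGE 2013 schema) could select the evaluation level like this:

```python
# Sketch of the fallback described above: evaluate on Word level if the
# groundtruth provides words, otherwise fall back to TextLine level instead
# of failing with "no words found".
from lxml import etree

PAGE_NS = {"pg": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def groundtruth_units(page_xml_path):
    """Return (level, elements): Word elements if any exist, else TextLine."""
    root = etree.parse(page_xml_path).getroot()
    words = root.findall(".//pg:Word", PAGE_NS)
    if words:
        return "word", words
    return "line", root.findall(".//pg:TextLine", PAGE_NS)
```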
Thanks for analyzing and improving. The data we have was made with a workflow like this:
Interesting!
In the test sample data it looks like all files are identical?
I've just made an update to the project's repository to enable line-level evaluation, so if you don't mind, please update your local clone and make a fresh pip install. Afterwards run digital-eval <root-path-ocr>/domain/ -ref <root-path-gt>/domain/. You can also add the -v flag to get more insights.
Please note that for the information retrieval metrics the language mappings for language-specific stopwords are somewhat inexact. A convention for where and how this information is to be annotated is still missing, therefore I just guess German, English, Arabic and now, from your data, also Russian.
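To illustrate what this guessing amounts to, a simplistic sketch of mapping a language tag to a stopword set and dropping those tokens could look like the following. The tiny lists and the function name are placeholders, not what digital-eval actually ships:

```python
# Illustration only: drop stopwords for the guessed language before
# computing IR metrics. The lists below are placeholders.
PLACEHOLDER_STOPWORDS = {
    "ger": {"der", "die", "das", "und", "ein"},
    "eng": {"the", "a", "an", "and", "of"},
    "ara": {"في", "من", "على"},
    "rus": {"и", "в", "не", "на"},
}

def strip_stopwords(tokens, language="ger"):
    """Remove stopwords for the guessed language; unknown languages pass through."""
    stops = PLACEHOLDER_STOPWORDS.get(language, set())
    return [t for t in tokens if t.lower() not in stops]
```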
Please update your local clone, remove your previous installation with pip uninstall digital-eval and re-install (the version should print out as 1.1.0).
Besides the improvement with the stopwords for IR, you can gain enhanced insights by adding the -vv flag, which will show in great detail what data has been used for each metric.
Many thanks for updating - I tried it out again (with the same example; and yes, the files page01 to page05 are the same, I just want to try out the tool, but there is a difference between GT and OCR).
Now I get this summary:
(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$ digital-eval -ref gt/ ocr/
[INFO ] from "5" filtered "0" candidates missing groundtruth
[INFO ] Evaluation Summary for "ocr" vs. "gt (2022-08-03)
(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$
Of course I get much more output if I use -v or even -vv. But still, I was under the impression that this summary should tell me something ...
--> please clarify.
I guess it's because you called it from the very top, but it needs a start domain. Please try digital-eval -ref gt/domain ocr/domain.
The summary does not search from the root level, it needs a starting domain as path. It is structured this way because our datasets all reside, say, under a shared volume like /data/ocr/groundtruth/<domain> (evaluation candidates are usually stored like /data/ocr/media/tiff or whatever the input format is).
In case of OCR-D3 we've structured our data like this:
<root-path>/odem/
├── ger/
└── lat/
and called it with digital-eval /data/ocr/media/jpeg/odem -ref /data/ocr/ocr/groundtruth
Or for newspapers with years like this:
<root-path>/zd1/
└── <PPN-newspaper01>/
    └── Year01-newspaper01/
        └── sample-01.gt.xml
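If it helps to picture the pairing, here is a rough sketch of the assumption behind the "start domain": a candidate below <candidates-root>/<domain>/... is matched with the groundtruth file at the mirrored relative path below <reference-root>/<domain>/... The function name is purely illustrative, not digital-eval's API:

```python
from pathlib import Path

def find_reference(candidate: Path, candidates_root: Path, reference_root: Path):
    """Look for a groundtruth file with the same stem in the mirrored directory."""
    relative_dir = candidate.parent.relative_to(candidates_root)
    for reference in (reference_root / relative_dir).glob(candidate.stem + ".*"):
        return reference
    return None  # candidate ends up filtered as "missing groundtruth"
```

Called from the very top, the two roots do not line up, so no pairs (and therefore no metrics) are produced.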
To emphasize this assumption, do you suggest alerting with something like "evaluation root domains 'gt' and 'ocr' mismatch"?
Yes, that's it - now I get this result:
(digital-eval) testadmin@ubuntu-test:/tmp/digital-eval-test$ digital-eval -ref gt/domain/ ocr/domain/
[INFO ] from "5" filtered "0" candidates missing groundtruth
[INFO ] Evaluation Summary for "ocr/domain" vs. "gt/domain (2022-08-03)
[INFO ] "CCA@domain" ∅: 97.72 5 items, 2415 refs, std: 0.00, median: 97.72
[INFO ] "CCA@domain/subdomain" ∅: 97.72 5 items, 2415 refs, std: 0.00, median: 97.72
[INFO ] "CLA@domain" ∅: 97.19 5 items, 1960 refs, std: 0.00, median: 97.19
[INFO ] "CLA@domain/subdomain" ∅: 97.19 5 items, 1960 refs, std: 0.00, median: 97.19
[INFO ] "WBoW@domain" ∅: 95.59 5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WBoW@domain/subdomain" ∅: 95.59 5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WWA@domain" ∅: 95.59 5 items, 340 refs, std: 0.00, median: 95.59
[INFO ] "WWA@domain/subdomain" ∅: 95.59 5 items, 340 refs, std: 0.00, median: 95.59
-> which leads to the main question: what do CCA, CLA, WBoW and WWA mean? What are the "refs"? ...
Concerning your proposal "gt/ocr domain mismatch": Yes, maybe a "Warning" would be nice in this case.
These are abbreviations (CCA - character-based Character Accuracy, CLA - character-based Letter Accuracy, WWA - word-based Word Accuracy, ...), which - you totally pinpoint it - should best be explained in the README(?).
Refs differ in each context. For CCA it means the groundtruth reference consists of 2,415 chars; for WWA it refers to 340 words, as it does for WBoW (word-based Bag of Words), and so on.
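For readers who want a feeling for what these numbers express, here is a rough, self-contained sketch (my own illustration, not digital-eval's implementation): a character accuracy compares the candidate against all reference characters via edit distance, while a bag-of-words accuracy ignores word order and only counts which reference words also occur in the candidate. "refs" is the size of the reference used as denominator.

```python
# Rough illustration of the two metric families; assumes a non-empty reference.
from collections import Counter

def _levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def character_accuracy(groundtruth: str, candidate: str):
    """Accuracy in percent plus the number of reference chars (the "refs")."""
    refs = len(groundtruth)
    errors = _levenshtein(groundtruth, candidate)
    return max(0.0, 100.0 * (refs - errors) / refs), refs

def bag_of_words_accuracy(groundtruth: str, candidate: str):
    """Share of groundtruth words also found in the candidate, order ignored."""
    gt, cand = Counter(groundtruth.split()), Counter(candidate.split())
    refs = sum(gt.values())
    hits = sum((gt & cand).values())
    return 100.0 * hits / refs, refs
```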
I've updated the README regarding metrics and statistics. Feel free to re-test and close the issue if we're done!
The explanations provided in the README are pretty good. I will close this issue now. Many thanks for your support.
Hi, I have tried out this nice-looking tool. But unfortunately I get an error like this:
Data I have used: digital-eval-test.zip
Can you please tell me what I have done wrong?