qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0
59 stars 13 forks source link

Add new metrics BoC and BoW #60

Closed b2m closed 3 years ago

b2m commented 3 years ago

This pull requests adds 2 new metrics: Bag of Chars and Bag of Words to dinglehopper.

While introducing new metrics I also refactored several parts of dinglehopper to be able to handle several metrics without violating DRY.

Here is an overview on what has changed:

Note that these changes are not backward compatible, as the parameter metrics is no longer a flag and the json report format was changed. As dinglehopper does not have versions or releases (yet) I am not sure whether backward compatibility is an issue at this stage.

For future use a weights parameter already is enabled for boc and bow, but not for cer and wer. This parameter is not exposed at cli level (yet).

I may add another pull request converting cer and wer to be able to use weights also.

New Reports

Here is a screenshot of the new html report: report-simple

This is an extract from the new json report:

{
    "gt": "gt.txt",
    "ocr": "ocr.txt",
    "character_accuracy": {
        "weights": {
            "deletes": 1,
            "inserts": 1,
            "replacements": 1
        },
        "weighted_errors": 1,
        "reference_elements": 5,
        "compared_elements": 5,
        "accuracy": 0.8,
        "error_rate": 0.2
    },
    ...
}

A note on how BoC and BoW are calculated

  1. Imagine gt and ocr as sets.
  2. An element in gt and not in ocr is considered a delete.
  3. An element in ocr and not in gt is considered an insert.
    gt, ocr = set(gt_text), set(ocr_text)
    deletes = gt - ocr
    inserts = ocr - gt
  4. Replacements are only calculated when the cost of a replacement is lower than that of a combined insert and delete. For replacements we are taking arbitrary pairs from gt and ocr until one is exhaustet.

This approach can get extended to use bags (aka multiset or collections.Counter in Python) where each element also has a count.

mikegerber commented 3 years ago

@b2m I want to reopen this, but your repo was deleted?

kba commented 1 year ago

I know this PR is stopped (@b2m was that intentional?) but we would love to see an implementation of BoW in dinglehopper. Unfortunately, it is not as straightforward to agree on what we mean with "BoW", cf. discussion in https://github.com/OCR-D/spec/pull/225.

@bertsky did a quick review of this PR in that context:

bag_accuracy itself looks good, but it's used by bag_of_words_accuracy and bag_of_chars_accuracy with Weights(1, 0, 1) – i.e. only deletes, but no inserts (note that replacements will be zero). It's wrong to call this accuracy (it lends itself to recall).

Also, MetricResult.error_rate seems to be wrong, because it uses the length of the reference as denominator (instead of length of reference plus length of compared). So accuracy would also be wrong.

Together these two mistakes compensate each other, so that implementation calculates a valid recall (complement of false negative rate).

I think most of the functionality is here to implement BoW recall and accuracy, we just need to agree on what we mean by which and adapt. And since @b2m has already made the effort, including tests and code organization, it is a good starting point. Maybe we can discuss this at some point soon @mikegerber @bertsky @mweidling @b2m?

b2m commented 1 year ago

I know this PR is stopped (@b2m was that intentional?)

As I personally communicated with @mikegerber in 2021 I closed the PR as I no longer had the time to maintain it (or the other PRs).

If you find the code reusable feel free to open the PR again and modify it to your needs.

I agree with @bertsky that in hindsight the naming is quite unfortunate.

bertsky commented 1 year ago

@b2m one always has some kind of pursuit curve with these things. Your implementation is the cleanest I have seen so far, we'll gladly refactor and update – many thanks!

mikegerber commented 1 year ago

@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.

I agree mostly with everything and like to get this in. The HTML report might need some work and discussion.

b2m commented 1 year ago

@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.

Pull requests are just branches in git, so you can check them out separately.

Assuming that the remote https://github.com/qurator-spk/dinglehopper/ is called origin and your new local branch should be boc-and-bow you can just use the following git command: git fetch origin pull/60/head:boc-and-bow

bertsky commented 1 year ago

I'll try both, rebasing the PR's branch to master, or merging master into it, and see where I fare better.