Closed b2m closed 3 years ago
@b2m I want to reopen this, but your repo was deleted?
I know this PR is stopped (@b2m was that intentional?) but we would love to see an implementation of BoW in dinglehopper. Unfortunately, it is not as straightforward to agree on what we mean with "BoW", cf. discussion in https://github.com/OCR-D/spec/pull/225.
@bertsky did a quick review of this PR in that context:
bag_accuracy
itself looks good, but it's used bybag_of_words_accuracy
andbag_of_chars_accuracy
withWeights(1, 0, 1)
– i.e. only deletes, but no inserts (note that replacements will be zero). It's wrong to call this accuracy (it lends itself to recall).Also,
MetricResult.error_rate
seems to be wrong, because it uses the length of the reference as denominator (instead of length ofreference
plus length ofcompared
). Soaccuracy
would also be wrong.Together these two mistakes compensate each other, so that implementation calculates a valid recall (complement of false negative rate).
I think most of the functionality is here to implement BoW recall and accuracy, we just need to agree on what we mean by which and adapt. And since @b2m has already made the effort, including tests and code organization, it is a good starting point. Maybe we can discuss this at some point soon @mikegerber @bertsky @mweidling @b2m?
I know this PR is stopped (@b2m was that intentional?)
As I personally communicated with @mikegerber in 2021 I closed the PR as I no longer had the time to maintain it (or the other PRs).
If you find the code reusable feel free to open the PR again and modify it to your needs.
I agree with @bertsky that in hindsight the naming is quite unfortunate.
@b2m one always has some kind of pursuit curve with these things. Your implementation is the cleanest I have seen so far, we'll gladly refactor and update – many thanks!
@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.
I agree mostly with everything and like to get this in. The HTML report might need some work and discussion.
@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.
Pull requests are just branches in git, so you can check them out separately.
Assuming that the remote https://github.com/qurator-spk/dinglehopper/ is called origin
and your new local branch should be boc-and-bow
you can just use the following git command: git fetch origin pull/60/head:boc-and-bow
I'll try both, rebasing the PR's branch to master, or merging master into it, and see where I fare better.
This pull requests adds 2 new metrics: Bag of Chars and Bag of Words to dinglehopper.
While introducing new metrics I also refactored several parts of dinglehopper to be able to handle several metrics without violating DRY.
Here is an overview on what has changed:
Note that these changes are not backward compatible, as the parameter
metrics
is no longer a flag and the json report format was changed. As dinglehopper does not have versions or releases (yet) I am not sure whether backward compatibility is an issue at this stage.For future use a weights parameter already is enabled for boc and bow, but not for cer and wer. This parameter is not exposed at cli level (yet).
I may add another pull request converting cer and wer to be able to use weights also.
New Reports
Here is a screenshot of the new html report:
This is an extract from the new json report:
A note on how BoC and BoW are calculated
This approach can get extended to use bags (aka multiset or collections.Counter in Python) where each element also has a count.