qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0
58 stars 12 forks source link

only call `words_normalized` once #72

Closed maxbachmann closed 7 months ago

maxbachmann commented 1 year ago

words_normalized should only be called once, since it is quite slow, which has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement: Before:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:15.89 real,   9.61 user,  7.14 sys,   92704 mmem

After:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:12.56 real,   7.88 user,  5.56 sys,   92836 mmem

@mikegerber's task list:

maxbachmann commented 1 year ago

I did now move the grapheme cluster calculation into ExtractedText, so it only has to be calculated once for each sequences. Before it was calculated twice for one of the sequences and three times for the other one. This further improves the runtime to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:08.66 real,   6.04 user,  3.51 sys,   96860 mmem

So with this data I achieve nearly a 50% speedup.

maxbachmann commented 1 year ago

I replaced uniseg with uniseg2. This is a fork of uniseg I just created, which is still Python only as well, but significantly improves the performance of all functions in uniseg. This improves the runtime of dinglehopper by another 50% and reduces the memory consumption by 7mb (instead of a 7.5mb sqlite table it only loads a less than 500kb of binary strings)

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:04.20 real,   4.06 user,  0.79 sys,   89472 mmem
maxbachmann commented 1 year ago

I replaced uniseg with uniseg2.

I am currently in contact with the uniseg author in order to get these improvements into the library.

maxbachmann commented 1 year ago

I opened a PR for the original library as well: https://bitbucket.org/emptypage/uniseg-py/pull-requests/1. Not sure when he will come around to review it though.

maxbachmann commented 1 year ago

I improved the editops implementation further and added a score_hint, since we already know the edit distance, which improves the runtime further to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:02.47 real,   2.67 user,  0.71 sys,   89060 mmem
maxbachmann commented 1 year ago

My performance improvements did now get added to uniseg, so it is not required to use the forked version anymore.

@mikegerber this is now ready for review

mikegerber commented 1 year ago

Sorry for not reacting earlier, I had some health issues!

maxbachmann commented 1 year ago

Sorry for not reacting earlier, I had some health issues!

No hurry :) I hope you're better now

maxbachmann commented 1 year ago

@mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

mikegerber commented 1 year ago

@mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

Yes, we had some issues with the namespaces of our packages. You can rebase if you want or I'll do it when I make time. Sorry I didn't merge it yet, I had some (presumably small) issues with it the last time I tried (tests failing). Still want it merged!

mikegerber commented 10 months ago

Heads up: I'm working on merging this again.

I'm first looking into fixing running the tests using this PR, then I'm rebasing.

mikegerber commented 10 months ago

Note to self: Should also move on to RapidFuzz 3 if possible.

mikegerber commented 10 months ago

I'm first looking into fixing running the tests using this PR, then I'm rebasing.

After cherry-picking some commits from master into this PR (to fix errors and warnings we already had fixed), only one test (test_cli_json_cer_is_infinity) fails now. Working on fixing/reviewing that one.

(Not rebased yet.)

mikegerber commented 10 months ago

Tests run now, but I want to review the edge cases again in the coming days. (error rate x count can be infinite or nan (if count is 0))

mikegerber commented 9 months ago
* [ ]  Review RapidFuzz dependency. The one we have may be need an update for this. (tests succeed with RapidFuzz >= 3 in a fresh venv)

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

mikegerber commented 9 months ago

drum rolls now really ready to merge, everything looks fine. I hope I can do it tomorro, and dive into "This branch has conflicts that must be resolved".

maxbachmann commented 9 months ago

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

mikegerber commented 9 months ago

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too. Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

@maxbachmann I'm being extra pedantic and check it because I've seen other software break because of a missing update of the requirements :)

mikegerber commented 9 months ago

Ah the uniseg dependendy needs a minimum version now, I'll add it.

mikegerber commented 9 months ago

Ah the uniseg dependendy needs a minimum version now, I'll add it.

We don't need it but @maxbachmann improved the performance there too and we want the shiny speed.

(I hope the tests run after I manage to rebase this, apparently they don't here because relevant Actions config is not in this branch yet... Relevant because I only test manually on the latest Py 3.11.)

mikegerber commented 7 months ago

FINALLY! It is merged! Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

🍾

mikegerber commented 7 months ago

Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

That was just a drained generator, I fixed it in c168155.

maxbachmann commented 7 months ago

Wow cool to finally see this in :partying_face: