qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Horrible failure with large documents #62

Closed mikegerber closed 1 year ago

mikegerber commented 2 years ago

@stweil reported in Gitter:

Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with an out-of-memory error: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

@cneud also submitted a large document (a newspaper page).

mikegerber commented 2 years ago

I've asked @stweil to submit the text, as I am curious why it's using more memory.

stweil commented 2 years ago

I'll add them here soon.

stweil commented 2 years ago

I used this command (use links to get texts and result):

dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630

stweil commented 2 years ago

Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

maxbachmann commented 2 years ago

@mikegerber I am not familiar with dinglehopper, but I assume the editops calculation requires quite a lot of memory. For long texts I currently create a Levenshtein matrix of len(s1) * len(s2) * 32 bits. Since the text files both have around 110k characters, this alone should use around 45 GB of memory.

The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This hasn't had high priority so far (mostly because nobody complained), but if that's the cause I could work on it.

An improved version could make use of bit-parallelism to improve the performance and only store bit vectors of the deltas between the matrix cells, based on https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.1245&rep=rep1&type=pdf. Performance-wise this would allow the implementation to calculate the matrix cells for 64 characters at once. Memory-wise it would only have to store the vertical positive delta vector, the horizontal positive delta vector and the diagonal zero delta vector, and would therefore only require around len(s1) * len(s2) * 3 bits to store the matrix. So in the example above only around 4.2 GB instead of 45 GB.

Additionally it is possible to ignore some parts of the matrix which are not relevant, since they cannot be on the optimal path. In the best case (two strings of mostly similar length, which is the case in this usage) this should allow the implementation to skip 25% of the matrix, which could further reduce the memory usage to 3.15 GB.
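To make those figures concrete, here is the back-of-the-envelope arithmetic behind them (illustrative only; the exact cell layout inside rapidfuzz may differ):

```python
# Illustrative arithmetic for the memory estimates above, not rapidfuzz internals.
# Both text files have roughly 110k characters; GiB = 1024**3 bytes.
n = m = 110_000

full_matrix = n * m * 4 / 1024**3          # 32-bit cells           -> ~45 GiB
delta_vectors = n * m * 3 / 8 / 1024**3    # 3 bits per cell        -> ~4.2 GiB
pruned = delta_vectors * 0.75              # skip ~25% of the cells -> ~3.2 GiB

print(f"{full_matrix:.1f} GiB, {delta_vectors:.1f} GiB, {pruned:.1f} GiB")
```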

mikegerber commented 2 years ago

The previous implementation used np.zeros((m + 1, n + 1), np.int), which as far as I know should require the same amount of memory. So I think this should not change the memory usage, but it could certainly be improved. This hasn't had high priority so far (mostly because nobody complained), but if that's the cause I could work on it.

I will have a look into this in the next days; I have a hunch that it takes twice as much memory because of a memory leak. Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.

Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance, I don't see any reason not to use it.

¹ or one of the alignments with the shortest possible distance, to be more precise

mikegerber commented 2 years ago

Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

I've opened #63 for this!

mikegerber commented 2 years ago

I used this command (use links to get texts and result):

dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt frak2021_0.905_1587027_9141630

Mirrored here:

maxbachmann commented 2 years ago

Before, I had reused the matrix for two runs (1. the distance aka CER calculation and 2. the character alignment) and now I believe I was careless and just let rapidfuzz calculate it twice because it is much faster. OTOH, rapidfuzz should free the memory when it's done with the processing.

Yes, rapidfuzz should free the matrix (and as far as I am aware it does). Note that the CER calculation does not require much memory, since it only has to store the last matrix row. Only the character alignment requires the whole matrix.
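To illustrate that point, the distance (and hence the CER) can be computed while keeping only a single matrix row. A minimal pure-Python sketch, assuming the usual CER definition of edit distance divided by GT length (rapidfuzz's real implementation is bit-parallel C++ and much faster):

```python
def cer(gt: str, ocr: str) -> float:
    """Character error rate via a single-row Levenshtein DP (O(len(ocr)) memory)."""
    # row[j] holds the distance between the processed prefix of gt and ocr[:j].
    row = list(range(len(ocr) + 1))
    for i, c_gt in enumerate(gt, 1):
        prev_diag, row[0] = row[0], i
        for j, c_ocr in enumerate(ocr, 1):
            cur = min(row[j] + 1,                   # deletion
                      row[j - 1] + 1,               # insertion
                      prev_diag + (c_gt != c_ocr))  # substitution / match
            prev_diag, row[j] = row[j], cur
    return row[-1] / len(gt)
```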

Your other suggestions sound promising! I'm going to have to read the paper. If the improved version still returns the¹ shortest alignment/distance, I don't see any reason not to use it.

I already use this implementation to calculate the edit distance, just not when retrieving the editops (to keep the initial implementation simple).

mikegerber commented 2 years ago

@stweil Do you know why there are duplicate lines in the texts? It could be another bug in dinglehopper's text extraction. Would it be possible to get the PAGE versions of GT and OCR to check? (If necessary, DM me on Gitter and send it privately if that's a concern.)

stweil commented 2 years ago

The duplicate lines are fine. The texts were produced from single line text files (not from PAGE / ALTO / hOCR) which we use for finetuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.

mikegerber commented 2 years ago

The duplicate lines are fine. The texts were produced from single line text files (not from PAGE / ALTO / hOCR) which we use for finetuning of existing OCR models. And we create synthetic line images for the GT text lines, sometimes several images for the same text line.

While it is a problem that dinglehopper struggles with larger texts (e.g. newspaper pages), it is also best to feed it smaller texts when possible, as seems to be the case here: alignment is O(length(GT) x length(OCR)).

Is this your use case?

If so, there are better ways to do it than concatenating the input. I can imagine implementing a special mode to read directories of line texts and summarize the result into one page and one global CER. Could you describe your input before the concatenation? (i.e. "one directory with .gt.txt line texts and one directory with .ocr.txt line texts with the same prefix")
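A hypothetical sketch of what such a mode could look like (this is not an existing dinglehopper option; the directory layout, the file suffixes and the use of rapidfuzz's Levenshtein.distance are assumptions):

```python
from pathlib import Path

from rapidfuzz.distance import Levenshtein  # module path of current rapidfuzz releases


def directory_cer(gt_dir: str, ocr_dir: str) -> float:
    """Global CER over paired line files: sum of distances / sum of GT lengths."""
    total_dist = total_chars = 0
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        # Assumed naming convention: foo.gt.txt in gt_dir pairs with foo.txt in ocr_dir.
        ocr_file = Path(ocr_dir) / gt_file.name.replace(".gt.txt", ".txt")
        gt = gt_file.read_text(encoding="utf-8").strip()
        ocr = ocr_file.read_text(encoding="utf-8").strip()
        total_dist += Levenshtein.distance(gt, ocr)
        total_chars += len(gt)
    return total_dist / total_chars


print(directory_cer("gt/", "frak2021/"))
```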

stweil commented 2 years ago

Here I have one directory with .gt.txt lines and several directories with OCR results (.txt) which were produced with different software / models / process parameters. Each dinglehopper run should compare the GT directory with one of the OCR results directories. And yes, the file prefixes are the same.

But we also have the newspaper page use case.

mikegerber commented 2 years ago

But we also have the newspaper page use case.

Yes, two problems with two distinct solutions :)

maxbachmann commented 2 years ago

@mikegerber I implemented the concept I described above (appears to work, but still needs more testing and some cleanup): https://github.com/maxbachmann/rapidfuzz-cpp/pull/58

Edit: I successfully fuzz tested the new implementation. The improved version is available in v1.9.0: https://github.com/maxbachmann/RapidFuzz/releases/tag/v1.9.0

maxbachmann commented 2 years ago

@mikegerber I tested @stweil's original files with dinglehopper on my laptop which has only 8 GB RAM:

(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:22.97 real,   16.55 user, 7.33 sys,   5013484 mmem

So it is down to around 5 GB of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.

mikegerber commented 2 years ago

@mikegerber I tested @stweil's original files with dinglehopper on my laptop which has only 8 GB RAM:

(base) [max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
  0:22.97 real,   16.55 user, 7.33 sys,   5013484 mmem

So it is down to around 5 GB of memory usage and less than 25 seconds of runtime with the new version of rapidfuzz.

This is great news 😍 I think it was a great decision to use rapidfuzz as the backend library for dinglehopper: all the features I had wished for, with great support and improvements from you!

I'll be on vacation starting Thursday and I'll keep this issue open until I have tested this thoroughly (after the vacation). But for now I've bumped the dependency to rapidfuzz >= 1.9.1.

maxbachmann commented 2 years ago

@mikegerber I finally got around to implementing editops for long sequences using a combination of Hirschberg's algorithm and the current algorithm. It splits the problem into subproblems until they are relatively small (around 2k characters) and then solves them using the existing bit-parallel algorithm. This reduces memory usage to O(N) and, since it jumps around less in memory, improves performance for long sequences as well.

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:06.40 real,   3.65 user,  2.73 sys,   3371976 mmem

improves to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
    0:05.78 real,   3.86 user,  1.91 sys,   92228 mmem
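For readers unfamiliar with the approach, here is a rough pure-Python sketch of the Hirschberg-style splitting described above. The cutoff, the helper names and the use of rapidfuzz's current Levenshtein.editops API for the small subproblems are illustrative assumptions; the actual implementation lives in rapidfuzz-cpp and is bit-parallel:

```python
from rapidfuzz.distance import Levenshtein


def _last_row(s1: str, s2: str) -> list[int]:
    """Last row of the Levenshtein matrix for s1 vs s2, using O(len(s2)) memory."""
    row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        prev_diag, row[0] = row[0], i
        for j, c2 in enumerate(s2, 1):
            cur = min(row[j] + 1, row[j - 1] + 1, prev_diag + (c1 != c2))
            prev_diag, row[j] = row[j], cur
    return row


def hirschberg_editops(s1: str, s2: str, off1: int = 0, off2: int = 0, cutoff: int = 2000):
    """Edit operations transforming s1 into s2 in linear memory (illustrative sketch)."""
    if len(s1) <= cutoff or len(s2) <= cutoff:
        # Small subproblem: fall back to the existing editops implementation.
        return [(op.tag, op.src_pos + off1, op.dest_pos + off2)
                for op in Levenshtein.editops(s1, s2)]
    mid = len(s1) // 2
    # Cost of aligning the first half of s1 against every prefix of s2 ...
    left = _last_row(s1[:mid], s2)
    # ... and the second half against every suffix of s2 (computed on reversed strings).
    right = _last_row(s1[mid:][::-1], s2[::-1])[::-1]
    # Some optimal alignment passes through the split column with minimal combined cost.
    split = min(range(len(s2) + 1), key=lambda j: left[j] + right[j])
    return (hirschberg_editops(s1[:mid], s2[:split], off1, off2, cutoff)
            + hirschberg_editops(s1[mid:], s2[split:], off1 + mid, off2 + split, cutoff))
```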

mikegerber commented 2 years ago

improves to

/usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
  0:05.78 real,   3.86 user,  1.91 sys,   92228 mmem

Absolutely fantastic! I'll depend on the new version once it gets released!

maxbachmann commented 2 years ago

Now that the memory consumption has been reduced from around 45 GB to less than 100 MB and is no longer quadratic in the text length, I think this issue has been resolved :wink:

mikegerber commented 1 year ago

Now that the memory consumption has been reduced from around 45 GB to less than 100 MB and is no longer quadratic in the text length, I think this issue has been resolved 😉

Yeah I think so too, just need to test it again!

mikegerber commented 1 year ago

Using @stweil's example:

% /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
        0:04.15 real,   5.00 user,      1.35 sys,       76916 mmem

I've also tested the files from @cneud which previously exploded to 40 GB of memory usage and made the system swap. Now they run in less than a second!! I'll add them to the test suite.

mikegerber commented 1 year ago

0fd4ea1