Closed: mgautierfr closed this pull request 1 year ago
:exclamation: No coverage uploaded for pull request base (main@2828a34). Patch coverage: 57.14% of modified lines in pull request are covered.
:exclamation: Current head 83a4da7 differs from pull request most recent head fca574e. Consider uploading reports for the commit fca574e to get more accurate results.
@rgaudin It would be nice to know the real improvement in speed (if any) with a more common content.
Would be very interesting but can be done post-merge right?
I believe that a better fix would be to take advantage of the overload of icu::Transliterator::transliterate() that takes an extra string parameter that can be transliterated into another string. It took me a while to understand the documentation but I guess that it will do the job.
I believe that a better fix would be to take advantage of the overload of icu::Transliterator::transliterate() that takes an extra string parameter that can be transliterated into another string. It took me a while to understand the documentation but I guess that it will do the job.
I'm not sure it would be better (or I misunderstand something).
If we use it, it means we would need:
* A "final" transliterated UnicodeString. (to convert to utf8 at the end) * A temporary untransliterated UnicodeString to put our batch. * A source untransliterated UnicodeString to generate our temporary from.
The current method avoids the "final" UnicodeString, as we convert each batch to utf8 directly.
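To make the comparison concrete, here is a minimal sketch of the batching idea described above. The function name, the slicing helper, and the exact 4K batch handling are illustrative assumptions, not the PR's actual code; it only shows why no "final" UnicodeString is needed when each batch is converted to utf8 right away.

```cpp
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <string>

// Sketch: transliterate `input` in ~4K-code-unit batches and append each
// transliterated batch directly to a UTF-8 output, so no big "final"
// UnicodeString is ever materialized.
std::string removeAccentsBatched(const icu::Transliterator& trans,
                                 const icu::UnicodeString& input)
{
    constexpr int32_t batchSize = 4 * 1024;
    std::string out;
    for (int32_t start = 0; start < input.length(); start += batchSize) {
        // Copy one batch into a small temporary UnicodeString.
        // (A real implementation must avoid splitting a surrogate pair or
        // separating a base character from its combining marks here.)
        icu::UnicodeString batch(input, start, batchSize);
        // In-place transliteration is cheap because the batch is small.
        trans.transliterate(batch);
        // Convert this batch to utf8 and append it to the final result.
        batch.toUTF8String(out);
    }
    return out;
}
```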
My understanding was that the said overload of transliterate() was introduced to address the performance problems with in-place transliteration. I thought that transliteration of the insertion string was performed in a transliterate-and-append fashion. However, looking at the source code, I see that the insertion string is first added to the main text and transliteration is then performed on the modified input, which of course leaves no possibility for a more efficient implementation.
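For reference, a rough sketch of how that insertion overload is meant to be used for incremental input (the function name and chunking are hypothetical; as noted above, it still works in place on the accumulated text):

```cpp
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <unicode/utrans.h>  // UTransPosition
#include <vector>

// Sketch: feed chunks through Transliterator::transliterate(text, pos,
// insertion, status). ICU appends each chunk to `text` and transliterates
// as much as it unambiguously can; `pos` tracks the pending tail.
icu::UnicodeString transliterateIncrementally(
        const icu::Transliterator& trans,
        const std::vector<icu::UnicodeString>& chunks)
{
    icu::UnicodeString text;
    UErrorCode status = U_ZERO_ERROR;
    UTransPosition pos = {0, 0, 0, 0};  // contextStart, contextLimit, start, limit
    for (const auto& chunk : chunks) {
        trans.transliterate(text, pos, chunk, status);
        if (U_FAILURE(status))
            break;
    }
    trans.finishTransliteration(text, pos);  // flush whatever is still pending
    return text;
}
```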
Tests have been updated.
20M/10 [number of replacements] * 20M/2 [number of chars to move]
Where does the second part 20M/2 come from?
IMO, the accent character replacement should have O(1) complexity.
Batch or no batch should give no significant boost.
Any real data/statistics to support this?
Where does the second part 20M/2 come from?
This is the mean length of the data after the transformed char that we have to move.
IMO, the accent character replacement should have O(1) complexity.
IMO, no :) Except if you speak about removing the accents of only one char. At best, accent removal should be O(N) because we need to go through all chars in the string.
The number of accents to remove (and so the number of transformations) is N/10, so it is O(N). But for each transformation we have to move all chars after the transformation; the mean number of chars to move is N/2 (mean of N, N-10, N-20, ..., 10, 0), so it is O(N). So removing accents from a string is O(N²).
If we use batches, the removal of accents is still O(N²) within a batch, but we are bounded by the size of the batch (so O(1) per batch). So the "variable" part is the number of batches, which is N/(4*1024), so O(N).
So using batches transforms our initial O(N²) algorithm into an O(N) one.
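Restated as formulas (a sketch of the same reasoning, assuming one accent every 10 chars and a batch size B = 4*1024):

```latex
\text{moves}_{\text{no batch}} \approx \frac{N}{10}\cdot\frac{N}{2}
                               = \frac{N^2}{20} = O(N^2)

\text{moves}_{\text{batched}}  \approx \underbrace{\frac{N}{B}}_{\text{batches}}
                                 \cdot \underbrace{\frac{B}{10}\cdot\frac{B}{2}}_{\text{moves per batch}}
                               = \frac{N B}{20} = O(N)
```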
So removing accents from a string is O(N²).
Suppose the string has length N. When removing accents, iterating over each character gives O(n), and for each character the accent removal should use a built-in hashmap to achieve this, which is O(1). The overall complexity is O(n).
I notice that we use icu to remove accents ([:M:]). If icu does not perform well in such a situation (long strings), maybe report this to the icu community?
When removing accents, iterating over each character gives O(n), and for each character the accent removal should use a built-in hashmap to achieve this, which is O(1). The overall complexity is O(n).
This is not how it works. We use a transliterator with the rules Lower; NFD; [:M:] remove; NFC.
So we have 4 transformations being applied to a string:
* Lower: "À is an accented letter, 가 is not" -> "à is an accented letter, 가 is not"
* NFD: "à is an accented letter, 가 is not" -> "a` is an accented letter, ᄀ ᅡ is not" (notice the "new" characters, so all text after is pushed right)
* [:M:] remove: "a` is an accented letter, ᄀ ᅡ is not" -> "a is an accented letter, ᄀ ᅡ is not" (notice the removed `, so all text after is pushed left)
* NFC: "a is an accented letter, ᄀ ᅡ is not" -> "a is an accented letter, 가 is not"

We don't use a hashmap (and it is impossible, because an "accent" can be added to any letter, even composed: ſ+◌̣+◌̇ == ſ+◌̇+◌̣ == ẛ̣).
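For anyone wanting to reproduce this, a minimal standalone sketch applying exactly that rule chain with ICU (error handling trimmed down; the compound ID string is the one quoted above):

```cpp
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Compound transliterator ID used in this discussion:
    // lowercase, decompose, drop all marks ([:M:]), recompose.
    std::unique_ptr<icu::Transliterator> trans(
        icu::Transliterator::createInstance(
            "Lower; NFD; [:M:] remove; NFC", UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !trans)
        return 1;

    // Source file assumed to be UTF-8 encoded.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(
        "À is an accented letter, 가 is not");
    trans->transliterate(text);  // in-place transliteration

    std::string out;
    text.toUTF8String(out);
    std::cout << out << std::endl;  // "a is an accented letter, 가 is not"
    return 0;
}
```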
I would expect icu to internally use a hashmap to achieve this.
À lowered will produce à; normalized ('NFD') it would give two characters, a + accent; removing the mark (the accent) would produce a plain a. Iterate the string and create a new string, without shifting the remaining characters left after each single-character replacement.
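If I read the suggestion correctly, it amounts to something like the sketch below, which builds a fresh string instead of editing in place. This uses icu::Normalizer2 plus a general-category check rather than the Transliterator, omits the lowercasing step, and is only an illustration of the idea, not what the PR implements:

```cpp
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <unicode/utf16.h>

// Sketch of the "build a new string" idea: decompose once, then copy every
// code point that is not a combining mark into a fresh UnicodeString, so
// nothing is ever shifted in place.
icu::UnicodeString removeAccentsByCopy(const icu::UnicodeString& input,
                                       UErrorCode& status)
{
    const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status))
        return icu::UnicodeString();

    const icu::UnicodeString decomposed = nfd->normalize(input, status);
    icu::UnicodeString result;
    for (int32_t i = 0; i < decomposed.length(); ) {
        const UChar32 c = decomposed.char32At(i);
        if (!(U_GET_GC_MASK(c) & U_GC_M_MASK))  // drop all marks (Mn, Mc, Me)
            result.append(c);
        i += U16_LENGTH(c);
    }
    return nfc->normalize(result, status);      // recompose what is left
}
```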
Old issue created on ICU about that: https://unicode-org.atlassian.net/browse/ICU-8776. It was closed with a suggestion along the lines of what we are doing here.
Got it :-)
The only slight concern with the new unit test is that it depends on implementation details (batching scheme). Other than that we are good to merge after the PR is rebased and fixed-up.
I agree with the concern. But sometimes white-box testing is the thing to do. This is a pretty self-contained feature, so I think we are ok with that.
I'm merging.
My use case is a 20Mb french text string to index; on my computer, a simple "transliterate" takes around 50 minutes to finish.
If we assume a ratio of 1/10 letters with accents, a 20 million char text would need 20M/10 [number of replacements] * 20M/2 [number of chars to move] moves. It is something like 2*10^13 char moves.
By using batches of 4K, we have 4K/10 * 4K/2 = 0.8M moves per batch. And we have 20M/4K = 5000 batches. So a total of 5000 * 0.8M = 4G (4*10^9) moves. The ratio is 5000, so we could expect the code to run 5000 times faster.
In practice, it goes from 49 minutes to 3 or 4 seconds.
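Writing out the same arithmetic (20M chars, one accent per 10 chars, 4K batches):

```latex
\frac{20\mathrm{M}}{10}\times\frac{20\mathrm{M}}{2}
  \approx 2\times10^{6}\times10^{7} = 2\times10^{13}\ \text{moves without batching}

\frac{20\mathrm{M}}{4\mathrm{K}}\times\left(\frac{4\mathrm{K}}{10}\times\frac{4\mathrm{K}}{2}\right)
  \approx 5000\times0.8\mathrm{M} = 4\times10^{9}\ \text{moves with batching},
\qquad \frac{2\times10^{13}}{4\times10^{9}} = 5000
```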
@rgaudin It would be nice to know the real improvement in speed (if any) with a more common content.