google / guava

Google core libraries for Java

CharMatcher inefficiencies #1223

Closed · gissuebot closed this issue 9 years ago

gissuebot commented 9 years ago

Original issue created by Maaartinus on 2012-12-03 at 10:18 PM


There are some performance problems with the CharMatcher.

1. Missing setBits.

The precomputation may take up to 1/3 millisecond due to the missing anyOf.setBits method. Adding that one-liner makes it 15+ times faster. The classes RangesMatcher, WHITESPACE, BREAKING_WHITESPACE, ANY, and NONE could all easily implement setBits too, but this is less important.
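
As a minimal sketch of the kind of one-liner meant here (the class and field names below are illustrative stand-ins, not Guava's actual internals): without such an override, precomputation presumably has to test all 2^16 code units via matches(c), whereas the matcher already knows exactly which characters it accepts.

```java
import java.util.BitSet;

// Illustrative stand-in for the anyOf matcher, showing the kind of setBits
// override the report asks for. Class and field names are assumptions.
final class AnyOfSketch {
  private final char[] chars;  // the characters this matcher accepts

  AnyOfSketch(char[] chars) {
    this.chars = chars.clone();
  }

  // The one-liner: set the bit for each accepted character directly instead
  // of testing every possible char value.
  void setBits(BitSet table) {
    for (char c : chars) {
      table.set(c);
    }
  }
}
```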

2. Problems of SmallCharMatcher.

2.1. Slow precomputation.

After fixing subissue 1, the precomputation of SmallCharMatcher is still rather slow. For example, for GERMAN_59 (the set of all German letters; see the attached benchmark) it tries all sizes from 59 to 128. This problem is rather inherent: due to the "birthday paradox", the search for a collision-free table has negligible chances of success.
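
For a sense of why the search is nearly hopeless, the standard birthday-paradox approximation puts the probability of hashing n distinct characters into m slots without any collision at roughly exp(-n(n-1)/(2m)), assuming the hash behaves like a uniformly random function (an idealization, not a property of the actual hash):

```java
// Birthday-paradox estimate of the chance that n uniformly hashed keys land
// in m slots with no collision. Sketch only; the real hash is not uniformly
// random, but the orders of magnitude are telling.
public class CollisionOdds {
  static double noCollisionProbability(int n, int m) {
    return Math.exp(-((double) n * (n - 1)) / (2.0 * m));
  }

  public static void main(String[] args) {
    // GERMAN_59: 59 characters, two representative table sizes.
    System.out.printf("m=64:  %.1e%n", noCollisionProbability(59, 64));   // ~2e-12
    System.out.printf("m=128: %.1e%n", noCollisionProbability(59, 128));  // ~1.6e-6
  }
}
```

Even summed over all table sizes from 59 to 128, the odds of ever finding a collision-free layout stay tiny under this approximation (roughly 1e-5), so almost all of the precomputation time is spent on doomed attempts.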

2.2. Low success chances.

The attached SmallMatcherDemo shows that for more than 30 randomly chosen chars, using SmallCharMatcher makes hardly any sense: you spend quite some time in the precomputation only to, most probably, end up with table.length=128 and reprobe=true. It's even easy to construct a contrived example where this happens for only 13 chars (reprobe=true despite a resulting load factor of about 0.1).

3. No need for SmallCharMatcher.reprobe.

If SmallCharMatcher fails to fit the chars without collisions, it falls back to reprobing. But SmallCharMatcher.from would do better to give up, since MediumCharMatcher can do the job faster by using a bitwise & instead of % (with reprobe=true the modulus always equals 128).
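
A minimal sketch of the index-computation contrast meant here (method names are illustrative); the masking variant is only valid when the table length is a power of two:

```java
// Computing a slot index: modulus vs. mask. With a power-of-two table length,
// the mask gives the same slot distribution but avoids the integer division.
final class IndexSketch {
  static int slotWithModulus(int hash, int tableLength) {
    return (hash & Integer.MAX_VALUE) % tableLength;  // strip the sign, then divide
  }

  static int slotWithMask(int hash, int tableLength) {
    return hash & (tableLength - 1);  // tableLength must be a power of two
  }
}
```

Since reprobe=true means the table is always 128 slots anyway, giving up and letting the &-based matcher handle that case trades nothing away.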

4. No savings from avoiding the BitSetMatcher.

Since Issue 1049, the old criterion for using the hashing matchers doesn't make much sense anymore. By using the hashing matchers you hope to trade speed for substantial memory savings, but the BitSetMatcher trims its table and thus often needs less memory than the others.

As an example, consider GERMAN_69 (the set of all German letters and digits). Here, totalChars=69 gets doubled and rounded up to the next power of two, thus the resulting MediumCharMatcher takes 256 chars, i.e., 512 bytes (plus some fixed overhead). As all the included characters lie below 256, the BitSetMatcher would trim its table to 256/64=4 longs, i.e., 32 bytes.

So the created matcher is both over 20 times slower and (ignoring the fixed overhead) 8 times bigger than necessary. This is a special case caused by using Latin-1 characters only, but similar cases might be quite common, as Unicode was designed so that the most common chars have small code points (unless one is dealing with a language that is, from this standpoint, exotic).
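
A rough check of the arithmetic above, under the stated assumptions (the hashing matcher allocates a char table sized to the next power of two at or above 2 * totalChars; the BitSetMatcher packs one bit per character up to the highest matched code point):

```java
// Back-of-the-envelope sizes for GERMAN_69 (69 chars, all below code point 256).
// The formulas mirror the description above, not measured object layouts.
public class SizeEstimate {
  public static void main(String[] args) {
    int totalChars = 69;
    int maxChar = 255;  // all German letters and digits fit into Latin-1

    // Hashing matcher: 2 * 69 = 138, rounded up to the next power of two = 256 slots.
    int tableSlots = Integer.highestOneBit(2 * totalChars - 1) << 1;
    int hashingBytes = tableSlots * Character.BYTES;  // 256 * 2 = 512 bytes

    // BitSetMatcher: one bit per char up to maxChar, packed into longs.
    int words = (maxChar + 64) / 64;        // 4 longs cover code points 0..255
    int bitSetBytes = words * Long.BYTES;   // 4 * 8 = 32 bytes

    System.out.println(hashingBytes + " bytes vs. " + bitSetBytes + " bytes");
  }
}
```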

5. Bad performance of both hashing matchers.

Both hash-based matchers get slow due to a cause similar to Issue 1155:

The problem is caused by (1) a big contiguous filled area of the table and (2) linear reprobing. Any lookup that starts near the beginning of such a contiguous run takes a very long time, as it has to traverse to the run's end.

While such runs might be rare in the case of a general ImmutableSet, with character ranges they might be quite common.
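
A minimal sketch of the lookup pattern being described, assuming an open-addressed char table with linear probing (names and the mixing constant are illustrative, not Guava's actual code):

```java
// Illustrative linear-probing lookup: a query that hashes to the start of a
// long contiguous run of occupied slots must walk the whole run before it can
// report a miss. With character ranges packed by a weak hash, such runs can
// be long. Assumes the table always has at least one empty slot and that the
// NUL char is never stored (0 marks "empty").
final class LinearProbeSketch {
  private final char[] table;

  LinearProbeSketch(char[] table) {
    this.table = table;
  }

  boolean matches(char c) {
    int mask = table.length - 1;   // table length is a power of two
    int i = smear(c) & mask;
    while (table[i] != 0) {        // scan until an empty slot ends the run
      if (table[i] == c) {
        return true;
      }
      i = (i + 1) & mask;          // linear reprobe
    }
    return false;
  }

  private static int smear(int hashCode) {
    return hashCode * 0x1b873593;  // cheap mixing; constant chosen for illustration
  }
}
```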

The solution used for ImmutableSet would work here too, but it would be relatively more costly (the fix for ImmutableSet was simply replacing Hashing.smear with a comparably expensive function, whereas here it would mean adding such a function). I hope to be able to offer a good solution soon.

NOTES

The patch is meant to demonstrate the issue, which makes it much larger than the proposed changes themselves need to be.

The benchmark has two parts: measuring the precomputation time and measuring the performance of countIn(s), where s is a permutation of all 2^16 chars. The latter effectively measures the speed of matches.
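
A minimal sketch of what the countIn half of such a measurement might look like (not the attached benchmark itself; a real measurement would use a harness such as Caliper rather than a bare nanoTime call):

```java
import com.google.common.base.CharMatcher;
import java.util.Random;

// Build a permutation of all 2^16 char values, then time countIn on a
// precomputed matcher. Since countIn tests every char of the input, this
// effectively measures the speed of matches.
public class CountInSketch {
  public static void main(String[] args) {
    char[] all = new char[1 << 16];
    for (int i = 0; i < all.length; i++) {
      all[i] = (char) i;
    }
    // Fisher-Yates shuffle so the input is not in code-point order.
    Random rnd = new Random(42);
    for (int i = all.length - 1; i > 0; i--) {
      int j = rnd.nextInt(i + 1);
      char tmp = all[i];
      all[i] = all[j];
      all[j] = tmp;
    }
    String s = new String(all);

    CharMatcher matcher = CharMatcher.anyOf("äöüÄÖÜß").precomputed();
    long start = System.nanoTime();
    int count = matcher.countIn(s);
    System.out.println(count + " matches in " + (System.nanoTime() - start) + " ns");
  }
}
```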

gissuebot commented 9 years ago

Original comment posted by lowasser@google.com on 2012-12-03 at 10:21 PM


I'm not convinced all of these are worth pursuing, largely because I'd tend to expect CharMatcher.precomputed() to be called at most once per CharMatcher.

I'll look into these, though.


Status: Accepted Owner: lowasser@google.com Labels: Type-Performance, Package-Base

gissuebot commented 9 years ago

Original comment posted by Maaartinus on 2012-12-12 at 08:01 AM


Only the first two subissues deal with the timing of the precomputed() method. I'm not claiming it's important, but the fix is shorter than this paragraph.

The other thing is that sometimes you lose performance after precomputation (e.g., for the set of all German letters you get a 5x slowdown).

gissuebot commented 9 years ago

Original comment posted by lowasser@google.com on 2012-12-14 at 06:19 PM


https://github.com/google/guava/commit/7218498c1e19a789a5064ae750db7e929e2e8287


Status: Fixed

gissuebot commented 9 years ago

Original comment posted by lowasser@google.com on 2012-12-26 at 05:02 PM


(No comment entered for this change.)


Labels: Milestone-Release14