timbray / quamina

Home of Quamina, a fast pattern-matching library in Go

Large numbers of values result in addPatterns() slowdown #363

Open · ehiggins0 opened this issue 5 days ago

ehiggins0 commented 5 days ago

Hello,

I have a use case that relies on a large number of values in the patterns, up to around 1-2k values per pattern. With the current method of building the valueMatcher (creating a new DFA for each value and merging it with the existing DFA), the addPattern() calls for multiple patterns can take minutes to complete. This is further exacerbated when using thousands of patterns, for a total of millions of possible values/states.
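
For concreteness, the calls look roughly like this (a sketch: the field name and values are made up, and it assumes the quamina.net/go/quamina import path):

```go
package main

import (
	"fmt"
	"strings"

	"quamina.net/go/quamina"
)

func main() {
	q, err := quamina.New()
	if err != nil {
		panic(err)
	}

	// One pattern whose single field carries ~2000 alternative string values.
	values := make([]string, 0, 2000)
	for i := 0; i < 2000; i++ {
		values = append(values, fmt.Sprintf("%q", fmt.Sprintf("region-%05d", i)))
	}
	pattern := `{"region": [` + strings.Join(values, ",") + `]}`

	// With thousands of patterns shaped like this, these calls add up to minutes.
	if err := q.AddPattern("pattern-1", pattern); err != nil {
		panic(err)
	}
}
```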

I took a couple of hours today to work out a PoC that first constructs a trie for all patterns/values, then converts that trie to a DFA with the same structure as is currently used. In benchmark results, it gives roughly a 60% reduction in time (10.6s --> 4.0s on my laptop) with a ~20% peak-memory penalty (1.2 GB --> 1.47 GB) for 1 million total values, and it is faster in every scenario with more than 250 values. I also see some potential to multithread this, whereas the current method is single-threaded.
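
The core idea is roughly this (an illustrative sketch only, not the actual PoC code; the types here are invented, not Quamina's internals):

```go
// Every literal value is inserted into a byte-wise trie once, and the
// finished trie is then walked a single time to emit DFA states, instead of
// building a one-value DFA per string and merging it into the accumulated
// DFA each time.
type trieNode struct {
	children map[byte]*trieNode
	terminal bool // true if some value ends exactly here
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[byte]*trieNode)}
}

// insert adds one literal value, byte by byte; duplicates are free.
func (t *trieNode) insert(val []byte) {
	node := t
	for _, b := range val {
		next, ok := node.children[b]
		if !ok {
			next = newTrieNode()
			node.children[b] = next
		}
		node = next
	}
	node.terminal = true
}
```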

I'm thinking about a couple of options for implementation:

Is there any interest in seeing this merged in or is there other work currently being done on this issue?

timbray commented 5 days ago

Is there any interest in seeing this merged in or is there other work currently being done on this issue?

No other work, definite interest. To be honest, I assumed the current scheme would run out of gas at some point, but the performance was surprisingly good on reasonable numbers of patterns/values, so I never got around to looking at it. A couple of points:

Do you see addPatterns() working for all the different kinds of patterns or just string literals?

The multithreading potential is clearly interesting for this sort of large-scale operation.
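
Something along the lines of sharding by first byte and building sub-tries concurrently comes to mind; purely a sketch, reusing the hypothetical trieNode/newTrieNode from your sketch above:

```go
import "sync"

// buildTrieParallel shards the values by first byte, builds the sub-tries
// concurrently, then hangs each sub-trie off a shared root.
func buildTrieParallel(values [][]byte) *trieNode {
	var shards [256][][]byte
	for _, v := range values {
		if len(v) == 0 {
			continue // an empty value would need separate handling
		}
		shards[v[0]] = append(shards[v[0]], v)
	}

	root := newTrieNode()
	var wg sync.WaitGroup
	var mu sync.Mutex // guards writes to root.children
	for b := 0; b < 256; b++ {
		vals := shards[b]
		if len(vals) == 0 {
			continue
		}
		wg.Add(1)
		go func(first byte, vals [][]byte) {
			defer wg.Done()
			sub := newTrieNode()
			for _, v := range vals {
				sub.insert(v[1:]) // the first byte is implied by the shard
			}
			mu.Lock()
			root.children[first] = sub
			mu.Unlock()
		}(byte(b), vals)
	}
	wg.Wait()
	return root
}
```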

Anyhow, I encourage you to have a shot at this. Sounds like it will be reasonably easy for you to create unit-test cases. One very minor worry: the Quamina unit tests are up to ~14 seconds on my M2 Mac, and I hate having long-running unit tests because people should run them every time they lean back in their chair to think, etc. And it sounds like this PR could have a couple of superheavyweight unit tests, which would force me to clean up some of the others that are unnecessarily heavy; I should probably do that anyhow, tbh.

Thanks for the interest! If it's not a secret, what's the application?

ehiggins0 commented 4 days ago

My concern regarding the addPatterns() implementation would be the case where it is called multiple times (say, with batches of patterns totaling 1M values each), which would result in merges between multiple massive FAs. If this were implemented as an option to quamina.New(), it would force only a single bulk add at construction time, and addPattern() could then be used for the smaller changes made at runtime.
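
Something like this is the shape I have in mind; WithBulkPatterns is a made-up name, not an existing Quamina option:

```go
// Hypothetical sketch: all of the huge value sets are handed over once at
// construction, so the matcher is built in a single pass, and AddPattern
// stays available for small changes at runtime.
func newMatcherWithBulk(bulk map[string]string) (*quamina.Quamina, error) {
	q, err := quamina.New(
		quamina.WithBulkPatterns(bulk), // invented option, does not exist today
	)
	if err != nil {
		return nil, err
	}
	// Incremental changes after startup still use the existing path.
	if err := q.AddPattern("late-addition", `{"region": ["r-42"]}`); err != nil {
		return nil, err
	}
	return q, nil
}
```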

I will do my best to implement the other pattern types as well, but I am not sure how much of what I've written thus far will translate over. I think this will require a look at the structures currently being created for the other pattern types before I know for sure. Right now my focus is on strings, though.

Regarding the tests, I currently have them implemented as a comparison between the output of my new functions and the existing functions. Before I open a PR I'll move all the intensive tests to benchmarks, so they shouldn't affect the time it takes to run the standard tests.
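
For example, something like this (makeLargePattern is an invented helper that builds one pattern whose field carries n string values), so plain go test stays fast and the big path only runs when explicitly benchmarked:

```go
package quamina_test

import (
	"testing"

	"quamina.net/go/quamina"
)

// Run with: go test -bench=ManyValues -benchtime=1x
func BenchmarkAddPatternManyValues(b *testing.B) {
	pattern := makeLargePattern(1_000_000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		q, err := quamina.New()
		if err != nil {
			b.Fatal(err)
		}
		if err := q.AddPattern("big", pattern); err != nil {
			b.Fatal(err)
		}
	}
}
```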

This is for matching and filtering events to geographical areas, where the event is a single point (lat/lon) and the patterns are geohash regions that need to be matched against. There's probably an opportunity for an implementation that relies strictly on the coordinates (i.e., point-in-polygon), but the issue is that the polygons are of variable complexity (they could be MBs in size). Maybe if the need arises we'll contribute some geo patterns to handle geographic events...
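
Concretely, the data looks roughly like this (field name and geohash values are invented, and the real cell sets are far larger):

```go
// Each region is covered by a set of geohash cells; the pattern enumerates
// those cells, and each event carries the geohash of its lat/lon point.
func matchRegions(q *quamina.Quamina) ([]quamina.X, error) {
	pattern := `{"geohash": ["9q8yyk", "9q8yym", "9q8yyq", "9q8yyr"]}`
	event := []byte(`{"geohash": "9q8yym", "lat": 37.7793, "lon": -122.4193}`)

	if err := q.AddPattern("sf-downtown", pattern); err != nil {
		return nil, err
	}
	return q.MatchesForEvent(event) // should include "sf-downtown"
}
```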

timbray commented 4 days ago

Interesting. The most straightforward approach for addPatterns() is that you do your trie magic on all the pure string patterns and then merge the result with any that are more complicated. There are lots of functions like makeWildcardFA() and makePrefixFA() and so on. We'll document that this is optimized for pure strings, and maybe we'll figure out a way to apply your approach to other kinds of patterns later.
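
Roughly this kind of flow, where every name is a placeholder rather than actual Quamina internals:

```go
// Pure string literals all go through the trie path in one pass; other kinds
// keep the existing one-value constructors; the two results are merged once
// at the end. mergeTwo(nil, x) is assumed to just return x.
func buildFieldMatcher(values []patternValue) *machine {
	var literals [][]byte
	var others *machine
	for _, v := range values {
		switch v.kind {
		case literalKind:
			literals = append(literals, []byte(v.text))
		default: // wildcard, prefix, shellstyle, ...
			others = mergeTwo(others, buildOneValueFA(v)) // existing per-value path
		}
	}
	if len(literals) == 0 {
		return others
	}
	return mergeTwo(others, faFromTrie(buildTrie(literals))) // new trie-based path
}
```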

As for restricting this to quamina.New(), I'm a bit dubious… If you have 250K values now and another 250K in an hour, is it more efficient to build from scratch with all 500K or to merge the two 250K automata?

Maybe related: Quamina has an API to delete patterns, which is done in a brute-force way by remembering all the patterns, simply suppressing the matches to deleted patterns, and after a while rebuilding. That's the code in pruner.go. I wonder if it could be used. Obviously the memory cost of remembering the patterns would be significant. Hmm…

Another issue: I did some profiling of automaton merging and found some wins, but eventually lost interest because it was efficient enough not to be a bottleneck. I'd love to do some profiling of your million-value "takes minutes" scenario, or at least look at the .prof output. There may be low-hanging fruit in mergeFAs().
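
If you can capture a profile, something minimal like this would do it (loadPatterns is a stand-in for whatever reproduces your scenario); then run go tool pprof cpu.prof on the output:

```go
package main

import (
	"os"
	"runtime/pprof"

	"quamina.net/go/quamina"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	q, err := quamina.New()
	if err != nil {
		panic(err)
	}
	for name, pattern := range loadPatterns() { // stand-in: returns the slow pattern set
		if err := q.AddPattern(name, pattern); err != nil {
			panic(err)
		}
	}
}
```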