pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improve regex filtering performance #5613

Closed mjkanji closed 1 year ago

mjkanji commented 2 years ago

Problem description

While benchmarking Polars vs. DuckDB to find any rows that contain any of a set of keywords/phrases, I noticed that Polars' str.contains(some_regex) implementation becomes progressively worse as the pattern grows more complex, whereas DuckDB takes the same time to filter on a pattern regardless of how complex it is. I'm wondering if regex filters in Polars could be improved somehow (maybe by looking at how DuckDB handles them)?

I also tried parallelizing the search so that each keyword in the complex pattern was handled separately (to leverage Polars' excellent multi-threading performance), but this was also significantly worse than DuckDB.

As an example,

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feugiat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu facilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rhoncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
 
# Create 50+ two-word patterns to match against by pairing consecutive words
regex_list = list(zip(*[iter(text.split(" "))] * 2))
regex_list = [r"\b" + r"\s".join(x) + r"\b" for x in regex_list]

# Simple case: just one keyword
# Polars is really fast: 2.4s
df.filter(pl.col("some_col").str.contains(regex_list[0]))

# But if I include all 50+ keywords, it becomes significantly worse

# Option 1: combine them all into one pattern by joining on "|"
# Only uses a single core
# 2min 35s on my 12M rows dataset
regex = "|".join(regex_list)
df.filter(pl.col("some_col").str.contains(regex))

# Option 2: create multiple conditions and chain OR them
# Now, all cores are used
# 2min 5s on my 12M rows dataset
from functools import reduce
 
def chain_or(a, b):
    return a | b
 
def polarify_regex(reg):
    return pl.col("some_col").str.contains(reg)
 
# Multiple pl.col().str.contains() filters chained together with the | operator
expr_list = list(map(polarify_regex, regex_list))
combined_filter = reduce(chain_or, expr_list)
df.filter(combined_filter)

DuckDB was able to handle even the more complex regex in ~2 seconds, and its time to finish the filter didn't change whether it was a single keyword/phrase or all 50 merged into one.

Note that I also tried using collect(allow_streaming=True) with the lazy API, but it made no difference for Option 1 and the workload was still single-threaded. Option 2 was parallelized, but I believe that's because of how the filter is set up; whether allow_streaming is used or not seems to have no impact on the performance.
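
For reference, a minimal sketch of the lazy/streaming variant I mean (same df and regex as Option 1; the keyword argument matches the Polars version used here):

# Option 1 expressed lazily, with the streaming engine allowed.
result = (
    df.lazy()
    .filter(pl.col("some_col").str.contains(regex))
    .collect(allow_streaming=True)
)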

~Also, while DuckDB is faster, it also consumed a significantly larger amount of memory (3-4GB more, in my case) than Polars on this task, so maybe there's a tradeoff here. But if there's speed left on the table, it would be great to explore possible optimizations (or give users the option to trade more RAM use for faster processing).~

(Upon further inspection, DuckDB's excessive RAM use was coming from its inefficient implementation of lag/lead, compared to Polars' no-copy shift. Since this post is about regex filtering, I don't think that's relevant here.)

See this SO post and specifically this comment and the ones that follow it for more details and concrete numbers.

Edit: Updated the numbers to focus solely on the regex filters (and remove the other parts of the transformation for both DuckDB and Polars). Also clarified that DuckDB's excessive memory use was resulting from those other parts of the transform, and not the filtering operation itself.

pl.show_versions()

```
---Version info---
Polars: 0.14.31
Index type: UInt32
Platform: Linux-4.14.294-220.533.amzn2.x86_64-x86_64-with-glibc2.28
Python: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.2
numpy: 1.23.5
fsspec: 2022.11.0
connectorx:
xlsx2csv:
matplotlib: 3.6.2
```

ritchie46 commented 2 years ago

Do you also have a DataFrame that you run the benchmark on? I ask because I think the distribution of the strings you search on is also important for performance.

mjkanji commented 2 years ago

Hi @ritchie46, unfortunately, the data is private and cannot be shared. In general, though, here are some stats about the distribution of the lengths of the column, which are all Unicode strings:

(
    df
    .select(pl.col("body").str.lengths())
    .select([
        pl.col("body").min().suffix("_min"),
        pl.col("body").max().suffix("_max"),
        pl.col("body").median().suffix("_median"),
        pl.col("body").mean().suffix("_mean"),
        pl.col("body").var().suffix("_var"),
        pl.col("body").skew().suffix("_skewness"),
    ])
)

shape: (1, 6)
┌──────────┬──────────┬─────────────┬───────────┬──────────┬───────────────┐
│ body_min ┆ body_max ┆ body_median ┆ body_mean ┆ body_var ┆ body_skewness │
│ ---      ┆ ---      ┆ ---         ┆ ---       ┆ ---      ┆ ---           │
│ u32      ┆ u32      ┆ f64         ┆ f64       ┆ f64      ┆ f64           │
╞══════════╪══════════╪═════════════╪═══════════╪══════════╪═══════════════╡
│ 1        ┆ 821896   ┆ 16.0        ┆ 35.556744 ┆ 1.5757e6 ┆ 304.124384    │
└──────────┴──────────┴─────────────┴───────────┴──────────┴───────────────┘

As you can see, it's a very skewed distribution, but most of the strings are relatively short. I could maybe come up with a synthetic lorem ipsum dataset that mimics this distribution and formalize these comparisons (a rough sketch of one way to generate such data is below), though I don't see how that would change the results significantly.
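
A rough, untuned sketch of generating such data (log-normal lengths as a stand-in for the skewed distribution; the parameters are guesses, not fitted to the stats above):

import numpy as np
import polars as pl

rng = np.random.default_rng(0)
words = text.split(" ")  # lorem ipsum words from the first post

# Skewed string lengths (in words); lognormal parameters are a rough guess, not fitted.
n_rows = 100_000
lengths = np.clip(rng.lognormal(mean=1.5, sigma=1.0, size=n_rows).astype(int), 1, None)

synthetic = pl.DataFrame({
    "body": [" ".join(rng.choice(words, size=n)) for n in lengths]
})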

Also, I edited the OP above to clarify that after removing the logic for the lag columns to focus solely on the regex filters, DuckDB no longer uses large amounts of memory. So, focusing on just the regex filters, DuckDB is able to filter on even this very skewed dataset with exceptional efficiency, all on a single core.

ritchie46 commented 2 years ago

Could you create a dummy dataset that shows the slowdown you describe?

mjkanji commented 2 years ago

Hi @ritchie46, please see this notebook. It uses the Yelp reviews dataset. Note that you'll need a machine with at least 8GB of RAM (maybe even more because of the initial ndjson to Parquet conversion) to run it. I used an ml.m5.xlarge machine.

The results for the complex regex were:

- DuckDB: 8 s
- Polars with a single, combined regex: 1 min 38 s
- Polars with multiple filters chained together: 4 min 1 s

You can see that DuckDB's compute time is the same regardless of how complex the regex is. Polars, in contrast, really starts to suffer with more complex regex.

(On the plus side, for some reason, even the polars_single_filter variant is now parallelized and using all cores, so it seems allow_streaming=True is working more reliably now. In my previous testing, this wasn't the case, and I don't really know what changed; I recently upgraded to 0.14.31, but it was still single-core in my earlier tests. Using a single combined regular expression is also now faster than chaining multiple filters together.)

ritchie46 commented 2 years ago

I don't have a yelp account. Can you compress the dataset and send it here?

braaannigan commented 2 years ago

@ritchie46 I've attached a subset that shows the same behaviour: review.csv

import polars as pl
df = pl.read_csv("review.csv")
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feugiat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu facilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rhoncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."

# Create 50+ patterns to match against
regex_list = list(zip(*[iter(text.split(" "))] * 2))
regex_list = [r"" + r"\s".join(x) + r"\b" for x in regex_list]
len(regex_list)

# Single regex
df.filter(pl.col("text").str.contains(regex_list[0]))
# Combine regexes
regex = "|".join(regex_list)
# Slower on combination
df.filter(pl.col("text").str.contains(regex))
ghuls commented 2 years ago

@mjkanji Probably you are just paying the price for advanced unicode support in regex.

# Creating a filter without `\b`.
In [22]: text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feugiat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu facilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rhoncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
    ...: 
    ...: # Create 50+ patterns to match against
    ...: regex_list2 = list(zip(*[iter(text.split(" "))] * 2))
    ...: regex_list2 = [r"\s".join(x) for x in regex_list2]
    ...: combined_regex2 = "|".join(regex_list2)

In [20]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, combined_regex).height)
    ...: 
    ...: 
255
255
255
255
255
27.4 s ± 1.61 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

# Filter without "\b".
In [23]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, combined_regex2).height)
    ...: 
    ...: 
6786
6786
6786
6786
6786
554 ms ± 3.22 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

# Filter first without "\b" and filter those results again with "\b".
In [24]: %%timeit -n 1 -r 5
    ...: print(polars_multiple_filters(polars_multiple_filters(df, regex_list2), regex_list).height)
    ...: 
    ...: 
255
255
255
255
255
7.03 s ± 61.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
ghuls commented 2 years ago

Your regex is also badly constructed, as you add the \b matching around every pattern.

If you write the regex as \b(foo1\sbar1|foo2\sbar2|...)\b, it is almost twice as fast as the original DuckDB filtering.

In [25]: text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tristique risus nec feugiat in fermentum posuere urna nec. Nunc id cursus metus aliquam eleifend mi in nulla. Etiam sit amet nisl purus in mollis nunc sed id. Non arcu risus quis varius quam quisque. At in tellus integer feugiat. Magna ac placerat vestibulum lectus mauris. Convallis posuere morbi leo urna molestie at elementum eu facilisis. Libero enim sed faucibus turpis in eu mi. Est ultricies integer quis auctor elit. Gravida cum sociis natoque penatibus et. Dictumst vestibulum rhoncus est pellentesque elit. Pretium aenean pharetra magna ac placerat vestibulum lectus."
    ...: 
    ...: # Create 50+ patterns to match against
    ...: regex_list3 = list(zip(*[iter(text.split(" "))] * 2))
    ...: regex_list3 = [r"\s".join(x) for x in regex_list3]
    ...: combined_regex3 = r"\b(" + "|".join(regex_list3) + r")\b"

In [19]: %%timeit -n 1 -r 5
    ...: r2 = duckdb_filter(df_table, combined_regex)
    ...: print(r2.shape)
    ...: del r2
    ...: 
    ...: 
(255, 1)
(255, 1)
(255, 1)
(255, 1)
(255, 1)
7.78 s ± 233 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [27]: %%timeit -n 1 -r 5
    ...: r2 = duckdb_filter(df_table, combined_regex3)
    ...: print(r2.shape)
    ...: del r2
    ...: 
    ...: 
(255, 1)
(255, 1)
(255, 1)
(255, 1)
(255, 1)
7.44 s ± 9.38 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [28]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, combined_regex3).height)
    ...: 
    ...: 

255
255
255
255
255
4.67 s ± 105 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
ghuls commented 2 years ago

I know it is just an example, but the currently constructed regex is probably not what you want, as there are unescaped . characters in it (which match any character).

When disabling Unicode support for the regex (ASCII only), the speedup is 9x.

In [43]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, combined_regex3.replace(".", "\.")).height)
    ...: 
    ...: 
255
255
255
255
255
4.66 s ± 113 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [42]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, r"(?-u)" + combined_regex3.replace(".", "\.")).height)
    ...: 
    ...: 
255
255
255
255
255
540 ms ± 11.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

Also "\b(foo1\sbar1|foo2\sbar|...)\b" might become even faster in the future: https://github.com/rust-lang/regex/issues/787

mjkanji commented 2 years ago

I don't have a yelp account. Can you compress the dataset and send it here?

@ritchie46 You don't really need a Yelp account. The download page does ask for your name, email, and initials, but it's not a sign-up/sign-in page. It's just a legal formality to indicate you've accepted their agreement for using the dataset, and they say they do not store your email or use it to contact you in any way.

@braaannigan Probably you are just paying the price for advanced unicode support in regex.

Your regex is also badly constructed, as you add the \b matching around every pattern.

If you write the regex as \b(foo1\sbar1|foo2\sbar2|...)\b, it is almost twice as fast as the original DuckDB filtering.

@ghuls Point taken! I definitely need to get better at working with regex. Unlike your results, though, in my testing Polars is 2x slower than DuckDB for combined_regex3:

# 8.73 s ± 13.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
r2 = duckdb_filter(df_table, combined_regex3)
print(r2.shape, end="\r")
del r2

# 18.9 s ± 66.3 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3).height, end="\r")

I tested again with an even more powerful machine (16vCPU) and Polars was able to beat DuckDB on the above now: 4s vs 7s.

Another interesting tidbit: DuckDB actually gets worse if you remove the \b:

# 8.71 s ± 17.8 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex)
print(r2.shape, end="\r")
del r2

# 30.3 s ± 39.1 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex2)
print(r2.shape, end="\r")
del r2

Though all of this does raise the question: why is DuckDB faster, even with my original, sub-optimal query? Is it somehow optimizing the \b blocks (like you did) before running the query, or is the regex engine it uses simply faster when dealing with \b? If so, can/should Polars make similar optimizations under the hood for regex newbies like me?

And how is it able to do so with just a single core? For example, if I set streaming=False and force Polars to only use one core, it performs much worse (18s to 41s, vs. 8.7s for DuckDB):

# DuckDB for reference. Always single core
# 8.71 s ± 7.21 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
r2 = duckdb_filter(df_table, combined_regex3)
print(r2.shape, end="\r")
del r2

# Polars with streaming=True. All cores
# 18.9 s ± 31.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3, streaming=True).height, end="\r")

# Polars with streaming=False. Single core
# 41.1 s ± 55.4 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(df, combined_regex3, streaming=False).height, end="\r")

When you combine this with the fact that a more powerful machine was able to beat DuckDB simply because it had more cores, it leads me to believe that the regex engine underlying DuckDB is still more efficient (or applies some 'query plan'-style optimizations to the regex), because in an apples-to-apples single-core comparison it does a better job on even the least optimized regex.
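
(For what it's worth, one way to pin Polars itself to a single core for that apples-to-apples test is the thread-pool environment variable; a minimal sketch, with a hypothetical file path and the combined_regex3 built earlier in the thread:)

import os

# POLARS_MAX_THREADS must be set before polars is imported for it to take effect.
os.environ["POLARS_MAX_THREADS"] = "1"

import polars as pl

# Hypothetical path to the converted Yelp reviews; combined_regex3 as defined above.
df = pl.read_parquet("yelp_reviews.parquet")
print(df.filter(pl.col("text").str.contains(combined_regex3)).height)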

mjkanji commented 2 years ago

For anyone just looking for the best approach (and not as interested in the minutiae), the fastest solution was the multi-stage filter (first filter without \b, then with it):

# 2.53 s ± 42.5 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
%%timeit -n 1 -r 5
print(polars_single_filter(polars_single_filter(df, combined_regex2), combined_regex).height, end="\r")
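
(polars_single_filter here is the helper from the benchmark notebook; spelled out with plain Polars expressions, the two-stage idea is roughly the following sketch, using a hypothetical stand-in for that helper:)

# Hypothetical stand-in for the notebook's polars_single_filter helper.
def polars_single_filter(frame: pl.DataFrame, pattern: str) -> pl.DataFrame:
    return frame.filter(pl.col("text").str.contains(pattern))

# Cheap pre-filter without \b, then the strict \b pattern only on the surviving rows.
prefiltered = polars_single_filter(df, combined_regex2)    # patterns without \b
result = polars_single_filter(prefiltered, combined_regex)  # patterns with \b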
ghuls commented 2 years ago

Different regex libraries have different performance and optimization strategies. The Rust regex library is Unicode-aware, so "\s" includes Unicode spaces (e.g. non-breaking spaces) and "\b" probably covers a much bigger set of characters than in the regex library used by DuckDB.

If all your text is just ASCII, you can disable Unicode support and see that the Rust regex crate is 9x faster than with Unicode support for this example.

In [42]: %%timeit -n 1 -r 5
    ...: print(polars_single_filter(df, r"(?-u)" + combined_regex3.replace(".", "\.")).height)
    ...: 
    ...: 
255
255
255
255
255
540 ms ± 11.9 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
mjkanji commented 2 years ago

Different regex libraries have different performance and optimization strategies. The Rust regex library is Unicode-aware, so "\s" includes Unicode spaces (e.g. non-breaking spaces) and "\b" probably covers a much bigger set of characters than in the regex library used by DuckDB.

If all your text is just ASCII, you can disable Unicode support and see that the Rust regex crate is 9x faster than with Unicode support for this example.

Got it! Thank you for the detailed explanations.

My data is Unicode (non-Latin characters, emojis, etc.), but the keywords/phrases I'm looking for are ASCII, and I don't think there would be many edge cases where a Unicode character being treated as a word boundary would be a problem, so I think I'll switch Unicode off.

It is actually quite interesting to see the rather stark difference with Unicode on vs. off, even on the original, unoptimized pattern (1min 38s down to ~2s, a ~50x speedup!):

# original pattern with unicode
# 1min 38s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, combined_regex).height)

# original pattern without unicode
# 2.18 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, r"(?-u)" + combined_regex).height)

# optimised pattern with unicode
# 18.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, combined_regex3).height)

# optimised pattern without unicode
# 2.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
print(polars_single_filter(df, r"(?-u)" + combined_regex3).height)
BurntSushi commented 1 year ago

To add some context here as the regex crate author: Unicode very often leads to worse performance on at least some dimension, but \b is a particularly brutal case. Namely, because it does indeed include the full Unicode definition for what a "word" character is, the faster DFA-based engines inside the regex crate cannot handle it. But the ASCII aware \b, which just requires one byte of look-around, can be handled by the DFAs. So when you use a Unicode-aware \b and you run it on non-ASCII text, the DFA can't be used and the regex crate falls back to a slower but more powerful engine. The slower engine can sometimes be an order of magnitude slower. This likely explains the 9x-50x perf differences y'all are observing.

There are other competing factors in play here too (like literal optimizations), but the Unicode-aware \b is the biggest thorn. It is the only construct that the DFAs can't support in the regex crate.

And indeed, most regex engines make \b ASCII aware by default, or don't even support a Unicode-aware \b at all.
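
(Concretely for the patterns in this thread, the flag can be scoped to just the boundary, keeping Unicode semantics for the rest of the pattern while making only \b ASCII-aware; a minimal sketch, assuming the regex_list3 pieces and df from earlier:)

# ASCII-only word boundaries via scoped flag groups; the alternation itself stays Unicode-aware.
ascii_boundary_regex = r"(?-u:\b)(" + "|".join(regex_list3) + r")(?-u:\b)"
df.filter(pl.col("text").str.contains(ascii_boundary_regex))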