google / re2j

linear time regular expression matching in Java

Future work #17

Open sopel39 opened 8 years ago

sopel39 commented 8 years ago

Dear Community,

As some of you know, we at Teradata are planning to incorporate RE2J into the Presto database (https://github.com/facebook/presto), since it is based on the lightning-fast RE2 and has the potential for huge performance improvements. As part of our research on RE2J we found that it currently lacks many of the algorithms and optimizations that make RE2 so fast. Those optimizations require byte matching rather than the rune matching RE2J uses today. Therefore, as part of our work, we would like to make RE2J internally work on UTF8 bytes.

Additionally, we would like input data to be represented by Slice (https://github.com/airlift/slice). Slice is basically a wrapper around a byte[] array that can be sliced, which would allow us to match raw bytes directly. We believe these changes will make RE2J one of the fastest matching solutions for Java in high-performance big data scenarios. However, our approach might break the current functionality of matching Java UTF16 Strings. We might add support for it later, but it will be less efficient than raw byte matching because a conversion from Strings to Slices is required. Our primary focus would be on matching UTF8 sequences.

Our POC branch with RE2J on Slices is here: https://github.com/Teradata/re2j/tree/re2j_on_bytes. What do you think about our plan?
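
To make this more concrete, here is a minimal sketch of what input data looks like when wrapped in a Slice (based on the io.airlift.slice API; the matching entry point itself is omitted, since the POC API is still in flux):

```java
import io.airlift.slice.Slice;
import io.airlift.slice.Slices;

import static java.nio.charset.StandardCharsets.UTF_8;

public class SliceInputExample {
    public static void main(String[] args) {
        // Wrap an existing byte[] without copying it.
        byte[] data = "hello, wörld".getBytes(UTF_8);
        Slice wrapped = Slices.wrappedBuffer(data);

        // Or build a Slice directly from a String (this encodes/copies once).
        Slice fromString = Slices.utf8Slice("hello, wörld");

        // Zero-copy view of a sub-range: no bytes are moved, only offsets change.
        Slice sub = wrapped.slice(7, wrapped.length() - 7);

        // Byte-level access is a plain indexed read, which is what a
        // byte-oriented matcher would do in its inner loop.
        byte first = sub.getByte(0);

        System.out.println(sub.toStringUtf8() + " starts with byte 0x"
                + Integer.toHexString(first & 0xFF));
    }
}
```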

Here are some FAQs that further explain our motivation:

Q: Why Slice?
A: Slices are effectively a wrapper around a memory fragment. Since we would like RE2J to match UTF8 bytes instead of runes, it is logical to use a structure that provides fast access to bytes and allows slicing. Slice is such a structure, and it is already used in Presto.

Q: Why not a raw byte[] array?
A: Because a raw byte[] array cannot be sliced without copying memory, and we aim for maximum performance.

Q: Why not ByteBuffer?
A: Because ByteBuffer is an abstract type with multiple implementations, so reading input bytes through it would introduce virtual method calls. That would hurt performance.

Q: Why not keep MachineInput abstract, with multiple implementations?
A: For the same reason: an abstract MachineInput with multiple implementations would introduce virtual method calls and hurt RE2J's performance.

Q: What about matching Java UTF16 Strings?
A: We could compile the matching program specifically for UTF16 byte sequences and convert Strings to Slices. Matching would be efficient, but it would require one memory copy (String to Slice).

Q: Why UTF8?
A: Presto uses UTF8. Additionally, I would risk saying that most text in big data is stored as either ANSI or UTF8, which makes UTF8 our primary target. See http://utf8everywhere.org/.

Q: Why byte matching?
A: The original RE2 matches bytes. Byte matching allows us to port algorithms and optimizations from RE2; without them we won't be able to match RE2's performance.

Q: What about current users of Java Regular Expressions?
A: If you don't care about performance, Java Regular Expressions is probably your number one choice. If you want high-performance matching, you won't choose Java Regular Expressions. RE2J is not a drop-in replacement anyway, since it doesn't support backreferences. Therefore I don't think we should target Java Regular Expressions users.

Q: Will the community accept a dependency on Slice?
A: Absolutely. Slice is a small airlift subproject, so it does not bring much in the way of dependencies. Additionally, I think the big data and high-performance community will care more about a fast UTF8 byte matching solution.

sjamesr commented 8 years ago

Can you rebase your changes on top of the current master branch? It would be nice to see the benchmarks running in Caliper.

sopel39 commented 8 years ago

I will rebase that on Monday. Currently there is only the NFA, so the performance of matching UTF8 bytes is worse than matching runes, because the algorithm fans out into NFA threads for UTF8 byte ranges (just like RE2 does). Our next step is to add the DFA; we expect to see significant performance improvements then.
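
To give a sense of the fan-out: in a byte-oriented program, a rune class such as '.' compiles into one alternative per UTF8 encoding length, each a sequence of byte ranges, and the NFA spawns a thread per alternative. A simplified illustration (not the actual compiled program; it ignores overlong encodings and the surrogate hole):

```java
public class Utf8AnyRune {
    /**
     * Returns the number of bytes the "any rune" alternative consumes at
     * position pos, or -1 if the bytes there are not a (simplified) UTF8
     * sequence. Each branch below corresponds to one alternative the NFA
     * fans out into when it sees a rune class like '.'.
     */
    static int matchAnyRune(byte[] input, int pos) {
        if (pos >= input.length) {
            return -1;
        }
        int b0 = input[pos] & 0xFF;
        if (b0 <= 0x7F) {
            return 1;                                          // [00-7F]
        }
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            return continuation(input, pos + 1, 1) ? 2 : -1;   // [C2-DF][80-BF]
        }
        if (b0 >= 0xE0 && b0 <= 0xEF) {
            return continuation(input, pos + 1, 2) ? 3 : -1;   // [E0-EF][80-BF]{2}
        }
        if (b0 >= 0xF0 && b0 <= 0xF4) {
            return continuation(input, pos + 1, 3) ? 4 : -1;   // [F0-F4][80-BF]{3}
        }
        return -1;
    }

    private static boolean continuation(byte[] input, int pos, int count) {
        for (int i = 0; i < count; i++) {
            if (pos + i >= input.length) {
                return false;
            }
            int b = input[pos + i] & 0xFF;
            if (b < 0x80 || b > 0xBF) {
                return false;
            }
        }
        return true;
    }
}
```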

sjamesr commented 8 years ago
sopel39 commented 8 years ago

Caliper benchmarks currently convert String inputs to UTF8 Slices, which causes overhead. We have run JMH benchmarks internally, and the worst cases (wildcard .) took roughly 2-2.5x as long as rune matching. Simple patterns (no fanout) were slightly faster.

We also have a POC Presto prototype with RE2J on Slices (but still with rune matching), and our test queries showed a noticeable improvement compared to String-based RE2J.

I will port the benchmarks so that inputs are static UTF8 Slices. That will give more comparable results.

sopel39 commented 8 years ago

Here are initial caliper results with byte matching: https://microbenchmarks.appspot.com/runs/52666b39-acd4-458e-9b99-49b2304ba69a#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.implementation

alandonovan commented 8 years ago

On 27 November 2015 at 11:04, Karol Sobczak notifications@github.com wrote:

Dear Community, As some of you know we at Teradata are planning to incorporate RE2J into Presto database (https://github.com/facebook/presto) [...] Our POC branch with RE2J on Slices is here: https://github.com/Teradata/re2j/tree/re2j_on_bytes. What do you think about our plan?

RE2J has users. How do you propose to change both the API and the regular expression language without breaking them?

It seems to me you are describing a fork of RE2J, not a modification of it.

sopel39 commented 8 years ago

Yes, currently it is a fork, but it is possible to support the String API again. To do so, program generation for UTF16 byte sequences is required; String data would then be copied to a Slice and matched. In the current master, UTF8 byte data needs to be converted to a String first, which is rather slow.

sopel39 commented 8 years ago

Here are the newest Caliper results using the DFA from the https://github.com/Teradata/re2j/tree/re2j_on_bytes branch:

https://microbenchmarks.appspot.com/runs/b7e3d122-784d-4279-bfc7-ddd8a81dd94e#r:scenario.benchmarkSpec.methodName,scenario.benchmarkSpec.parameters.implementation

alandonovan commented 8 years ago

Karol, let me be clear: first, you must not make incompatible changes to this package. It has many existing clients, some of which you cannot see, and if you make the changes described above, those programs will break.

Second, I am very uneasy about adding a dependency on the almost totally undocumented and terrifyingly unsafe-looking airlift/slice. What is the slice abstraction? Why not have RE2J define an interface that abstracts an input stream, and make airlift/slice implement that interface?
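
For concreteness, such an interface could look roughly like this (the names are illustrative only, not an existing RE2J API):

```java
// Illustrative only: a hypothetical byte-input abstraction that RE2J could
// define itself, with adapters supplied by callers. None of these names
// exist in RE2J today.
public interface ByteInput {
    int length();
    byte byteAt(int index);
}

// Plain array adapter, which RE2J could ship itself.
final class ArrayByteInput implements ByteInput {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    ArrayByteInput(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }
    @Override public byte byteAt(int index) { return bytes[offset + index]; }
}

// An airlift/slice adapter would live on the Presto/Teradata side, keeping
// the Slice dependency out of RE2J:
//
//   final class SliceByteInput implements ByteInput {
//       private final io.airlift.slice.Slice slice;
//       SliceByteInput(io.airlift.slice.Slice slice) { this.slice = slice; }
//       @Override public int length() { return slice.length(); }
//       @Override public byte byteAt(int index) { return slice.getByte(index); }
//   }
```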

sopel39 commented 8 years ago

@alandonovan that pull request was created by mistake (it goes to TD fork), I have closed it already.

I understand that you are reluctant to use Slice, but an interface is also not a great idea, because reading input takes a considerable share of CPU time. When an interface has multiple implementations, the JIT won't be able to inline the method calls and will have to use virtual calls. This causes performance degradation, which can be avoided by having just one concrete class.

sopel39 commented 8 years ago

Slice is convenient for us since it is used in Presto, so data doesn't have to be converted. Of course it is possible to replace Slice with another structure, but Slice provides some nice functionality: String<->Slice conversion, builders, support for off-heap memory, and so on. It aims for efficiency.

alandonovan commented 8 years ago

that pull request was created by mistake (it goes to TD fork), I have closed it already.

OK, never mind.

When an interface has multiple implementations, the JIT won't be able to inline the method calls and will have to use virtual calls. This causes performance degradation, which can be avoided by having just one concrete class.

I understand that dynamic calls come at a cost relative to static calls to a single concrete type, but if you intend this package for use outside Teradata, or perhaps for other purposes even within your company, you could make the API more general and useful by expressing it in terms of (byte[], start, length) triples rather than a complex (and unsafe) third-party package. I have my doubts that the cost of making a single bulk copy of the input is even measurable relative to the cost of running the matcher.
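
That is easy to check with a small benchmark; for example, a JMH sketch along these lines (not part of the RE2J benchmark suite) that compares stock RE2J matching over a String against just the bulk String-to-UTF8 copy:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import com.google.re2j.Matcher;
import com.google.re2j.Pattern;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class CopyVersusMatchBenchmark {
    private Pattern pattern;
    private String input;

    @Setup
    public void setup() {
        pattern = Pattern.compile("[a-z]+@[a-z]+\\.com");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("lorem ipsum dolor sit amet ");
        }
        sb.append("someone@example.com");
        input = sb.toString();
    }

    /** The matcher alone: stock RE2J over the String. */
    @Benchmark
    public boolean matchOnly() {
        Matcher m = pattern.matcher(input);
        return m.find();
    }

    /** The single bulk copy alone (String -> UTF8 bytes), for comparison. */
    @Benchmark
    public int copyOnly() {
        return input.getBytes(StandardCharsets.UTF_8).length;
    }
}
```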

BTW, my concerns about airlift/slice's safety are not academic; see Paul Sandoz of Oracle's "Safety Not Guaranteed" slide deck (http://cr.openjdk.java.net/~psandoz/dv14-uk-paul-sandoz-unsafe-the-situation.pdf).

tcha8595 commented 8 years ago

Unsafe is still up in the air, especially with Java 9 on the way (despite delays). It has been a hot topic because of Jigsaw, and lots of projects relying on Unsafe have had to make changes. There is talk of a public replacement API, but that again means change and uncertainty, not a clear way forward.

sopel39 commented 8 years ago

you could make the API more general and useful by expressing it in terms of (byte[], start, length) triples

I thought about using triples, but you would have to pass them around in multiple places, which would make the code more complex and less readable. Additionally, we need support for a slice/substring type of operation. I agree that Slice is probably too much for RE2J. When we have free cycles, we might introduce a safer structure based on a byte array instead of Slice.
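
Roughly, what I have in mind is a thin wrapper over (byte[], offset, length) that still supports zero-copy slicing, something like this (illustrative only, not code from our branch):

```java
// Illustrative only: a minimal, bounds-checked wrapper over a byte[] that
// supports zero-copy slicing, as a possible replacement for airlift Slice.
public final class ByteSlice {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    private ByteSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    public static ByteSlice wrap(byte[] bytes) {
        return new ByteSlice(bytes, 0, bytes.length);
    }

    public int length() {
        return length;
    }

    public byte getByte(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index: " + index + ", length: " + length);
        }
        return bytes[offset + index];
    }

    /** Zero-copy view of a sub-range; no bytes are moved. */
    public ByteSlice slice(int from, int len) {
        if (from < 0 || len < 0 || from + len > length) {
            throw new IndexOutOfBoundsException("from: " + from + ", len: " + len);
        }
        return new ByteSlice(bytes, offset + from, len);
    }
}
```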

BTW, my concerns about airlift/slice's safety are not academic; see Paul Sandoz of Oracle's "Safety Not Guaranteed" slide deck

While Slice may not be the nicest class, it works very well in practice. When you consider the use cases Slice has to support while preserving performance, Unsafe is probably the preferred choice.

I have my doubts that the cost of making a single bulk copy of the input is even measurable relative to the cost of running the matcher.

Imagine large documents where the matching pattern is at the beginning of the text. You could also have large documents for which you can quickly determine that they don't match the pattern by looking at the first few bytes. To my knowledge, those could be the use cases at Facebook. We ran benchmarks: without copying, a query runs for a few seconds, compared to a few minutes for queries that make a copy of each input row.