totpero / DeviceDetector.NET

The Universal Device Detection library will parse any User Agent and detect the browser, operating system, device used (desktop, tablet, mobile, tv, cars, console, etc.), brand and model.
Apache License 2.0

Replaced MS Regex with Pcre #19

markatosi closed this issue 5 years ago

markatosi commented 5 years ago

I replaced the Microsoft Regex calls with the library from https://github.com/ltrzesniewski/pcre-net

This resulted in a nearly 4x speed increase on my development machine with 1 thread and a 5.5x increase with 8 threads, which seems to be on par with the PHP version of Device Detector. This is an important performance improvement if you need to parse large numbers of user agents on a regular basis.

I'm not advocating that you do this in your project; I'm just mentioning it in passing for anyone who requires faster performance.

I did not change any code other than replacing all MS Regex calls with their equivalent Pcre calls. I ran this test on a 2019 iMac (3.6 GHz Core i9) using one thread. The source data file contains 3,576,720 unique agent strings.

Using standard regex

analyzing agents....
Lines Count: 3,576,720 thread: 0
1,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:16s:583ms
2,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:15s:772ms
3,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:15s:506ms
4,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:16s:081ms
5,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:15s:897ms
6,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:15s:416ms

Using Pcre

analyzing agents....
Lines Count: 3,576,720 thread: 0
1,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:199ms
2,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:231ms
3,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:242ms
4,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:198ms
5,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:280ms
6,000 of 3,576,720 done Thread: 0 in 00d:00h:00m:04s:288ms

With 8 threads, the MS Regex version processes 8,000 agents in 39 seconds, while the Pcre version processes 8,000 agents in 7 seconds.
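
As a quick sanity check on the quoted speedups, the sketch below averages the six per-batch times from each single-thread log and divides the 8-thread totals. The class and member names (`SpeedupCheck`, `Speedup`) are made up for illustration; only the timing values come from this thread. It should print roughly 3.7x for one thread and roughly 5.6x for eight.

```csharp
using System;
using System.Linq;

public static class SpeedupCheck
{
    // Seconds per 1,000-agent batch, copied from the two single-thread logs in this thread.
    public static readonly double[] MsRegexSeconds = { 16.583, 15.772, 15.506, 16.081, 15.897, 15.416 };
    public static readonly double[] PcreSeconds = { 4.199, 4.231, 4.242, 4.198, 4.280, 4.288 };

    // Ratio of mean batch times: how many times faster the candidate is than the baseline.
    public static double Speedup(double[] baseline, double[] candidate)
        => baseline.Average() / candidate.Average();

    public static void Main()
    {
        // Single-thread comparison, from the per-batch timings.
        Console.WriteLine($"1 thread:  {Speedup(MsRegexSeconds, PcreSeconds):F2}x");

        // 8-thread comparison, from the quoted totals (39 s vs 7 s for 8,000 agents).
        Console.WriteLine($"8 threads: {39.0 / 7.0:F2}x");
    }
}
```

These are only six batches per engine on one machine, so the ratios are indicative rather than a rigorous benchmark.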

Your mileage may vary but I'm pretty darn happy with this improvement.

totpero commented 5 years ago

Hi @markatosi, thanks for your suggestion. I will investigate this 👍

ghost commented 5 years ago

@totpero, any update on this? Should this suggestion be considered a viable option for faster parsing?

totpero commented 5 years ago

Hi @markatosi, I just pushed some changes. You can now use your own regex implementation: I created an IRegexEngine interface, and MsRegexEngine is used by default if no engine is set, but you can replace it with your own implementation. I have also implemented a PcreRegexEngine in a separate project.

You can use it like this:

var deviceDetector = new DeviceDetector(ua);
deviceDetector.SetRegexEngine(new PcreRegexEngine());

Or in every parser like this:

var botParser = new BotParser();
botParser.SetRegexEngine(new PcreRegexEngine());
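
For anyone curious how this kind of pluggable engine can be wired up, here is a minimal, self-contained sketch of the pattern. It is not the library's code: `ISketchRegexEngine`, `SketchMsRegexEngine`, `CountingEngine`, `SketchBotParser` and the bot pattern are all hypothetical names made up for illustration, and the real IRegexEngine interface may declare different members.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for an engine interface; NOT the library's actual IRegexEngine.
public interface ISketchRegexEngine
{
    bool IsMatch(string input, string pattern);
}

// Default engine backed by System.Text.RegularExpressions.
public sealed class SketchMsRegexEngine : ISketchRegexEngine
{
    public bool IsMatch(string input, string pattern)
        => Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);
}

// Example replacement engine: wraps another engine and counts calls.
// A PCRE-backed engine would plug in the same way.
public sealed class CountingEngine : ISketchRegexEngine
{
    private readonly ISketchRegexEngine _inner;
    public int Calls { get; private set; }

    public CountingEngine(ISketchRegexEngine inner) => _inner = inner;

    public bool IsMatch(string input, string pattern)
    {
        Calls++;
        return _inner.IsMatch(input, pattern);
    }
}

// Hypothetical parser: uses the default engine unless another one is set.
public sealed class SketchBotParser
{
    private ISketchRegexEngine _engine = new SketchMsRegexEngine();

    public void SetRegexEngine(ISketchRegexEngine engine)
        => _engine = engine ?? throw new ArgumentNullException(nameof(engine));

    // Illustrative pattern only; the real library loads its patterns from data files.
    public bool IsBot(string userAgent)
        => _engine.IsMatch(userAgent, "bot|crawler|spider");
}

public static class SketchDemo
{
    public static void Main()
    {
        var parser = new SketchBotParser();
        var counting = new CountingEngine(new SketchMsRegexEngine());
        parser.SetRegexEngine(counting); // swap the engine without touching parser logic

        Console.WriteLine(parser.IsBot("Mozilla/5.0 (compatible; Googlebot/2.1)"));
        Console.WriteLine(parser.IsBot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"));
        Console.WriteLine($"Engine calls: {counting.Calls}");
    }
}
```

The point of the design is that parsers depend only on the interface, so a faster engine can be dropped in per detector or per parser, as in the snippets above.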

With my PcreRegexEngine implementation, not all tests pass. If I missed something or you have something to add, feel free to contribute.

Thanks