tobie / ua-parser

A multi-language port of Browserscope's user agent parser.
Other
1.97k stars 499 forks source link

Optimizing the order of the YAML file #162

Closed 3rd-Eden closed 6 years ago

3rd-Eden commented 11 years ago

I have been spending some quality time with my useragent parser module for Node.js to try and squeeze out every bit of performance possible. While it's already the fastest parser out there, the majority of the performance slowdown comes from iterating over the YAML file's regular expressions and executing it on the given user agent string. I experimented with changing the order of the and it had positive effects on the benchmarks.

So I'm wondering if there are ways to safely optimize the order of the regular expression and optimize for the general browsers first by placing them higher in the YAML file.

tobie commented 11 years ago

Ideally we should base that on real data. @elsigh think you could get that from browserscope, for example?

selwin commented 11 years ago

@tobie I don't think Browserscope's data will provide an accurate representation of browser/platform marketshare of the web as a whole. Net Application's data should be more representative and is freely available.

tobie commented 11 years ago

Good point.

3rd-Eden commented 11 years ago

Net Application seems to have different data then statcounter http://gs.statcounter.com/;

Screen Shot 2013-01-11 at 16 48 26

tobie commented 11 years ago

The goal here is to see if there's a lightweight way to organize this without compromising the data or making further updates complicated. Micro-changes to the regexp order will not have a significant impact on perf and we should not waste time on them. Eg. moving the IE regexp from position 100 to position 3 makes a lot of difference. From 3 to 2, not so much. :)

elsigh commented 11 years ago

Fundamentally, ua-parser could be written to be much faster by doing string tree branching, but that doesn't lend itself really well to a shared, readable YAML file. Part of the reason I believe this project works and is being maintained and now exists in so many languages is that its fairly easy to grok - and easy to just add new regexes and whatnot.

I'm not against re-ordering the YAML file provided that the tests don't break, but I'd like to see data that shows how much faster this is. None of the implementations (I believe) scan the YAML file on every parse. They should be able to put the YAML (or JSON) file into memory and then be iterating over an array of in-memory regexes.

tobie commented 11 years ago

AFAIK, @3rd-Eden's suggestion is to order the regexp by most common UA first. I think that could make sense if it's kept reasonable and could yield substancial speed-ups on real world data.

elsigh commented 11 years ago

Indeed - though we can't just put most common UA first afaik, because there's a whole bunch of exceptions that have to come first, although - the regexes could be written to catch the most common UAs first (ala the notion of string branching) =) that's probably what you're thinking.. Sweet!!

On Fri, Jan 11, 2013 at 2:27 PM, Tobie Langel notifications@github.comwrote:

AFAIK, @3rd-Eden https://github.com/3rd-Eden's suggestion is to order the regexp by most common UA first. I think that could make sense if it's kept reasonable and could yield substancial speed-ups on real world data.

— Reply to this email directly or view it on GitHubhttps://github.com/tobie/ua-parser/issues/162#issuecomment-12166664.

selwin commented 11 years ago

Is there documentation on how to run user agent tests somewhere? I'd like to tinker with this a bit and just want to make sure that my changes don't cause any regressions.

tobie commented 11 years ago

You'll want to have a look at: https://github.com/tobie/ua-parser/blob/master/py/ua_parser/user_agent_parser_test.py

garrypolley commented 9 years ago

Thinking about this from the perspective of reading in thousands to millions of log files, having the UA strings ordered in the yaml file by most common would be awesome.

The ordering I am I talking about, and I think others are too, is to put the most common first in the list. No fancy tree branching, just a list that is ordered by most common UA string first.

Some possible issues and/or pain points of this update:

  1. If the order currently matters right now for correct detection.
  2. If there are a lot of tests that depend on the order of the UA strings in the yaml file.
  3. Showing with empirical data that ordering does give a performance boost.
Ironholds commented 9 years ago

We're gathering number 3, and yes, the order definitely matters right - that's why we have to accept either additional complexity (which is fine. It's the sort of thing JSON and YAML are designed for) or inaccuracy.

garrypolley commented 9 years ago

Thanks to some work by some team members we did see massive performance gains by memoizing the parse lookup.

On an individual basis it may be faster to change the ordering of the regex statements. However, once we did the memoizing the change of order was not a measurable difference in speed. Basically we keep it all in the same process while we look at the 100 thousand lines or so in the logs.

I'm not sure how much performance can be gained if a consumer still calls the regex match for each time. Doing the memoizing lowered the time from about 10 seconds to 1 second. Changing the order of the regexes almost didn't have an effect at that point.

Ironholds commented 9 years ago

So, why don't we memoise and cache rather than changing the order? ;)

garrypolley commented 9 years ago

So, why don't we memoise and cache rather than changing the order? ;)

@Ironholds I see two use cases:

  1. Calling the parse to get the useragent string one time across many processes (not sure memoizing can happen here).
  2. Calling the parse to get the useragent string many times in one process

I think point 2 is a good candidate for memoizing, and it could potentially be added to the Parse class. I think point 1 may not benefit too much with either change.

My biggest concern with adding the memoizing to the Parse class would be thread safety. I don't think it matters though, because the definition of a useragent string won't be different between threads.

mhemesath commented 9 years ago

I think memoizing is a good idea, if there are any concerns it could be off or on by default and allow an override.

Ironholds commented 9 years ago

Yep, although if we're talking about the Python implementation specifically, implementations are meant to follow a common standard. I'm not sure if that applies to things that don't change the user outcome. @tobie ?

tobie commented 9 years ago

Implementations do whatever they want as long as they pass the test suite.

Ironholds commented 9 years ago

Alrighty! @mhemesath , patch away :D