Closed 3rd-Eden closed 6 years ago
Ideally we should base that on real data. @elsigh think you could get that from browserscope, for example?
@tobie I don't think Browserscope's data will provide an accurate representation of browser/platform marketshare of the web as a whole. Net Application's data should be more representative and is freely available.
Good point.
Net Application seems to have different data then statcounter http://gs.statcounter.com/;
The goal here is to see if there's a lightweight way to organize this without compromising the data or making further updates complicated. Micro-changes to the regexp order will not have a significant impact on perf and we should not waste time on them. Eg. moving the IE regexp from position 100 to position 3 makes a lot of difference. From 3 to 2, not so much. :)
Fundamentally, ua-parser could be written to be much faster by doing string tree branching, but that doesn't lend itself really well to a shared, readable YAML file. Part of the reason I believe this project works and is being maintained and now exists in so many languages is that its fairly easy to grok - and easy to just add new regexes and whatnot.
I'm not against re-ordering the YAML file provided that the tests don't break, but I'd like to see data that shows how much faster this is. None of the implementations (I believe) scan the YAML file on every parse. They should be able to put the YAML (or JSON) file into memory and then be iterating over an array of in-memory regexes.
AFAIK, @3rd-Eden's suggestion is to order the regexp by most common UA first. I think that could make sense if it's kept reasonable and could yield substancial speed-ups on real world data.
Indeed - though we can't just put most common UA first afaik, because there's a whole bunch of exceptions that have to come first, although - the regexes could be written to catch the most common UAs first (ala the notion of string branching) =) that's probably what you're thinking.. Sweet!!
On Fri, Jan 11, 2013 at 2:27 PM, Tobie Langel notifications@github.comwrote:
AFAIK, @3rd-Eden https://github.com/3rd-Eden's suggestion is to order the regexp by most common UA first. I think that could make sense if it's kept reasonable and could yield substancial speed-ups on real world data.
— Reply to this email directly or view it on GitHubhttps://github.com/tobie/ua-parser/issues/162#issuecomment-12166664.
Is there documentation on how to run user agent tests somewhere? I'd like to tinker with this a bit and just want to make sure that my changes don't cause any regressions.
You'll want to have a look at: https://github.com/tobie/ua-parser/blob/master/py/ua_parser/user_agent_parser_test.py
Thinking about this from the perspective of reading in thousands to millions of log files, having the UA strings ordered in the yaml file by most common would be awesome.
The ordering I am I talking about, and I think others are too, is to put the most common first in the list. No fancy tree branching, just a list that is ordered by most common UA string first.
Some possible issues and/or pain points of this update:
We're gathering number 3, and yes, the order definitely matters right - that's why we have to accept either additional complexity (which is fine. It's the sort of thing JSON and YAML are designed for) or inaccuracy.
Thanks to some work by some team members we did see massive performance gains by memoizing the parse lookup.
On an individual basis it may be faster to change the ordering of the regex statements. However, once we did the memoizing the change of order was not a measurable difference in speed. Basically we keep it all in the same process while we look at the 100 thousand lines or so in the logs.
I'm not sure how much performance can be gained if a consumer still calls the regex match for each time. Doing the memoizing lowered the time from about 10 seconds to 1 second. Changing the order of the regexes almost didn't have an effect at that point.
So, why don't we memoise and cache rather than changing the order? ;)
So, why don't we memoise and cache rather than changing the order? ;)
@Ironholds I see two use cases:
parse
to get the useragent string one time across many processes (not sure memoizing can happen here).parse
to get the useragent string many times in one processI think point 2 is a good candidate for memoizing, and it could potentially be added to the Parse
class. I think point 1 may not benefit too much with either change.
My biggest concern with adding the memoizing to the Parse
class would be thread safety. I don't think it matters though, because the definition of a useragent string won't be different between threads.
I think memoizing is a good idea, if there are any concerns it could be off or on by default and allow an override.
Yep, although if we're talking about the Python implementation specifically, implementations are meant to follow a common standard. I'm not sure if that applies to things that don't change the user outcome. @tobie ?
Implementations do whatever they want as long as they pass the test suite.
Alrighty! @mhemesath , patch away :D
I have been spending some quality time with my useragent parser module for Node.js to try and squeeze out every bit of performance possible. While it's already the fastest parser out there, the majority of the performance slowdown comes from iterating over the YAML file's regular expressions and executing it on the given user agent string. I experimented with changing the order of the and it had positive effects on the benchmarks.
So I'm wondering if there are ways to safely optimize the order of the regular expression and optimize for the general browsers first by placing them higher in the YAML file.