opawg / user-agents

An open, platform-agnostic list of user-agent and referrer regexes for use in podcast analytics services
MIT License

Bad escaping for numerics #123

Closed by jdelStrother 1 year ago

jdelStrother commented 1 year ago

Heya - as of https://github.com/opawg/user-agents/commit/03595565ade2639ff8ba6b60c9281c2dfdffc396, it seems like we lost most/all of the use of \d+ to represent a number

As an example, before that commit, one of the Apple Podcasts' useragent matchers was

    "user_agents": [
         "^Podcasts/.*\\d$",
         "^Balados/.*\\d$",
         "^Podcasti/.*\\d$",
         "^Podcastit/.*\\d$",
         "^Podcasturi/.*\\d$",
         "^Podcasty/.*\\d$",
         "^Podcast’ler/.*\\d$",
         "^Podkaster/.*\\d$",
         "^Podcaster/.*\\d$",
         "^Podcastok/.*\\d$",
         "^Подкасти/.*\\d$",
         "^Подкасты/.*\\d$",
         "^פודקאסטים/.*\\d$",
         "^البودكاست/.*\\d$",
         "^पॉडकास्ट/.*\\d$",
         "^พ็อดคาสท์/.*\\d$",
         "^%E6%92%AD%E5%AE%A2/.*\\d$",
         "^播客/.*\\d$",
         "^팟캐스트/.*\\d$"
     ],

the same matcher is now:

    "user_agents": [
      "^Podcasts\/.*d$",
      "^Balados\/.*d$",
      "^Podcasti\/.*d$",
      "^Podcastit\/.*d$",
      "^Podcasturi\/.*d$",
      "^Podcasty\/.*d$",
      "^Podcast\u2019ler\/.*d$",
      "^Podkaster\/.*d$",
      "^Podcaster\/.*d$",
      "^Podcastok\/.*d$",
      "^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u0438\/.*d$",
      "^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u044b\/.*d$",
      "^\u05e4\u05d5\u05d3\u05e7\u05d0\u05e1\u05d8\u05d9\u05dd\/.*d$",
      "^\u0627\u0644\u0628\u0648\u062f\u0643\u0627\u0633\u062a\/.*d$",
      "^\u092a\u0949\u0921\u0915\u093e\u0938\u094d\u091f\/.*d$",
      "^\u0e1e\u0e47\u0e2d\u0e14\u0e04\u0e32\u0e2a\u0e17\u0e4c\/.*d$",
      "^%E6%92%AD%E5%AE%A2\/.*d$",
      "^\u64ad\u5ba2\/.*d$",
      "^\ud31f\uce90\uc2a4\ud2b8\/.*d$"
      ],

e.g. "^Podcasts\/.*d$" only matches user agents ending with a literal "d", not a digit.
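A minimal sketch of the difference, decoding both JSON string literals and trying them against two user agents. The user-agent strings are made up for illustration and are not taken from the repo:

```python
import json
import re

# The JSON string literals as they appear on disk. Raw strings keep the
# backslashes intact so json.loads sees exactly what the file contains.
before = json.loads(r'"^Podcasts/.*\\d$"')  # decodes to ^Podcasts/.*\d$
after = json.loads(r'"^Podcasts\/.*d$"')    # decodes to ^Podcasts/.*d$

# Hypothetical user agents, for illustration only.
ua_digit = "Podcasts/1575.1.2 CFNetwork/1237 Darwin/20.4.0"
ua_letter = "Podcasts/abcd"

for label, pattern in (("before", before), ("after", after)):
    print(
        label,
        pattern,
        bool(re.search(pattern, ua_digit)),   # ends with a digit
        bool(re.search(pattern, ua_letter)),  # ends with the letter d
    )
```

The "before" pattern matches only the user agent ending in a digit; the "after" pattern matches only the one ending in the letter d.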

I'd also suggest the \/ in there is a bit weird - it's harmless, but in JSON "/" and "\/" are equivalent, AFAIK.
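A quick way to check the equivalence, sketched with Python's standard json module. It also shows why the digit class has to be written as \\d in the file: \/ is one of the escapes JSON allows, but a single-backslash \d is not.

```python
import json

# "\/" is a permitted JSON escape and decodes to a plain slash.
plain = json.loads(r'"^Podcasts/"')
escaped = json.loads(r'"^Podcasts\/"')
same = plain == escaped
print(same)

# Python's encoder writes the slash back out unescaped.
print(json.dumps(plain))

# A single-backslash \d is not a valid JSON escape, so a strict parser
# rejects it; the regex digit class must be stored as \\d.
try:
    json.loads(r'"\d"')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False
print(single_backslash_ok)
```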

johnspurlock commented 1 year ago

Thanks - would you be willing to put together a PR for the \\d regressions specifically?

I'm in the middle of refactoring this user-agents list at the moment (for use in op3.dev), which changes quite a few of the quirks in its current form (including removing the unnecessary \/, lookaheads, and duplicates), but it's probably a few days out from being ready to use.

johnspurlock commented 1 year ago

Thanks! You may be interested in the new version of this repo, which does not have these issues (I double-checked each one in your PR) and includes automated testing against found examples to prevent regressions like this.

👉 user-agents-v2
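For illustration, a rough sketch of what testing patterns against found examples could look like. The entry shape (name, pattern, examples), the helper failing_examples, and the sample user agent are all assumptions made for this sketch, not the actual layout or tooling of user-agents-v2:

```python
import re

# Hypothetical entries; field names and the example user agent are assumptions.
entries = [
    {
        "name": "Apple Podcasts",
        "pattern": r"^Podcasts/.*\d$",
        "examples": ["Podcasts/1575.1.2 CFNetwork/1237 Darwin/20.4.0"],
    },
]


def failing_examples(entries):
    """Return (name, example) pairs whose example is not matched by its pattern."""
    failures = []
    for entry in entries:
        regex = re.compile(entry["pattern"])
        for example in entry["examples"]:
            if not regex.search(example):
                failures.append((entry["name"], example))
    return failures


# The intact pattern passes; the regressed pattern from this issue is caught.
regressed = [dict(entries[0], pattern=r"^Podcasts/.*d$")]
print(failing_examples(entries))
print(failing_examples(regressed))
```

Run in CI, a check along these lines would flag a lost \d as soon as a stored example stops matching.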