opawg / user-agents-v2

Comprehensive open-source collection of broadly-compatible regular expression patterns to identify and analyze podcast player user agents.
MIT License

Please review and add new Apple User Agents #14

Closed sherlockbonez closed 6 months ago

sherlockbonez commented 7 months ago

We are seeing an uptick in AppleCoreMedia user agents for iPhone, iPad, and Apple TV. These aren't included in the OPAWG2 list and are therefore missing device and application categorization.

iPhone:
AppleCoreMedia//1.0.0.21B91 (iPhone; U; CPU OS 17_1_1 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.19H370 (iPhone; U; CPU OS 15_8 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.20H115 (iPhone; U; CPU OS 16_7_2 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.20B101 (iPhone; U; CPU OS 16_1_1 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.21B91 (iPhone; U; CPU OS 17_1_1 like Mac OS X; en_ca)
AppleCoreMedia//1.0.0.20D67 (iPhone; U; CPU OS 16_3_1 like Mac OS X; en_us)
AppleCoreMedia//1.0.0.20G75 (iPhone; U; CPU OS 16_6 like Mac OS X; en_us)

iPad:
AppleCoreMedia//1.0.0.21B91 (iPad; U; CPU OS 17_1_1 like Mac OS X; en_us)

Apple TV:
AppleCoreMedia//1.0.0.21K69 (Apple TV; U; CPU OS 17_1 like Mac OS X; en_us)

For the iPhone and iPad entries, is there any indication whether they are accessed via an app or through the Safari browser?

johnspurlock commented 7 months ago

Did you mean to include the // double slashes? I'm not sure Apple has ever done that before. Not a single download to an OP3-measured show in the last year with those.

Maybe someone trying to spoof?

Do you have all the HTTP headers from one of the requests?

Also take a look at the IP addresses and see whether they come from cloud provider ranges.

knoxmic commented 7 months ago

Where exactly does the data come from? So far I have only seen the // double slashes from the access logs from AIS (AdsWizz).

sherlockbonez commented 6 months ago

Yes, these are from AIS session and access logs. Below is what we see when testing user agent strings that have a double // against those that have only a single /.

The pattern in devices.json should capture the iPhone in the user agent strings tested below and categorize them as "Apple iPhone", but this only happens for the single / entry.

      "name": "Apple iPhone",
      "pattern": "iphone|iOS|iPhone|CFNetwork| ios |phone;ios",
      "category": "mobile",

Here we try and run a test against the double //

--checkUserAgent 'AppleCoreMedia//1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)'
Loaded UserAgent patterns from /etc/user-agents/bots.json
Loaded UserAgent patterns from /etc/user-agents/apps.json
Loaded UserAgent patterns from /etc/user-agents/libraries.json
Loaded UserAgent patterns from /etc/user-agents/browsers.json
Loaded UserAgent patterns from /etc/user-agents/devices.json
Loaded UserAgent patterns from /etc/user-agents/referrers.json
User Agent was not found in database

No match

Here we try the test for the single /

--checkUserAgent 'AppleCoreMedia/1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)'
Loaded UserAgent patterns from /etc/user-agents/bots.json
Loaded UserAgent patterns from /etc/user-agents/apps.json
Loaded UserAgent patterns from /etc/user-agents/libraries.json
Loaded UserAgent patterns from /etc/user-agents/browsers.json
Loaded UserAgent patterns from /etc/user-agents/devices.json
Loaded UserAgent patterns from /etc/user-agents/referrers.json
{
  "name" : "AppleCoreMedia",
  "type" : "library",
  "device_name" : "Apple iPhone",
  "device_category" : "mobile",
  "referrer_name" : null,
  "referrer_category" : null,
  "is_bot" : false
}

Given the above test, the user agent with the single / is matching the library record. The devices patterns are only enhancements per the directions: https://github.com/opawg/user-agents-v2/tree/3f3a7e75270c5f7807de64e80013d3e0a1cf14bc#quick-start

The devices file is only consulted if the user agent first matches an entry in one of: bots, apps, libraries, or browsers.
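A minimal Python sketch of that two-stage lookup, to show why the device pattern never gets a chance to run on the double-slash string. The two patterns are the ones quoted in this thread; the data layout, the function names `first_match` and `classify`, and the use of a case-sensitive `re.search` are assumptions for illustration, not the project's actual implementation.

```python
import re

# Patterns quoted in this thread. The list layout is illustrative only.
LIBRARIES = [
    {"name": "AppleCoreMedia", "pattern": r"^AppleCoreMedia/1"},
]
DEVICES = [
    {"name": "Apple iPhone",
     "pattern": r"iphone|iOS|iPhone|CFNetwork| ios |phone;ios",
     "category": "mobile"},
]

def first_match(entries, user_agent):
    """Return the first entry whose pattern matches, or None."""
    for entry in entries:
        if re.search(entry["pattern"], user_agent):
            return entry
    return None

def classify(user_agent):
    """Hypothetical two-stage lookup: primary match first, devices second."""
    primary = first_match(LIBRARIES, user_agent)
    if primary is None:
        # No primary hit, so the devices patterns are never consulted.
        return None
    device = first_match(DEVICES, user_agent)
    return {
        "name": primary["name"],
        "type": "library",
        "device_name": device["name"] if device else None,
        "device_category": device["category"] if device else None,
    }

single = "AppleCoreMedia/1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)"
double = "AppleCoreMedia//1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)"
print(classify(single))
print(classify(double))
```

Under these assumptions the double-slash string returns nothing at all, even though the device pattern on its own would match "iPhone".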

The pattern actually being matched is from the libraries.json here: https://github.com/opawg/user-agents-v2/blob/3f3a7e75270c5f7807de64e80013d3e0a1cf14bc/src/libraries.json#L23

"pattern": "^AppleCoreMedia/1",

So by default it only matches the single-slash version of the AppleCoreMedia user agent strings. Our code and parsing logic work; it's just the pattern that's missing. Patterns need to account for double forward slashes. From reviewing our AdsWizz access and session logs, it appears that all user agent strings contain // where a single / would normally be found. Here are some examples:

"AppleCoreMedia//1.0.0.20H115 (iPhone; U; CPU OS 16_7_2 like Mac OS X; es_xl)"
"Mozilla//5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit//537.36 (KHTML, like Gecko) Chrome//81.0.4044.113 Safari//537.36"
"Roku//DVP-12.5 (12.5.0.4178-91)"
"Dalvik//2.1.0 (Linux; U; Android 13; SM-G770F Build//TP1A.220624.014)"
"Echo//1.0(APNG)"
"AppleCoreMedia//1.0.0.21K69 (Apple TV; U; CPU OS 17_1 like Mac OS X; en_us)"

Could we add an optional second slash to the patterns?

For example, ^AppleCoreMedia//?1 for the AppleCoreMedia entries.
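To make the difference concrete, here is a small Python check of the current pattern against the proposed one, using two strings from this thread. The variable names are illustrative, and Python's `re` module is only assumed to behave like whatever regex engine a consumer of the list uses.

```python
import re

current = re.compile(r"^AppleCoreMedia/1")     # pattern in libraries.json
proposed = re.compile(r"^AppleCoreMedia//?1")  # second slash made optional

samples = [
    "AppleCoreMedia/1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)",
    "AppleCoreMedia//1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)",
]

# Each result is a (current matches, proposed matches) pair.
results = [(bool(current.search(ua)), bool(proposed.search(ua))) for ua in samples]
for ua, (cur, prop) in zip(samples, results):
    print(f"current={cur} proposed={prop}  {ua}")
```

The proposed pattern accepts both forms, while the current one accepts only the single-slash form.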

knoxmic commented 6 months ago

The actual requests, i.e. the clients, send the user agent without //. AIS is the problem here, as this user agent is stored in the logs in a modified form.

If we do not get a hit, we simply replace each duplicate // with / in the log data and check again.
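A rough Python sketch of that retry, assuming a matcher function that returns None on a miss. `match_library` is a hypothetical stand-in built from the libraries.json pattern quoted earlier in this thread, and `match_with_fallback` is an illustrative name, not part of this project.

```python
import re

def match_library(user_agent):
    """Hypothetical stand-in for a real lookup against the pattern files."""
    return "AppleCoreMedia" if re.search(r"^AppleCoreMedia/1", user_agent) else None

def match_with_fallback(user_agent, matcher=match_library):
    """Try the logged value as-is, then retry with // collapsed to /."""
    hit = matcher(user_agent)
    if hit is None and "//" in user_agent:
        # Only rewrite on a miss, so values that already match are left untouched.
        hit = matcher(user_agent.replace("//", "/"))
    return hit

logged = "AppleCoreMedia//1.0.0.20G81 (iPhone; U; CPU OS 16_6_1 like Mac OS X; en_us)"
print(match_with_fallback(logged))
```

Doing the replacement only after a miss is a deliberate choice in this sketch: a blanket replacement would also alter any legitimate // in a user agent, such as a URL some clients include.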

johnspurlock commented 6 months ago

Yes, I think keeping this project focused on the HTTP User-Agent header value is what we want to do here. If your system is adding or escaping slashes after the fact, you can get back to the actual value using a method similar to what @knoxmic suggests.