opawg / user-agents-v2

Comprehensive open-source collection of broadly-compatible regular expression patterns to identify and analyze podcast player user agents.
MIT License

Spotify Web comes up as the user's browser instead of Spotify #9

Closed redimongo closed 11 months ago

redimongo commented 11 months ago

Not sure if this can be fixed, but I noticed that when a user accesses a podcast hosted on our system through the website version of Spotify, it is not being logged as Spotify.

Here is what is being logged:

{
  "_id": {
    "$oid": "64d1f78a1ac45075b24feaa8"
  },
  "podcast_id": {
    "$oid": "64a3266f886b38714bac5c0a"
  },
  "episode_id": {
    "$oid": "64a32679886b38714bac5c0b"
  },
  "type": "full_episode",
  "headers": {
    "host": "tracker.podtoo.com",
    "x-forwarded-scheme": "https",
    "x-forwarded-proto": "https",
    "x-forwarded-for": "REMOVED",
    "x-real-ip": "REMOVED",
    "connection": "close",
    "accept": "*/*",
    "sec-fetch-site": "cross-site",
    "x-playback-session-id": "64E63475-32D1-4965-A936-268D91076E41",
    "accept-language": "en-AU,en;q=0.9",
    "accept-encoding": "identity",
    "sec-fetch-mode": "no-cors",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15",
    "referer": "https://open.spotify.com/",
    "range": "bytes=0-1",
    "sec-fetch-dest": "video"
  },
  "user-agent": {
    "browsers": "Safari",
    "devices": "Apple Computer"
  },
  "timestamp": {
    "$date": "2023-08-08T08:06:34.435Z"
  },
  "user_id": "REMOVED"
}

Is there any way we could add a check that matches the referer, i.e. "referer": "https://open.spotify.com/"?

I know the library is meant to be able to do this:

{
  "name": "Spotify",
  "pattern": "https://(open|api-partner)\\.spotify\\.com",
  "examples": [
    "https://open.spotify.com/show/3vhBp6pPJEYgGfOXGU8ogu",
    "https://open.spotify.com/",
    "https://api-partner.spotify.com/"
  ],
  "category": "app"
},

Not sure why it is not being picked up.

Here is how we use this

const matchUserAgent = (userAgent) => {
  const jsonFiles = ['bots', 'apps', 'libraries', 'browsers', 'devices', 'referrers'];
  const userAgentData = getUserAgentData();

  const matchedData = {};

  for (const jsonFile of jsonFiles) {
    const data = userAgentData[jsonFile];
    if (data && data.entries) {
      for (const entry of data.entries) {
        const pattern = new RegExp(entry.pattern);
        const match = userAgent.match(pattern);
        if (match) {
          matchedData[jsonFile] = entry.name;
          break;
        }
      }
    }
  }

  return matchedData;
};

Maybe it has to do with

if (match) {
  matchedData[jsonFile] = entry.name;
  break;
}

since the loop is clearly stopping there.

redimongo commented 11 months ago

I think I just noticed a rookie mistake: the referrers check needs to be looking at

"headers": {
  "referer": "https://open.spotify.com/"
}

and not at the user agent. I'll update my code so that it does.
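
A minimal sketch of that change, assuming the same data shape as the snippet above (each pattern file exposes an `entries` array of `{ name, pattern }` objects). The `matchRequest` name and the `patternData` parameter are hypothetical, and the sample entries are trimmed-down stand-ins for illustration, not the real pattern files.

```javascript
// Hypothetical rework of matchUserAgent: the 'referrers' file is tested
// against the Referer header, every other file against the User-Agent.
const matchRequest = (headers, patternData) => {
  const jsonFiles = ['bots', 'apps', 'libraries', 'browsers', 'devices', 'referrers'];
  const matchedData = {};

  for (const jsonFile of jsonFiles) {
    const data = patternData[jsonFile];
    if (!data || !data.entries) continue;

    // Pick the header this pattern file is meant to be matched against.
    const input = jsonFile === 'referrers' ? headers['referer'] : headers['user-agent'];
    if (!input) continue;

    // First matching entry wins, as in the original loop.
    const entry = data.entries.find((e) => new RegExp(e.pattern).test(input));
    if (entry) matchedData[jsonFile] = entry.name;
  }

  return matchedData;
};

// Stand-in data for illustration only (not the real pattern files).
const samplePatterns = {
  browsers: { entries: [{ name: 'Safari', pattern: 'Safari/' }] },
  referrers: {
    entries: [{ name: 'Spotify', pattern: 'https://(open|api-partner)\\.spotify\\.com' }],
  },
};

const sampleResult = matchRequest(
  {
    'user-agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5.2 Safari/605.1.15',
    referer: 'https://open.spotify.com/',
  },
  samplePatterns
);
console.log(sampleResult);
```

With the headers from the logged request, the result carries both a `browsers` match and a `referrers` match, so the request can be attributed to Spotify rather than only to Safari.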

johnspurlock commented 11 months ago

Yeah, this library is meant to be used by everyone: folks who look only at User-Agent and want to track browsers only, and others who also look at Referer, which I agree good services should do to avoid penalizing web-based apps.

Here's where I try to spell this out in the readme:

(Optional) If type is browser and you also have the HTTP Referer header in your logs, to additionally break down by known web apps:

- Remove any newlines (never occurs except from bad actors)
- Iterate the referrers pattern file entries array in order, returning the first entry where pattern matches the Referer
- This will always result in either 0 or 1 entry
- If found, the referrer entity may also have a category of app (for web-based apps) or host (for podcast hosting company players)
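
The README steps quoted above could be sketched roughly as follows. This is an illustration under assumptions, not the library's own code: `findReferrer` and `describeBrowserHit` are hypothetical names, the single entry in `knownReferrers` is copied from the Spotify pattern earlier in this thread, and stripping carriage returns along with newlines is my own choice.

```javascript
// Hypothetical helper following the quoted README steps.
const findReferrer = (referer, entries) => {
  if (!referer) return undefined;

  // Step 1: remove any newlines (carriage returns too, as an assumption).
  const cleaned = referer.replace(/[\r\n]/g, '');

  // Step 2: iterate the entries in order and return the first match,
  // so the result is always zero or one entry.
  return entries.find((entry) => new RegExp(entry.pattern).test(cleaned));
};

// Stand-in for the referrers pattern file entries (illustration only).
const knownReferrers = [
  { name: 'Spotify', pattern: 'https://(open|api-partner)\\.spotify\\.com', category: 'app' },
];

// Only meant to be called once the User-Agent has been classified as a browser.
const describeBrowserHit = (browserName, referer) => {
  const referrer = findReferrer(referer, knownReferrers);
  const hit = { type: 'browser', name: browserName };
  if (referrer) {
    // Step 3: the entry may carry a category of 'app' or 'host'.
    hit.referrer = referrer.name;
    hit.referrerCategory = referrer.category;
  }
  return hit;
};

console.log(describeBrowserHit('Safari', 'https://open.spotify.com/'));
console.log(describeBrowserHit('Safari', 'https://example.com/'));
```

Keeping the browser result and adding the referrer next to it means services that ignore Referer still get a usable answer, while those that log it can break browser traffic down by web app.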