umputun / rlb-stats

Stats collector and reporter for RLB
MIT License

Podtrac Measurement Methodology White paper #9


paskal commented 4 years ago

The Podtrac Podcast Measurement Service Methodology white paper is a very interesting document that could be a good source of inspiration for where to point the development of this tool.

This issue is a reminder for myself to go through the document and create more issues for this repo.

paskal commented 3 years ago

User agent

Podtrac gathers the following information about each hit to a Podtrac-prefixed episode:

  • the date and time of the request
  • the IP address of the client making the request
  • the URL of the target media file
  • the source of the request (software and device)
  • various other parameters in the HTTP request headers

We don't gather the source of the request (software and device) or the request headers, which is impossible with our measurement method (log parsing).
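Since our data comes from log parsing, the fields above are only recoverable if they appear in the log line. As a rough illustration (the actual RLB log format may differ; the line format below is an assumption), extracting the date, IP, and target URL could look like:

```go
package main

import (
	"fmt"
	"strings"
)

// hit holds what we can recover from a single log line; software/device
// and other headers are available only if the log format records them.
type hit struct {
	IP, TS, URL string
}

// parseLine extracts IP, timestamp, and target URL from a combined-log-style
// line such as: 1.2.3.4 - - [ts] "GET /file HTTP/1.1" 200 123
// The real RLB log format may differ; this is only an illustration.
func parseLine(line string) (hit, error) {
	var h hit
	open, end := strings.Index(line, "["), strings.Index(line, "]")
	if open < 0 || end < open {
		return h, fmt.Errorf("no timestamp in %q", line)
	}
	h.IP = strings.Fields(line)[0]
	h.TS = line[open+1 : end]
	q := strings.SplitN(line[end:], `"`, 3) // request line is the first quoted part
	if len(q) < 3 {
		return h, fmt.Errorf("no request in %q", line)
	}
	if req := strings.Fields(q[1]); len(req) >= 2 {
		h.URL = req[1] // "method path proto" -> path
	}
	return h, nil
}

func main() {
	h, _ := parseLine(`203.0.113.7 - - [02/Jan/2021:15:04:05 +0000] "GET /rt_podcast123.mp3 HTTP/1.1" 200 1234`)
	fmt.Printf("%+v\n", h)
}
```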

Once Podtrac collects this raw data at the media file level, it analyzes the data using proprietary algorithms to eliminate redundant requests, bots, and fraudulent traffic to arrive at a consistent measure of actual listener / viewer activity.

Podtrac uses a proprietary algorithm involving IP address, user agent, and other factors to aggregate multiple requests into a single Unique Download. This algorithm is constantly evolving as the industry evolves. While this algorithm provides an accurate portrayal of user behavior on a large scale, it can provide odd results in small-scale, contrived tests.

We have some notion of that: we don't count traffic from the same IP multiple times within a minute. However, we don't filter out bots. Doing so would require keeping a list of bots' User-Agents, checking new entries against that list, and preferably storing this data alongside the entries so it could be filtered out later on. I think it's a high-effort, low-outcome activity which we should consider only if we have good enough reasons to prefer it over the (extremely cheap, storage-wise) aggregation we do currently.

A cheap solution for both points above could be to track the User-Agent and aggregate on it, support filtering out particular User-Agents in the API, and provide that ability to the user in the UI. Should we go that way?
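A sketch of what that aggregation-plus-filtering could look like (hypothetical names, not an existing rlb-stats API):

```go
package main

import "fmt"

// aggregate counts downloads per User-Agent, skipping agents listed in a
// caller-supplied exclusion set (e.g. bots the user filtered out in the UI).
// This is an illustrative sketch, not an existing rlb-stats function.
func aggregate(agents []string, exclude map[string]bool) map[string]int {
	counts := map[string]int{}
	for _, ua := range agents {
		if exclude[ua] {
			continue // filtered out by the user
		}
		counts[ua]++
	}
	return counts
}

func main() {
	c := aggregate(
		[]string{"Overcast/3.0", "Googlebot/2.1", "Overcast/3.0"},
		map[string]bool{"Googlebot/2.1": true},
	)
	fmt.Println(c) // map[Overcast/3.0:2]
}
```

Because the per-User-Agent counts are still aggregates, this stays cheap storage-wise while letting the API exclude any agent after the fact.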

Third-party services

In addition, Podtrac pulls in extra counts from certain third-party hosting services that cache podcast content for delivery directly to captive audiences, and that adhere to best practices in the tallying of their delivery counts.

We can try to figure this out one service at a time, as we stumble upon them, provided their APIs allow us to do that. However, I see great value in rlb-stats as it is now: it reliably shows most of the downloads and how they change over time. Covering a few of the numerous places a file can be cached in wouldn't change the big picture.

Show-Level Statistics

Downloads by Country - This is a count of Unique Downloads by country of origin, i.e. - the number of downloads initiated by listeners / viewers with IP addresses assigned to each country. Podtrac utilizes best of breed databases to identify country of origin from client IP addresses.

This can be done by resolving and recording the download country before aggregation. Should we do it?
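A minimal sketch of that flow, with a stub resolver standing in for a real GeoIP database (all names here are hypothetical):

```go
package main

import "fmt"

// countryResolver abstracts IP-to-country lookup; a real implementation
// would wrap a GeoIP database, a stub map stands in for this sketch.
type countryResolver interface {
	Country(ip string) string
}

type stubResolver map[string]string

func (s stubResolver) Country(ip string) string { return s[ip] }

// byCountry tallies downloads per country while the raw IPs are still
// available, i.e. before aggregation discards them.
func byCountry(ips []string, r countryResolver) map[string]int {
	counts := map[string]int{}
	for _, ip := range ips {
		counts[r.Country(ip)]++
	}
	return counts
}

func main() {
	r := stubResolver{"203.0.113.7": "US", "198.51.100.1": "DE"}
	fmt.Println(byCountry([]string{"203.0.113.7", "198.51.100.1", "198.51.100.1"}, r))
}
```

The key point is the ordering: the lookup must happen at collection time, because after aggregation the IPs are gone.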

Downloads by Source - This is a count of the Unique Downloads by source, where “source” represents prominent hardware and software platforms. As of 2016, Podtrac reports delivery separately for over 100 different podcatchers, media players, and podcast aggregation websites on all desktop and mobile operating systems.

The User-Agent aggregation mentioned above would provide this information.
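For illustration, mapping a raw User-Agent to a "source" bucket could start as simple substring matching (the patterns below are illustrative only; a real list would be much longer and maintained over time):

```go
package main

import (
	"fmt"
	"strings"
)

// source maps a raw User-Agent to a coarse "source" bucket, in the spirit
// of Podtrac's per-podcatcher reporting. The patterns are an assumption
// for the sketch, not a real podcatcher list.
func source(ua string) string {
	switch {
	case strings.Contains(ua, "AppleCoreMedia") || strings.Contains(ua, "iTunes"):
		return "Apple Podcasts / iTunes"
	case strings.Contains(ua, "Overcast"):
		return "Overcast"
	case strings.Contains(ua, "Mozilla"):
		return "Browser"
	default:
		return "Other"
	}
}

func main() {
	fmt.Println(source("Overcast/3.0 (+http://overcast.fm/)")) // Overcast
	fmt.Println(source("AppleCoreMedia/1.0.0.17B111"))         // Apple Podcasts / iTunes
}
```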

Global Unique Audience by Show - This is a count of the unique clients that access any of the show’s media during the specified time period. It differs from the sum of Unique Downloads for the various episodes in that downloads of multiple episodes by the same client are only counted once. This gives the most accurate measure of a show’s overall reach.

Impossible to count with our stats aggregation: to do that, we would need to store raw IP + User-Agent combinations indefinitely.
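To make the storage requirement concrete: a correct count needs a set keyed by the raw per-client data (keying by IP + User-Agent below is an assumption for the sketch), which is exactly what aggregation throws away:

```go
package main

import "fmt"

// uniqueAudience counts distinct clients, keyed here (as an assumption)
// by the IP + User-Agent pair, across all episodes of a show. It needs
// every raw pair retained, which our aggregation discards.
func uniqueAudience(hits [][2]string) int {
	seen := map[string]bool{}
	for _, h := range hits {
		seen[h[0]+"|"+h[1]] = true // same client, any episode: one entry
	}
	return len(seen)
}

func main() {
	fmt.Println(uniqueAudience([][2]string{
		{"203.0.113.7", "Overcast/3.0"},
		{"203.0.113.7", "Overcast/3.0"}, // same client, another episode
		{"203.0.113.7", "iTunes/12.0"},  // same IP, different client
	})) // prints 2
}
```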


That's everything I could get from the linked document. Eugene, I'll be waiting for your comments on this: we can start collecting the country and User-Agent of each download; let me know if we should. It seems to me that it would be useful.