Producing the AVRO schema with AWS Athena

jamescridland commented 5 years ago

This looks like a great project - thanks so much for it.

If you host on Cloudfront + Amazon S3, the simplest way to get logs out is to use AWS Athena to query your log files. Here's where I've got to so far with producing AVRO schema output using an AWS Athena SQL query...

SELECT lower(to_hex(md5(to_utf8(requestip)))) AS encoded_ip, useragent AS user_agent, method AS http_method, 'timestamp-todo' AS timestamp, uri AS episode_id, * FROM cloudfront_logs WHERE uri LIKE '/audio/%' AND method='GET' LIMIT 10

It would be helpful in your spec to identify the format of your timestamp, please (AWS Athena gives you a date and a time, which will need some work).
byte_start and byte_end are going to be interesting to extract, since these aren't in the logfile data. I can see whether it's a 200 or 206 response, and the total amount of bytes transferred.

Would you be able to clarify what this means:

It can't be a streaming request for the first 2 bytes (i.e. Range: bytes=0-1)

...it would be good to see if I can do that filtering this end.

anrope commented 5 years ago

Is this the log format you're working with?

https://docs.aws.amazon.com/athena/latest/ug/cloudfront-logs.html

Most HTTP servers allow you to log the HTTP Range header with each request, which identifies streaming downloads.

Some podcast clients, most notably Apple Podcasts, send a request with the HTTP header "Range: bytes=0-1" to check if the hosting server supports streaming downloads. This "stream check" request is sent before any actual download happens, so we need to throw it away.

Could you share a few lines of your logs? May be helpful to look at a series of requests from the same IP (hashed) with an AppleCoreMedia user agent for one file.

Re: timestamp format, we want RFC 3339, will update.

jamescridland commented 5 years ago

Here's a short sliver of the output of the above. All columns after episode_id are pulled from the logfile in their entirety (excepting IP address, which I've blanked).

Typically, you see groups of two requests:

a 324 byte file being sent with a 206 header.
a 1204184 byte file being sent with a 206 header to the same device.

However, this is also showing four requests - 322 bytes, 329 bytes, 65935 bytes and 1077231 bytes - from the same device within the same five seconds.

The easiest thing to do would be to simply add WHERE bytes>512 which would remove any of these small range requests. I suspect that the "bytes" include the HTTP overhead here. Does all this look sensible?

As an aside - with access to the two external tables - OPAWG and the server list - I think it perfectly possible to do this entire analysis to the oDL spec with an AWS Athena query, and that's what I'd like to work towards. This could mean oDL-compliant numbers with nothing more than AWS Athena, which would be an awesome thing.

04cc2eba-6dff-4193-ad98-e24718e1634b.csv.zip

tomrossi7 commented 5 years ago

This is exactly what we have been digging into! I love the idea behind oDL, but am struggling with the byte start and stop. Our CDN logs the requests using the W3C Extended Log File Format. I like what @jamescridland is proposing, but I'm not sure if we would miss downloads that were a collection of very small byte range requests.

jamescridland commented 5 years ago

Hey, @tomrossi7 !

I'm not sure if we would miss downloads that were a collection of very small byte range requests

It would be helpful, I think, if we were to understand whether there are any podcast apps out there that make small byte range requests. A 128kbps file means 16000 bytes per second, plus HTTP overhead. Might be worthwhile asking in the OPAWG Slack group ( james@crid.land if you want in).

alexis-focheux commented 5 years ago

(Thank you for this project, it brings the transparency and clarity that the podcast ecosystem needs).

Most HTTP servers allow you to log the HTTP Range header with each request

AFAIK it's not possible to get this information on Google Cloud CDN either.

It would be helpful, I think, if we were to understand whether there are any podcast apps out there that make small byte range requests.

I've studied the distribution of the range length requested by podcast app and so far, i can see a big gap from 1 or 2 bytes to something high. This means we could potentially find a relevant threshold. For example, here are my data for Apple Podcast and CastBox. The quantiles are the deciles of the distribution of the range length. E.g. for ApplePodcast bottom 0% request is 1 byte long, bottom 10% -> 2, bottom 40% -> 49768 ...

{ "user_agent": "AppleCoreMedia", "quantiles": [ "1", "2", "2", "2", "49768", "65536", "6667915", "16094692", "29109230", "45638334", "391259959" ] }, { "user_agent": "CastBox", "quantiles": [ "1", "1", "1", "2026883", "8393837", "15321773", "23429027", "31025459", "41280285", "65135484", "385443250" ] }

This, of course, needs more detailed analytics but this look promising.

open-downloads / odl

Producing the AVRO schema with AWS Athena #1