timhutton / twitter-archive-parser

Python code to parse a Twitter archive and output in various ways
GNU General Public License v3.0
2.4k stars 111 forks

[FAIL. Media couldn't be retrieved] with some mp4 files #92

Closed tobozo closed 1 year ago

tobozo commented 1 year ago

hey thanks for this great script! :+1:

could be a false negative but I'm getting FAIL error messages on some mp4 files:

 47/2991 media/1336316106835779593-3-ICmLbbI3-lB9nw.mp4: FAIL. Media couldn't be retrieved from 
https://video.twimg.com/ext_tw_video/1336303046565892096/pu/vid/896x720/3-ICmLbbI3-lB9nw.mp4?tag=10 
because of exception: 'content-length'

exception thrown at this line:

byte_size_after = int(res.headers['content-length'])
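That bare 'content-length' in the exception message is just Python's KeyError: requests exposes response headers as a dict-like object, and indexing a key the server never sent raises. A minimal sketch of what happens (using a plain dict as a stand-in for the real headers object, which is case-insensitive):

```python
# Stand-in for res.headers; a chunked response carries no Content-Length at all.
headers = {"Transfer-Encoding": "chunked"}

try:
    byte_size_after = int(headers["content-length"])
except KeyError as err:
    print(f"header missing: {err}")  # prints: header missing: 'content-length'

# A defensive .get() avoids the crash; None then means "size unknown".
raw = headers.get("content-length")
byte_size_after = int(raw) if raw is not None else None
```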

the content-length header appears to have a valid value though (screenshot from Firefox):

[screenshot: the response headers in Firefox, showing a valid content-length value]

be well and thanks for the awesomeness !

fl0werpowers commented 1 year ago

Did you run the script completely? On the first run of the downloading part it queues up failed downloads to retry them with a longer delay, and in my experience all the "content-length"-errored videos do get downloaded on the second run.
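The retry mechanism described above can be sketched roughly like so (function and parameter names here are hypothetical, not the script's actual code):

```python
import time

def download_with_retries(urls, fetch, tries=4, base_sleep=1.0):
    """First pass downloads everything; failures are queued and retried
    with a progressively longer sleep between passes."""
    failed = list(urls)
    sleep = base_sleep
    for _ in range(1 + tries):          # initial pass plus the retries
        still_failed = []
        for url in failed:
            try:
                fetch(url)
            except Exception:
                still_failed.append(url)
        failed = still_failed
        if not failed:
            break
        time.sleep(sleep)
        sleep *= 2                      # back off before the next pass
    return failed                       # whatever never succeeded
```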

tobozo commented 1 year ago

The script ran completely, the failures were spread evenly across the logs, and all the initially failed mp4s ended up with the SKIPPED status on the second pass. I'm not sure what caused this, though; maybe Twitter doing some throttling, or a glitchy load balancer?

Retrying the ones that failed, with a longer sleep. 4 tries remaining.

(...)

103 of 103 tested media files are known to be the best-quality available.

Total downloaded: 206.7MB = 0.20GB
Time taken: 3182s

Closing this as it's more feedback than an issue.

timhutton commented 1 year ago

It's certainly odd that we get an exception on that line. Worth trying to understand. @press-rouch any ideas?

tobozo commented 1 year ago

Twitter's cache servers appear to send an SPDY header to the browser; could that explain the behavioral difference with the Python script?

press-rouch commented 1 year ago

Huh, weird. I replicated the bug locally, but only the first time I ran it (so it did the retry, matched the content size, and didn't do the download). On subsequent runs it successfully got content-length in the first pass.

I guess we could change that line to a try/except and print out the whole header on failure. I hadn't looked at this chunk of code before; I think we could make some improvements:
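The try-and-print idea might look something like this (a sketch, not an actual patch; `remote_size` is a hypothetical helper name):

```python
def remote_size(res):
    """Return the byte size the server reports, or None, dumping the full
    header set when content-length is absent so we can see what we got."""
    try:
        return int(res.headers["content-length"])
    except KeyError:
        print(f"no content-length; headers were: {dict(res.headers)}")
        return None
```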

press-rouch commented 1 year ago

D'oh, scratch the head bit - didn't notice it's using stream=True. I've found what happens for an MP4 - it does identify that it hasn't parsed it, but it'll download it regardless of whether the local version is bigger, which seems a bit odd.

press-rouch commented 1 year ago

It seems that content-length could be missing if it's using Transfer-Encoding:chunked (see this answer). That's deprecated in HTTP/2, but it looks like Python requests still uses HTTP/1.1. My completely unsubstantiated theory is that if a video hasn't been served by Twitter in a while, then it might serve it in chunks, but once it has a warm cache then it can serve the whole thing.
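If chunked transfer encoding is indeed the cause, one possible workaround (a sketch under that assumption; `streamed_size` is a hypothetical helper) is to fall back to counting the streamed bytes when the header is absent:

```python
def streamed_size(res, chunk_size=8192):
    """Use Content-Length when the server sends it; with chunked transfer
    encoding the header is omitted, so count the bytes as they stream in."""
    reported = res.headers.get("content-length")
    if reported is not None:
        return int(reported)
    return sum(len(chunk) for chunk in res.iter_content(chunk_size=chunk_size))
```

Iterating the body this way is only an option because, as noted above, the script already opens the response with stream=True.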

real-or-random commented 1 year ago

Interestingly, all 4 of my mp4s failed with a different error message:

401/406 media/[...].mp4: FAIL. Media couldn't be retrieved from https://video.twimg.com/ext_tw_video/[...].mp4?tag=10 because of exception: HTTPSConnectionPool(host='video.twimg.com', port=443): Read timed out. (read timeout=2)

but were then successfully SKIPPED on the retry pass:

4/  4 media/[...].mp4: SKIPPED. Online version is same byte size, assuming same content. Not downloaded. ...u/vid/720x1280/[...].mp4?tag=10...
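For reference, the skip decision quoted in that log line amounts to a size comparison along these lines (a hypothetical helper, not the script's actual code):

```python
import os

def should_download(local_path, remote_size):
    """Skip the download when the online version reports the same byte
    size as the local file, assuming same content."""
    if remote_size is None:
        return True   # size unknown (e.g. chunked response): fetch to be safe
    if not os.path.exists(local_path):
        return True
    return os.path.getsize(local_path) != remote_size
```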