ghost closed this issue 5 years ago.
Ouch! Definitely a problem, yes.
I'm looking at the code again and I see that for media, I don't log in. That also means that we can't get throttled, and it also means that we cannot "pace". I guess I could write my own pacing code, but right now the other commands can do pacing because the Mastodon library does it.
For media, we don't use the Mastodon library. We just go through the archive, get the URLs we need to fetch, and then we download the media we're missing. Thus, I get the impression that the code should handle restarts gracefully as the files from the first run are still there and don't need downloading. The question is: why doesn't this work for you?
Perhaps a simple "fix" would be to reverse the order of the URLs. Start from the end. I see that many of the media files you're trying to fetch apparently don't exist, or no longer exist (404 Not Found). Perhaps the newer media files would still exist?
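The restart-friendly behavior described above, plus the "start from the end" idea, might look roughly like this (a minimal sketch, not the project's actual code; `urls`, `media_dir`, and the path derivation are all hypothetical):

```python
import os
import urllib.request


def download_missing(urls, media_dir):
    """Download each URL into media_dir, skipping files that already exist.

    Hypothetical sketch: on a restart, files fetched by an earlier run
    are still on disk, so they are not requested again. URLs are walked
    in reverse so the newest media, most likely to still exist on the
    server, is tried first.
    """
    failed = 0
    for url in reversed(urls):
        # Derive a local path from the URL (illustrative only).
        path = os.path.join(media_dir, url.split("://", 1)[1].lstrip("/"))
        if os.path.isfile(path):
            continue  # already downloaded on a previous run
        os.makedirs(os.path.dirname(path), exist_ok=True)
        try:
            urllib.request.urlretrieve(url, path)
        except OSError:
            failed += 1  # 403/404/network errors land here
    return failed
```

With this shape, rerunning after a partial run only attempts the files that are still missing.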
That's right, this is a first run from the start of the archive. I see a media_proxy directory:
$ ls social.wxcafe.net.user.amphetamine/media_proxy/
1182186/
This directory has no files. I also see a media_attachments directory:
$ ls social.wxcafe.net.user.amphetamine/system/media_attachments/files/
000/ 001/
I just ran it again for myself:
[...]
Downloading |############################ | 88/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/196/393/original/d75241f79a954797.jpeg
Downloading |############################### | 99/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/360/544/small/59cb8fb9abe4617b.png
Downloading |################################| 100/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/360/544/original/59cb8fb9abe4617b.png
44 downloads failed
As you can see, 44 of 100 downloads failed. If I rerun it, it will attempt these 44 downloads again.
Apparently, the files are all in the correct media directory:
~/Documents/Mastodon$ ls dice.camp.user.kensanata
dice-camp-mastodon media_attachments
And I seem to have all 54 files:
~/Documents/Mastodon$ find dice.camp.user.kensanata -name '*.jpg' -o -name '*.jpeg' -o -name '*.png' | wc -l
54
So, currently all I can say is that it seems to "work for me".
Looking at what you wrote, it might very well be that it works for you as well. If you run find social.wxcafe.net.user.amphetamine -type f | wc -l
you should get a count of the files across all the subdirectories.
So I think what I'll do right now is start the downloads in reverse order. Perhaps that simply gives us the best chance of downloading files that are still available on the server.
28e4e58 downloads the URLs in reversed order. I don't know what else to do, since it works for me: you seem to be getting all the files, plus some network errors whose cause I don't know.
ea16bd4 adds a very simple --pace argument which simply sleeps 1s after each request. Let me know if that helps. Perhaps the host you're connecting to is simply blocking you after you send it hundreds of requests?
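The --pace behavior described here amounts to something like the following (a sketch under assumptions, not the actual implementation; `fetch` is a hypothetical callable that downloads one URL and returns True on success):

```python
import time


def fetch_all(urls, fetch, pace=False):
    """Fetch each URL; with pace=True, sleep one second after every
    request so the server is less likely to throttle or block us.
    """
    failed = 0
    for url in reversed(urls):  # 28e4e58: newest media first
        if not fetch(url):
            failed += 1
        if pace:
            time.sleep(1)  # ea16bd4: very simple pacing
    return failed
```

Sleeping a fixed second is crude compared to honoring rate-limit headers, but since these media requests are unauthenticated, there is no library-provided throttling to lean on.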
I think that's the case, it looks good now. Thanks!
Media requests are throttled and eventually denied, but invoking the command again restarts the requests from the first archived post all over again. Pacing these requests might be one way to solve this, or alternatively, storing the last requested attachment would also ease the pain.
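The "store the last requested attachment" idea mentioned here could be sketched as follows (hypothetical; the thread above settled on --pace and reversed order instead, and the state-file name is invented):

```python
import json
import os


def download_with_checkpoint(urls, fetch, state_file="media_state.json"):
    """Resume from the last requested attachment instead of starting over.

    The index of the next URL to try is persisted after every request,
    so a rerun continues where the previous run stopped rather than
    re-requesting everything from the first archived post.
    """
    start = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            start = json.load(f).get("next", 0)
    for i in range(start, len(urls)):
        fetch(urls[i])  # hypothetical per-URL download helper
        with open(state_file, "w") as f:
            json.dump({"next": i + 1}, f)
```

One trade-off: a checkpoint resumes past permanently failed URLs too, so it would need to be cleared to retry them, whereas the skip-existing-files approach retries failures on every run.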