kensanata / mastodon-archive

Archive your statuses, favorites and media using the Mastodon API (i.e. login required)
https://alexschroeder.ch/software/Mastodon_Archive
GNU General Public License v3.0
358 stars 33 forks

Media requests should also take --pace #41

Closed ghost closed 5 years ago

ghost commented 5 years ago

Media requests are throttled and eventually denied, but invoking the command again restarts the requests from the first archived post all over again. Pacing these requests might be one way to solve this, or alternatively, storing the last requested attachment would also ease the pain.

Downloading |####                            | 1810/14002                                                                
Not Found: https://social.wxcafe.net/system/media_attachments/files/001/217/485/original/2717e40cab36a23d.png?1539819022 
Downloading |####                            | 1813/14002                                                                
Not Found: https://social.wxcafe.net/system/media_attachments/files/001/217/347/small/c2de73f9f4725c5a.png?1539815814    
Downloading |####                            | 1814/14002                                                                
Not Found: https://social.wxcafe.net/system/media_attachments/files/001/217/347/original/c2de73f9f4725c5a.png?1539815814 
Downloading |####                            | 1815/14002
Traceback (most recent call last):
  File "/usr/lib/python3.6/urllib/request.py", line 1318, in do_open                                                     
    encode_chunked=req.has_header('Transfer-encoding'))                                                                  
  File "/usr/lib/python3.6/http/client.py", line 1239, in request                                                        
    self._send_request(method, url, body, headers, encode_chunked)                                                       
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request                                                  
    self.endheaders(body, encode_chunked=encode_chunked)                                                                 
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders                                                     
    self._send_output(message_body, encode_chunked=encode_chunked)                                                       
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output                                                   
    self.send(msg)                                                                                                       
  File "/usr/lib/python3.6/http/client.py", line 964, in send                                                            
    self.connect()                                                                                                       
  File "/usr/lib/python3.6/http/client.py", line 1392, in connect                                                        
    super().connect()                                                                                                    
  File "/usr/lib/python3.6/http/client.py", line 936, in connect                                                         
    (self.host,self.port), self.timeout, self.source_address)                                                            
  File "/usr/lib/python3.6/socket.py", line 724, in create_connection                                                    
    raise err                                                                                                            
  File "/usr/lib/python3.6/socket.py", line 713, in create_connection                                                    
    sock.connect(sa)                                                                                                     
OSError: [Errno 101] Network is unreachable                                         
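The "store the last requested attachment" idea above could be sketched as a small resume pointer persisted between runs; the file name `media_resume.json` and the helper names here are purely illustrative, not part of mastodon-archive:

```python
# Hypothetical sketch of a resume pointer: remember the index of the
# next URL to try, so a rerun can pick up where the last one stopped.
import json
import os

STATE_FILE = "media_resume.json"  # illustrative name, not used by the tool

def load_resume_index():
    """Return the saved index, or 0 when starting fresh."""
    if os.path.isfile(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("next_index", 0)
    return 0

def save_resume_index(i):
    """Persist the index of the next URL to attempt."""
    with open(STATE_FILE, "w") as f:
        json.dump({"next_index": i}, f)
```

A download loop would call `save_resume_index` after each attempt and start from `load_resume_index()` on the next run.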

kensanata commented 5 years ago

Ouch! Definitely a problem, yes.

kensanata commented 5 years ago

I'm looking at the code again and I see that for media, I don't log in. That also means that we can't get throttled, and it also means that we cannot "pace". I guess I could write my own pacing code, but right now the other commands can do pacing because the Mastodon library does it.

For media, we don't use the Mastodon library. We just go through the archive, get the URLs we need to fetch, and then we download the media we're missing. Thus, I get the impression that the code should handle restarts gracefully as the files from the first run are still there and don't need downloading. The question is: why doesn't this work for you?
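The skip-if-already-archived behaviour described here can be sketched roughly as follows; `download_missing`, its arguments, and the path-derivation logic are illustrative assumptions, not the actual mastodon-archive code:

```python
# Hypothetical sketch of the media loop described above: walk the
# collected URLs and only fetch files that are not already on disk,
# so restarts naturally skip everything from earlier runs.
import os
import urllib.request
from urllib.parse import urlparse

def download_missing(media_urls, archive_dir):
    """Download each URL unless its local copy already exists.

    Returns the number of failed downloads (404s, network errors, ...).
    """
    failed = 0
    for url in media_urls:
        # Mirror the URL's path component under the archive directory.
        local_path = os.path.join(archive_dir, urlparse(url).path.lstrip("/"))
        if os.path.isfile(local_path):
            continue  # already archived on a previous run
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        try:
            urllib.request.urlretrieve(url, local_path)
        except OSError:
            failed += 1
    return failed
```

Under this scheme a rerun is cheap: existing files are skipped by the `os.path.isfile` check, and only previously failed or new URLs are requested again.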

Perhaps a simple "fix" would be to reverse the order of the URLs. Start from the end. I see that many of the media files you're trying to fetch apparently don't exist, or no longer exist (404 Not Found). Perhaps the newer media files would still exist?

ghost commented 5 years ago

That's right, this is a first run from the start of the archive. I see a media-proxy directory:

$ ls social.wxcafe.net.user.amphetamine/media_proxy/
1182186/

This directory has no files. I also see a media_attachments directory:

$ ls social.wxcafe.net.user.amphetamine/system/media_attachments/files/
000/  001/

kensanata commented 5 years ago

I just ran it again for myself:

[...]
Downloading |############################    | 88/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/196/393/original/d75241f79a954797.jpeg
Downloading |############################### | 99/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/360/544/small/59cb8fb9abe4617b.png
Downloading |################################| 100/100
Forbidden: https://s3.us-west-2.amazonaws.com/dice-camp-mastodon/media_attachments/files/000/360/544/original/59cb8fb9abe4617b.png

44 downloads failed

As you can see, 44 of 100 downloads failed. If I rerun it, it will attempt these 44 downloads again.

Apparently, the files are all in the correct media directory:

~/Documents/Mastodon$ ls dice.camp.user.kensanata
dice-camp-mastodon  media_attachments

And I seem to have all 54 files:

~/Documents/Mastodon$ find dice.camp.user.kensanata -name '*.jpg' -o -name '*.jpeg' -o -name '*.png' | wc -l
54

So, currently all I can say is that it seems to "work for me".

Looking at what you wrote, it might very well be that it works for you as well. If you run

find social.wxcafe.net.user.amphetamine -type f | wc -l

you should get a file count over all the subdirectories.

So I think what I'll do now is start the downloads in reverse order. Perhaps that simply gives us the best chance of downloading files that are still available on the server.

kensanata commented 5 years ago

28e4e58 downloads URLs in reverse order. I don't know what else to do, since it works for me and you seem to be getting all the files, plus some network errors that I can't explain.

kensanata commented 5 years ago

ea16bd4 adds a very simple --pace argument which simply sleeps 1s after each request. Let me know if that helps. Perhaps the host you're connecting to is simply blocking you after you send it hundreds of requests?
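The sleep-after-each-request behaviour can be illustrated like this; `paced` and `fetch` are stand-in names for sketching the idea, not the actual code in ea16bd4:

```python
# Minimal sketch of the --pace behaviour described above: sleep a fixed
# delay between requests to avoid tripping server-side rate limits.
import time

def paced(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds in between."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # crude pacing; a real limiter would honor rate-limit headers
    return results
```

A fixed 1-second delay is the simplest possible approach; it trades speed for a much lower chance of being blocked by the remote host.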

ghost commented 5 years ago

I think that's the case, it looks good now. Thanks!