kensanata / mastodon-archive

Archive your statuses, favorites and media using the Mastodon API (i.e. login required)
https://alexschroeder.ch/software/Mastodon_Archive
GNU General Public License v3.0

Media download fails roughly halfway through #22

Closed ghost closed 6 years ago

ghost commented 6 years ago

The download tends to fail after about 1400-1600 downloads, and doesn't retain the ones it succeeded in downloading on retry:

mastodon-archive media amphetamine@social.wxcafe.net
3352 urls in your backup (half of them are previews)
Downloading |###############                 | 1599/3352Traceback (most recent call last):
  File "/usr/local/bin/mastodon-archive", line 9, in <module>
    load_entry_point('mastodon-archive', 'console_scripts', 'mastodon-archive')()
  File "/home/user/mastodon-backup/mastodon_archive/__init__.py", line 65, in main
    args.command(args)
  File "/home/user/mastodon-backup/mastodon_archive/media.py", line 68, in media
    download.start(blocking = False)
  File "/usr/local/lib/python3.5/dist-packages/pySmartDL/pySmartDL.py", line 250, in start
    urlObj = urllib.request.urlopen(req, timeout=self.timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.5/urllib/request.py", line 1257, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.5/ssl.py", line 791, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.5/ssl.py", line 575, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

kensanata commented 6 years ago

Hm, if not a single file was saved, that means not a single download succeeded. Perhaps the ones that "succeeded" are media files that got deleted.

kensanata commented 6 years ago

One thing we might want to try is simply waiting? For now, I've removed the pySmartDL library and I'm downloading one image after another. I think this will slow things down considerably, but perhaps that's part of the problem: the code might have sent so many requests that the server decided to lock you out? The code should now also print error messages. Let me know whether you see anything interesting. Anyway, the changes are in media.py.
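
A minimal sketch of the "one file after another, with a pause" idea described above. This is not the actual code in media.py: `fetch_all`, `pause`, and the `opener` parameter are hypothetical names chosen for illustration, and the one-second default pause is an arbitrary guess, not a known server limit.

```python
import sys
import time
import urllib.request


def fetch_all(urls, pause=1.0, timeout=30, opener=urllib.request.urlopen):
    """Fetch URLs one at a time, yielding (url, data); data is None on failure."""
    for i, url in enumerate(urls):
        if i:
            # Wait between requests so the server is not flooded.
            time.sleep(pause)
        try:
            with opener(url, timeout=timeout) as response:
                data = response.read()
        except OSError as e:
            # URLError, HTTPError and socket.timeout are all OSError subclasses.
            print("\n{}: {}".format(e, url), file=sys.stderr)
            data = None
        yield url, data
```

Catching `OSError` covers the `socket.timeout` from the original traceback as well as HTTP errors such as 404, so a single failed URL is reported and skipped instead of aborting the whole run.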

ghost commented 6 years ago

  File "/home/user/mastodon-backup/mastodon_archive/media.py", line 67
    except UrlError as e:
         ^
SyntaxError: invalid syntax

Looks like something is wrong in the new version; this now happens immediately on run.

kensanata commented 6 years ago

OK, I’ll take another look. It worked on my Windows machine, so I assumed it was going to be no problem. :(

kensanata commented 6 years ago

Hopefully fixed in b7ac03f.

ghost commented 6 years ago

well, it's getting somewhere, but still fails to retain media and repeats a log that looks like this:

Not Found: https://social.wxcafe.net/media_proxy/191859/original
Downloading |##############                  | 1529/3352
Not Found: https://social.wxcafe.net/media_proxy/190290/small
Downloading |##############                  | 1530/3352
Not Found: https://social.wxcafe.net/media_proxy/190290/original
Downloading |##############                  | 1557/3352
Not Found: https://social.wxcafe.net/media_proxy/190330/small
Downloading |##############                  | 1558/3352
Not Found: https://social.wxcafe.net/media_proxy/190330/original
Downloading |###############                 | 1599/3352
Not Found: https://social.wxcafe.net/media_proxy/189341/small
Downloading |###############                 | 1600/3352
Not Found: https://social.wxcafe.net/media_proxy/189341/original
Downloading |###############                 | 1601/3352
Not Found: https://social.wxcafe.net/media_proxy/189321/small
Downloading |###############                 | 1602/3352
Not Found: https://social.wxcafe.net/media_proxy/189321/original
Downloading |###############                 | 1603/3352Traceback (most recent call last):
  File "/home/user/mastodon-backup/mastodon_archive/media.py", line 63, in media
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'social.wxcafe.net.user.amphetamine/system/media_attachments/files/000/189/299/small/media.png'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/mastodon-archive", line 9, in <module>
    load_entry_point('mastodon-archive', 'console_scripts', 'mastodon-archive')()
  File "/home/user/mastodon-backup/mastodon_archive/__init__.py", line 68, in main
    args.command(args)
  File "/home/user/mastodon-backup/mastodon_archive/media.py", line 67, in media
    print("\n" + e.msg + ": " + url, file=sys.stderr)
AttributeError: 'FileNotFoundError' object has no attribute 'msg'

Thanks for sticking with this, by the way. I appreciate it.
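
The second traceback above comes from the error handler itself: it reads `e.msg`, which `HTTPError` provides but `FileNotFoundError` does not. A hedged sketch of a handler-side helper that avoids assuming any particular attribute follows; `describe_error` is a hypothetical name, not a function in mastodon-archive.

```python
import urllib.error


def describe_error(e):
    """Return a short message for an exception raised while downloading."""
    if isinstance(e, urllib.error.HTTPError):
        # HTTPError carries a status code and a reason phrase.
        return "{} {}".format(e.code, e.reason)
    if isinstance(e, urllib.error.URLError):
        # URLError wraps the underlying cause in .reason.
        return str(e.reason)
    # Anything else (FileNotFoundError, socket.timeout, ...) has a usable str().
    return str(e)
```

With a helper like this, the handler could print `describe_error(e) + ": " + url` for any exception it catches, so a local file error is reported as such and does not trigger a second exception.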

kensanata commented 6 years ago

Downloading |##############                  | 1529/3352
Not Found: https://social.wxcafe.net/media_proxy/190290/small

Messages like the above simply mean that the files are no longer available from the server we are contacting, I think. I guess we could try to request the media from the originating server? I don’t know whether that requires us to get the remote status first, though. My first response would be to let it be. The admin deleted them from your server, so that’s it.

Downloading |###############                 | 1603/3352Traceback (most recent call last):
  File "/home/user/mastodon-backup/mastodon_archive/media.py", line 63, in media
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'social.wxcafe.net.user.amphetamine/system/media_attachments/files/000/189/299/small/media.png'

This is more surprising. Apparently we can’t create the file. Or the directories for the file. From the command line, can you write the file? Try this:

touch social.wxcafe.net.user.amphetamine/system/media_attachments/files/000/189/299/small/media.png

Can you create this file? I never get this error, so that’s confusing. Perhaps we need to force a creation of the directory before doing anything else? I’ll have to improve logging and see whether we can get to the bottom of this.

Sadly, I’m going to be busy for a day or two, so expect no changes.

ghost commented 6 years ago

yea, it looks like the directories down from 299 were not created.

kensanata commented 6 years ago

OK, 55fb633 tries to make sure every directory actually exists. I guess pySmartDL created those directories and when I removed pySmartDL and used urlopen instead, no new directories would get created.
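
For illustration, a minimal sketch of creating the parent directories before writing a file. This is not the code from 55fb633, which is not shown in this thread; `save_media` is a hypothetical helper name.

```python
import os


def save_media(data, file_name):
    """Write data to file_name, creating missing parent directories first."""
    directory = os.path.dirname(file_name)
    if directory:
        # exist_ok=True makes this a no-op when the directory is already there.
        os.makedirs(directory, exist_ok=True)
    with open(file_name, "wb") as fp:
        fp.write(data)
```

Plain `open(file_name, 'wb')` creates the file but never its parent directories, which is consistent with the `FileNotFoundError` reported earlier for the path under `.../189/299/small/`.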

kensanata commented 6 years ago

Did the issue get resolved? Feel free to reopen this issue if you run into problems.