Sometimes Images Get An Incorrect 404

sliceofcake commented 8 years ago

One image got a 404 for me at first. When I went through the images on my computer I noticed the bad file and deleted it, and when I ran the DL script again it downloaded just fine [although I did load that image manually in my browser before redoing the update process, and it initially showed a broken-image symbol until I refreshed].

I'm guessing the downloads themselves can occasionally go wrong. Maybe the script could detect that a jpg or png isn't valid [the bad file will be a text file containing a 404 or something, or maybe blank, but either way it'll be missing the magic identifier at the start of the file: https://en.wikipedia.org/wiki/List_of_file_signatures]. From there it could either retry the download, or just remove the bad file and wait for the user to re-run the script another day, when the chance of success may be higher.
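To make the magic-identifier idea concrete, here is a minimal sketch in Python. It is not code from the script; the names `SIGNATURES`, `detect_image_type`, and `find_bad_images` are made up for illustration, and it only checks the leading bytes for JPEG, PNG, and GIF, so a file that starts correctly but is truncated later would still pass.

```python
from pathlib import Path

# Leading "magic" bytes for each format (from the public list of file signatures).
SIGNATURES = {
    "jpg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def detect_image_type(data):
    """Return 'jpg', 'png', or 'gif' if data starts with a known signature, else None."""
    for kind, prefixes in SIGNATURES.items():
        if data.startswith(prefixes):
            return kind
    return None


def find_bad_images(folder):
    """Return image-named files in folder whose first bytes match no known signature."""
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        with path.open("rb") as f:
            head = f.read(8)  # the longest signature above is 8 bytes
        if detect_image_type(head) is None:
            bad.append(path)
    return bad
```

A caller could delete or re-download whatever `find_bad_images` returns. An empty file and a saved 404 page both come back as bad, since neither starts with an image signature.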

sliceofcake commented 8 years ago

Since each download curl request depends on the header curl request immediately before it succeeding, i.e. its response not matching [[[grep -q "404 Not Found"]]], I can't see a realistic way for the header request to find no 404 while the actual download is a 404.

Keeping this report around in case it later becomes apparent that this sequence of events can occur.

For now, single-gifs are supported, which would trigger a 404. Really though, if it redownloaded correctly, I can't see how that would happen, and I haven't encountered the problem since.

sliceofcake commented 8 years ago

Since the download is in the Python script now, a 404 will throw an exception, so it's impossible for an image file to be filled with 404 data instead of image data. When you run the script in full, any images that weren't downloaded will be properly downloaded. For all but the oddest corruption cases, I'm calling this solved.
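For anyone reading later, a rough sketch of why the exception closes the gap. This is not the script's actual code. It assumes the download goes through `urllib.request.urlopen`, which raises `urllib.error.HTTPError` on a 404; `download_image`, the `opener` parameter, and the `.part` suffix are hypothetical names used only for illustration.

```python
import os
import urllib.error
import urllib.request


def download_image(url, dest, opener=urllib.request.urlopen):
    """Download url to dest. Return True on success, False on an HTTP error.

    The body is written to a temporary ".part" file and renamed only after the
    whole response has been read, so a failed request leaves no file at dest.
    """
    tmp = dest + ".part"
    try:
        with opener(url) as response:  # urlopen raises HTTPError for a 404
            data = response.read()
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, dest)  # rename into place once the data is on disk
        return True
    except urllib.error.HTTPError as err:
        print(f"skipped {url}: HTTP {err.code}")
        return False
```

Because nothing is written at `dest` when the request fails, the next full run sees the image as missing and tries it again, which matches the behavior described above.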

sliceofcake / PixivMediaScraper

Sometimes Images Get An Incorrect 404 #5