opsdisk / metagoofil

Search Google and download specific file types

fix halting due to wrong file name or url #23

Closed DKanarsky closed 3 years ago

DKanarsky commented 3 years ago

Sometimes threads die with an exception and halt because of bad URLs:

[Errno 22] Invalid argument: 'folder\file.php?id=5394'

or because of filenames that are invalid for the OS:

[Errno 2] No such file or directory: 'folder\%D0%93%D1%80%D0%B0%D0%B6%D0%B4%D0%B0%D0%BD%D1%81%D1%82%D0%B2%D0%BE%D0%B4%D0%BB%D1%8F%D0%B8%D0%BD%D0%BE%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%D1%81%D1%82%D1%83%D0%B4%D0%B5%D0%BD%D1%82%D0%BE%D0%B2%D0%B3%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D1%8B%D1%85_%D0%92%D0%A3%D0%97%D0%BE%D0%B2.docx

opsdisk commented 3 years ago

Thanks for submitting this @DKanarsky! Give me a day or two to take a look. In the meantime, can you provide the:

1) Operating System
2) The full command you ran

DKanarsky commented 3 years ago
  1. Microsoft Windows [Version 10.0.19042.867]
  2. python metagoofil.py -d rudn.ru -o folder -t doc,docx,pdf

Because Windows forbids certain filename characters and limits name length, I think it's worth "normalizing" local filenames (and URL-decoding non-US-ASCII symbols). I'm going to work on this shortly.

opsdisk commented 3 years ago

Using the same command, I also got an

OSError: [Errno 36] File name too long

on Linux. Adding an OSError exception handler would be a temporary fix (I agree with your PR). In the spirit of trying to make it more robust and elegant, give me a few days to figure out if there's a better way to manage long and non-US-ASCII file names.
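
As a rough illustration of that temporary fix, the sketch below catches `OSError` around the file write so that one bad name is logged and skipped instead of halting the worker. The function `save_file` and its return convention are hypothetical and are not taken from metagoofil or the PR.

```python
import logging


def save_file(path, data):
    """Hypothetical helper: write bytes to disk; return False instead of raising on OS errors."""
    try:
        with open(path, "wb") as fh:
            fh.write(data)
        return True
    except OSError as exc:
        # OSError covers the errors reported in this thread
        # (Errno 2, Errno 22, Errno 36); log and let the caller continue.
        logging.warning("Skipping %s: %s", path, exc)
        return False
```

This only keeps the download loop alive. The affected file is still not saved, which is why it is a stopgap rather than a fix for long or non-US-ASCII names.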

opsdisk commented 3 years ago

@DKanarsky - do a git pull to grab the latest. For the changes, see https://github.com/opsdisk/metagoofil/pull/24

opsdisk commented 3 years ago

Hi @DKanarsky, did you get a chance to try the latest code?

opsdisk commented 3 years ago

Closing this one out. Let me know if you still run into any issues with it.