User670 opened 4 years ago
Is it possible for this module to continue a suspended job, skipping files that have already been saved?
Yes. Pywebcopy skips files that already exist, so you can consider the job resumed when you rerun it.
(Also, what are the time out thresholds and retry limits for the requests? Can I specify these values?)
No. You have to rerun the script/command manually, i.e. with overwrite=False in scripts or without the --overwrite flag on the command line.
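The skip-on-rerun behaviour described above can be illustrated with a minimal stand-alone sketch. This is not pywebcopy's internals; the `fetch` callable is hypothetical. The idea is simply that each file is downloaded only if it is not already on disk, so rerunning after a crash naturally picks up where the previous run stopped:

```python
import os

def download_if_missing(url, dest, fetch):
    """Save `url` to `dest` unless `dest` already exists.

    `fetch` is a hypothetical callable mapping url -> bytes; the real
    library works differently, this only illustrates the resume idea.
    Returns True if the file was downloaded, False if it was skipped.
    """
    if os.path.exists(dest):
        return False  # already saved by a previous run: skip it
    data = fetch(url)
    with open(dest, "wb") as f:
        f.write(data)
    return True
```

With overwrite disabled, a second run over the same project folder turns into a fast walk over already-saved files, which is why the author calls it "resumed".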
(Also, can I make it print some logs if a request failed or timed out and is doing a retry?)
Yes. Set debug=True or pass the --debug flag; it will then print logs that you can inspect manually.
I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to prevent high load and avoid being banned by the source.
Is it possible to set such a delay between requests, like --wait in wget?
It would be great for both sides (the source website won't be DDoSed, and the crawler won't be banned in the middle of the process).
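As far as I can tell the library does not advertise a wget-style --wait option, but the idea is easy to sketch as a rate-limited wrapper around whatever fetch function a crawler uses. Everything here (`PoliteFetcher`, the `fetch` callable) is hypothetical illustration, not the library's API:

```python
import time

class PoliteFetcher:
    """Wrap a fetch callable so consecutive calls are spaced at least
    `wait` seconds apart, similar in spirit to wget's --wait option."""

    def __init__(self, fetch, wait=1.0):
        self.fetch = fetch   # hypothetical callable: url -> response
        self.wait = wait     # minimum gap between requests, in seconds
        self._last = 0.0     # monotonic timestamp of the previous call

    def __call__(self, url):
        elapsed = time.monotonic() - self._last
        if elapsed < self.wait:
            time.sleep(self.wait - elapsed)  # pause to spread out the load
        self._last = time.monotonic()
        return self.fetch(url)
```

A wrapper like this keeps the request rate bounded regardless of how fast the parser discovers new links, which is what protects both the source site and the crawler.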
I don't think I got banned, and I wasn't talking about a delay between requests.
What I was experiencing was that after a while the crawl simply froze, with no messages printed to the console for minutes, and I had to kill the process and start over (otherwise it wouldn't move).
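A freeze like this is commonly a request hanging without a socket timeout. I don't know of a documented knob in this library for the timeout threshold or retry count the original question asks about, but generically the fix is a wrapper that enforces both. A sketch with hypothetical names, where `fetch` stands in for any single-request call that should itself use a bounded timeout (e.g. `requests.get(url, timeout=10)`):

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call `fetch(url)` up to `retries` times, printing a line per
    failure and re-raising the last error if every attempt fails."""
    last_err = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:  # timeout, connection reset, ...
            last_err = err
            print(f"attempt {attempt}/{retries} failed: {err}")
            if attempt < retries:
                time.sleep(delay)  # back off before the next try
    raise last_err
```

With a bounded per-request timeout and a capped retry count, a bad URL fails loudly after a predictable interval instead of hanging the whole crawl silently.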
Trying to clone a webpage, but it froze after a while, probably due to some network hiccups. I had to kill the process and start over (only to get stuck again, to be honest). Is it possible for this module to continue a suspended job, skipping files that have already been saved?
(Also, what are the time out thresholds and retry limits for the requests? Can I specify these values?)
(Also, can I make it print some logs if a request failed or timed out and is doing a retry?)
Windows 10, Python 3.8.1. Module installed via `pip install pywebcopy`, invoked from the command line with `python -m pywebcopy save_webpage http://y.tuwan.com/chatroom/3701 ./ --bypass_robots`.