rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Question: can it continue a suspended job? #55

Open · User670 opened this issue 4 years ago

User670 commented 4 years ago

I'm trying to clone a webpage, but it froze after a while, probably due to some network hiccup. I had to kill the process and start over (only to get stuck again, to be honest). Is it possible for this module to continue a suspended job, skipping files that have already been saved?

(Also, what are the time out thresholds and retry limits for the requests? Can I specify these values?)

(Also, can I make it print some logs if a request failed or timed out and is doing a retry?)

Windows 10, Python 3.8.1. The module was installed via pip install pywebcopy and invoked from the command line as python -m pywebcopy save_webpage http://y.tuwan.com/chatroom/3701 ./ --bypass_robots.
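
For reference, a script-level equivalent of that command would presumably look something like this (the keyword names follow the pywebcopy README; the exact signature may vary between pywebcopy versions):

```python
# A sketch of the script-level equivalent of the command above.
# Keyword names follow the pywebcopy README; the exact signature
# may differ between pywebcopy versions.
from pywebcopy import save_webpage

save_webpage(
    url='http://y.tuwan.com/chatroom/3701',
    project_folder='./',       # same target directory as in the command
    bypass_robots=True,        # equivalent of --bypass_robots
)
```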

rajatomar788 commented 4 years ago

@user670

Is it possible for this module to continue a suspended job, skipping files that have already been saved?

Yes. Pywebcopy skips files that already exist, so rerunning it effectively resumes the job.

(Also, what are the time out thresholds and retry limits for the requests? Can I specify these values?)

No. You have to rerun the script/command manually, i.e. with overwrite=False in a script or without the --overwrite flag on the command line.

(Also, can I make it print some logs if a request failed or timed out and is doing a retry?)

Yes. Set debug=True in a script or pass the --debug flag, and it will print logs that you can inspect.
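
A minimal sketch of those two settings in a script (the keyword spellings here follow this thread and may differ across pywebcopy versions, so check your installed version's config if they are rejected):

```python
# Sketch combining the settings mentioned above. The keyword spellings
# (overwrite, debug) are taken from this thread and may be different in
# your installed pywebcopy version.
from pywebcopy import save_webpage

save_webpage(
    url='http://y.tuwan.com/chatroom/3701',
    project_folder='./',
    bypass_robots=True,
    overwrite=False,   # keep already-downloaded files, i.e. "resume"
    debug=True,        # print per-request logs to the console
)
```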

dibarpyth commented 3 years ago

(Also, what are the time out thresholds and retry limits for the requests? Can I specify these values?)

No. You have to rerun the script/command manually, i.e. with overwrite=False in a script or without the --overwrite flag on the command line.

I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to keep the load down and avoid being banned by the source.

Is it possible to set such a delay between requests, like --wait in WGET?

It would be great for both sides (the source website won't be DDoSed and the crawler won't get banned in the middle of the process).
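
For illustration, this is the kind of per-request delay that wget's --wait gives you. The sketch below is a generic requests-level throttle, not a pywebcopy feature, and actually wiring something like it into pywebcopy would depend on its internals:

```python
# Generic illustration of a fixed per-request delay, similar in spirit to
# wget --wait. Not a pywebcopy feature.
import time
import requests

class ThrottledSession(requests.Session):
    """A requests.Session that sleeps a fixed interval before every request."""

    def __init__(self, wait_seconds=2.0):
        super().__init__()
        self.wait_seconds = wait_seconds

    def request(self, *args, **kwargs):
        time.sleep(self.wait_seconds)   # crude politeness delay
        return super().request(*args, **kwargs)

# Every call through this session waits 2 seconds before hitting the server.
session = ThrottledSession(wait_seconds=2.0)
response = session.get('https://example.com/')
print(response.status_code)
```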

User670 commented 3 years ago

I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to keep the load down and avoid being banned by the source.

Is it possible to set such a delay between requests, like --wait in WGET?

It would be great for both sides (the source website won't be DDoSed and the crawler won't get banned in the middle of the process).

I don't think I got banned, and I wasn't talking about delays between requests.

What I was experiencing was that, after a while, the crawl just freezes, with no messages printed to the console for minutes, and I had to kill the process and start over (otherwise it wouldn't move).
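
In case it helps anyone hitting the same hang: one generic, untested workaround is to set a global socket timeout before running pywebcopy, so a stalled connection raises an error instead of blocking forever. Combined with the skip-existing-files behaviour described above, rerunning after the error effectively resumes. This is a plain Python mechanism, not a documented pywebcopy option:

```python
# Generic workaround for silent hangs: sockets opened without an explicit
# timeout will give up after 30 seconds instead of blocking indefinitely.
# Not a documented pywebcopy option.
import socket
from pywebcopy import save_webpage

socket.setdefaulttimeout(30)   # seconds

save_webpage(
    url='http://y.tuwan.com/chatroom/3701',
    project_folder='./',
    bypass_robots=True,
)
```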