rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

site restrictions #28

Open marshonhuckleberry opened 4 years ago

marshonhuckleberry commented 4 years ago

It works on some websites but fails on others. I looked through the issues for a solution to the "permission error" and found one that suggested ignoring robots.txt, but I still get the permission error. The only difference with the robots.txt bypass is that it downloads one more page than before. No luck at all with this site: "http://mathworld.wolfram.com/"

rajatomar788 commented 4 years ago

What code are you using? I also need to see the log file, if you can find it.

marshonhuckleberry commented 4 years ago

The code:

import pywebcopy
import requests
from pywebcopy import save_webpage

pywebcopy.SESSION.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
kwargs = {'project_name': 'new'}

save_webpage(
    url='http://mathworld.wolfram.com/topics/',
    project_folder='path',
    bypass_robots=True,
    debug=True,
    **kwargs
)

The log file: pywebcopy_log.log

rajatomar788 commented 4 years ago

Try setting the User-Agent in pywebcopy.config so that it applies across the whole project.


import pywebcopy

pywebcopy.config['http_headers']['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'

pywebcopy.config.setup_config("http://mathworld.wolfram.com/", "path", project_name="new", bypass_robots=True)

pywebcopy.save_webpage("http://mathworld.wolfram.com/", "path")
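
To check whether robots.txt rules are what blocks a given User-Agent, independent of pywebcopy, a rough sketch with the standard library's urllib.robotparser might help. The rules below are hypothetical and made up for illustration; they are not the actual robots.txt of mathworld.wolfram.com, and the helper name is_allowed is my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only. These are NOT
# the rules actually served by mathworld.wolfram.com.
SAMPLE_RULES = """\
User-agent: *
Disallow: /topics/

User-agent: Mozilla
Allow: /
""".splitlines()


def is_allowed(rules, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines, no network access needed
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://mathworld.wolfram.com/topics/"
    agents = (
        "python-requests/2.22.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    )
    for agent in agents:
        print(agent, "->", is_allowed(SAMPLE_RULES, agent, url))
```

To test against a real site, RobotFileParser.set_url() followed by read() fetches the live robots.txt instead of the sample lines. If the rules allow the URL and the error persists, the restriction is probably coming from somewhere other than robots.txt.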

marshonhuckleberry commented 4 years ago

pywebcopy_log.log

marshonhuckleberry commented 4 years ago

error!