rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

How to clone linked pages? #63

Open rstmsn opened 3 years ago

rstmsn commented 3 years ago

I'm running the following example. The target page downloads fine, but none of the linked pages are downloaded. Is there a configuration flag I can set to download hyperlinked pages?

```python
from pywebcopy import save_website

kwargs = {'project_name': 'xxxx-clone-eb'}

save_website(
    url='https://xxxxx.com/ARTICLES/xxxx.htm',
    project_folder='/Users/xxxx/Documents/Code/xxxxx/EB',
    **kwargs
)
```

rajatomar788 commented 3 years ago

Are the hyperlinked pages hosted on the same site domain or outside?

rstmsn commented 3 years ago

On the same domain. Currently it doesn't seem to be following any hyperlinks. In my project_folder I'm only seeing one .html file, even though the page links to many other pages.

rajatomar788 commented 3 years ago

pywebcopy builds a hierarchical folder structure, so your hyperlinked pages might be saved in subfolders relative to the main HTML file.
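To check what was actually written, you can list the project folder recursively, e.g. (using the project_folder from the first post; just a quick diagnostic, not part of pywebcopy):

```python
# List everything that was written under the project folder
# (path taken from the first post in this thread).
import pathlib

project = pathlib.Path("/Users/xxxx/Documents/Code/xxxxx/EB")
for path in sorted(project.rglob("*")):
    if path.is_file():
        print(path.relative_to(project))
```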

rstmsn commented 3 years ago

No. There is only one .html file, no other folders, no other files. When I click a hyperlink within the .html file it 404s, because the package has not followed or downloaded any of the hyperlinks. Why would this be?

unmurphy commented 3 years ago

Facing the same issue, any updates?

rajatomar788 commented 3 years ago

It could be a server-side, site-specific issue. Maybe the hyperlinks are not resolving properly due to a bad URL or HTML formatting.

rstmsn commented 3 years ago

> It could be a server-side, site-specific issue. Maybe the hyperlinks are not resolving properly due to a bad URL or HTML formatting.

The server side is fine. I just successfully cloned the same site manually using wget. For others who might benefit from this code-free solution:

```bash
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.org --no-parent www.example.org/directory/
```

unmurphy commented 3 years ago

@rstmsn With your solution, I found that each HTML src attribute was still pointing to the origin resource. Do you have any other ideas?

rstmsn commented 3 years ago

> @rstmsn With your solution, I found that each HTML src attribute was still pointing to the origin resource. Do you have any other ideas?

Recursive find & replace using grep?
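For example, something along these lines in Python does the same rewrite (the origin URL and mirror folder below are placeholders, not from this thread):

```python
# Rough Python equivalent of a recursive find & replace over the wget mirror.
# ORIGIN and MIRROR are placeholders and need to match your site / folder.
import pathlib

ORIGIN = "https://www.example.org/"       # absolute prefix still present in src/href
MIRROR = pathlib.Path("www.example.org")  # folder created by wget

for html_file in MIRROR.rglob("*.html"):
    text = html_file.read_text(encoding="utf-8", errors="ignore")
    # Swap the absolute prefix for a root-relative one; adjust the replacement
    # string to however the local copy will be served.
    html_file.write_text(text.replace(ORIGIN, "/"), encoding="utf-8")
```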

monim67 commented 1 year ago

Any update on this? Facing the same issue.

rajatomar788 commented 1 year ago

OK, just use the save_website function instead of save_webpage.

monim67 commented 1 year ago

> OK, just use the save_website function instead of save_webpage.

The issue is with the save_website function itself. It's downloading a single page just like save_webpage. I'm using pywebcopy 7.0.2 on macOS.

BradKML commented 1 year ago

Does anyone here know how to crawl a whole subdomain? I'm currently trying to test something out.

rajatomar788 commented 1 year ago

@BrandonKMLee You can modify the session object that is created in the save_website function so that it discards any unwanted domains. Have a look at the source code of the save_website function and then just replace the default session with a modified one.
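For illustration, something like this (a sketch using plain requests; exactly where to substitute it inside save_website has to be checked against the pywebcopy source, it is not shown here):

```python
# Sketch of a session that discards requests to unwanted domains.
from urllib.parse import urlparse
import requests

class DomainFilterSession(requests.Session):
    def __init__(self, allowed_domains):
        super().__init__()
        self.allowed_domains = set(allowed_domains)

    def request(self, method, url, *args, **kwargs):
        host = urlparse(url).hostname or ""
        allowed = any(host == d or host.endswith("." + d) for d in self.allowed_domains)
        if not allowed:
            # Return an empty 404-style response instead of fetching,
            # effectively discarding off-domain links.
            resp = requests.Response()
            resp.status_code = 404
            resp.url = url
            resp._content = b""
            return resp
        return super().request(method, url, *args, **kwargs)

# session = DomainFilterSession({"example.org"})
# ...then substitute this for the default session used by save_website.
```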

BradKML commented 1 year ago

Let's say there are these three scenarios:

  1. You want to scrape "https://www.nateliason.com/notes*" but nothing else under "https://www.nateliason.com", and all child URLs "https://www.nateliason.com/notes/{pages}" are traceable from the parent page.
  2. You want to scrape the articles within "https://paulminors.com/resources/book-summaries", but the URLs there are shortened, and the site also has other unrelated articles.
  3. You want to scrape a whole website (any) but no other domain.

I am trying to figure out which value to change within the Session, since I don't know how it is tied to save_website.
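Roughly, the filtering I have in mind would look something like this (the helper is hypothetical, not part of pywebcopy; scenario 2 would also need the shortened URLs resolved, e.g. by following redirects, before applying the check):

```python
# Sketch of the kind of URL filter each scenario needs.
from urllib.parse import urlparse

def allow_url(url, domain, path_prefix=""):
    parts = urlparse(url)
    host = parts.hostname or ""
    on_domain = host == domain or host.endswith("." + domain)
    return on_domain and parts.path.startswith(path_prefix)

# Scenario 1: only the notes section of the site
# allow_url(url, "nateliason.com", "/notes")
# Scenario 3: the whole site, nothing off-domain
# allow_url(url, "nateliason.com")
```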