Open · rstmsn opened this issue 3 years ago
Are the hyperlinked pages hosted on the same site domain or outside?
On the same domain. Currently, it doesn't seem to be following any hyperlinks. In my project_folder, I'm only seeing a single .html file, even though the page contains many linked pages.
pywebcopy builds a hierarchical structure, meaning your hyperlinked pages might be in subfolders relative to the main HTML file.
No. There is only one .html file, no other folders, no other files. When I click a hyperlink within the .html file, it 404s because the package has not followed or downloaded any of the linked pages. Why would this be?
Facing the same issue, any updates?
It could be a server-side, site-specific issue. Maybe the hyperlinks are not resolving properly due to bad URL or HTML formatting.
The server side is fine. I just successfully cloned the same site manually using wget. For others who might benefit from this code-free solution:
```shell
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains example.org --no-parent www.example.org/directory/
```
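For context: `--convert-links` rewrites the links in the downloaded pages to point at the local copies, `--page-requisites` pulls in the CSS, images and scripts each page needs, and `--no-parent` keeps the crawl from wandering above the starting directory, which together is what makes the offline copy browsable.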
@rstmsn For your solution, I found that each HTML src attribute was still pointing at the origin resource. Do you have any other ideas?
recursive find & replace using grep?
Any update on this? Facing the same issue.
OK, just use the save_website function instead of save_webpage.
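For anyone landing here, a minimal sketch of the two calls, assuming both functions take the same url / project_folder / project_name keywords used in the example at the bottom of this thread (the URL and paths are placeholders):

```python
from pywebcopy import save_webpage, save_website

# save_webpage downloads only the single target page
save_webpage(
    url='https://xxxxx.com/ARTICLES/xxxx.htm',
    project_folder='/Users/xxxx/Documents/Code/xxxxx/EB',
    project_name='xxxx-clone-eb',
)

# save_website is meant to follow hyperlinks and mirror the whole site
save_website(
    url='https://xxxxx.com/ARTICLES/xxxx.htm',
    project_folder='/Users/xxxx/Documents/Code/xxxxx/EB',
    project_name='xxxx-clone-eb',
)
```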
The issue is with the save_website function itself. It's downloading a single page, just like save_webpage. I'm using pywebcopy 7.0.2 on macOS.
Does anyone here know how to crawl a whole subdomain? Currently trying to test something out.
@BrandonKMLee You can modify the session object that is created in the save_website function to discard any unwanted domains. Look at the source code of the save_website function and then just replace the default session with a modified one.
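A minimal sketch of what such a session could look like, assuming the internal session is a requests.Session. How it actually gets swapped in depends on pywebcopy's internals, so check the save_website source as suggested above; the domain below is hypothetical:

```python
from urllib.parse import urlparse

import requests


class DomainFilteringSession(requests.Session):
    """A requests.Session that refuses to fetch URLs outside an allow-list."""

    def __init__(self, allowed_domains):
        super().__init__()
        self.allowed_domains = set(allowed_domains)

    def request(self, method, url, *args, **kwargs):
        host = urlparse(url).hostname or ''
        # Allow exact matches and subdomains of the allowed domains
        if not any(host == d or host.endswith('.' + d) for d in self.allowed_domains):
            raise requests.exceptions.InvalidURL(f'Blocked off-domain request: {url}')
        return super().request(method, url, *args, **kwargs)


# Usage: only allow example.org and its subdomains (hypothetical domain)
session = DomainFilteringSession({'example.org'})
```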
Let's say there are these three scenarios:
I am trying to figure out which value to change within Session, since I don't know how it is tied to save_website.
I'm running the following example. The target page downloads OK, but none of the linked pages are being downloaded. Is there a configuration flag I can set to download hyperlinked pages?
```python
from pywebcopy import save_website

kwargs = {'project_name': 'xxxx-clone-eb'}

save_website(
    url='https://xxxxx.com/ARTICLES/xxxx.htm',
    project_folder='/Users/xxxx/Documents/Code/xxxxx/EB',
    **kwargs,
)
```