rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Skip crawling and replacement of other domains #94

Open macwinnie opened 2 years ago

macwinnie commented 2 years ago

Hi,

is there a configuration option to skip other domains from being crawled, so that I only copy the whole website stack of example.com and don't download remote links to google.com, facebook.com and so on?

I want to follow the basic documentation to copy one complete website, like this:

from pywebcopy import save_website

kwargs = {
    'project_name': 'example',
}

save_website(
    url='https://example.com',
    project_folder='web_example',
    **kwargs
)

Thanks a lot! Best macwinnie

rajatomar788 commented 2 years ago

Hey, this will be possible in pywebcopy 7. Currently it's only available as source code in the GitHub repo.

If you want the domain-exclusion functionality, you'll have to set up pywebcopy 7 manually from that source.
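One way to do that, assuming the repository's default branch is installable with pip, would be to install it straight from GitHub:

pip install git+https://github.com/rajatomar788/pywebcopy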

monim67 commented 1 year ago

How do I configure this in pywebcopy 7?

rajatomar788 commented 1 year ago

The Session object has a .domain_blacklist attribute, a set or list that is empty by default. If you want to skip downloads from a certain domain, just add that domain as a string to it; from then on the crawler will skip any page or file from that domain.
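For reference, here is a minimal sketch of what that could look like against the pywebcopy 7 source. Only the domain_blacklist attribute is confirmed above; the factory names (get_config, create_crawler) and the .session attribute on the crawler are assumptions drawn from the repo and may differ from the exact current API:

from pywebcopy.configs import get_config

# Assumed factory from the pywebcopy 7 source: builds the config for the target site.
config = get_config(
    'https://example.com',
    project_folder='web_example',
    project_name='example',
)

# Assumed: the crawler created by the config exposes its Session as .session.
crawler = config.create_crawler()

# domain_blacklist is empty by default; add plain domain strings to it.
# If it is a set in your version, use .add() instead of .append().
crawler.session.domain_blacklist.append('google.com')
crawler.session.domain_blacklist.append('facebook.com')

# From here on, the crawler skips any page or file hosted on those domains.
crawler.get('https://example.com')
crawler.save_complete()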

monim67 commented 1 year ago

If we use domain_blacklist, resources from blacklisted domains don't get downloaded, but the HTML links still point to local resources that were never downloaded:

<a class="navbar-item" href="../blacklisted-domain/style.css">

rajatomar788 commented 1 year ago

Blacklisting is done by the session object separately, so a blacklisted URL becomes an unreachable link while the localiser still rewrites it to a local path. I guess it's not a bug, it's a feature. On a serious note, blocking should be done in the scheduler so that it prevents this from happening. The scheduler also needs a queue.
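To illustrate that idea (a standalone sketch, not pywebcopy code): if the blacklist were checked at scheduling time, a blacklisted URL would never be queued at all, so it would neither be downloaded nor rewritten to a local path, and its original remote address could be kept in the saved HTML.

from collections import deque
from urllib.parse import urlsplit

# Hypothetical blacklist; in pywebcopy this lives on the Session object.
DOMAIN_BLACKLIST = {'google.com', 'facebook.com'}

def is_blacklisted(url):
    """Return True if the URL's host is (a subdomain of) a blacklisted domain."""
    host = urlsplit(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in DOMAIN_BLACKLIST)

queue = deque()

def schedule(url):
    """Queue a URL for download and localisation only if it passes the blacklist.
    Skipped URLs would keep their original remote link in the saved HTML."""
    if not is_blacklisted(url):
        queue.append(url)

schedule('https://example.com/style.css')        # queued
schedule('https://www.google.com/analytics.js')  # skipped
print(list(queue))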