rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

save_website/crawl() does not download PDF #27

Closed · chstrehlow closed this issue 1 year ago

chstrehlow commented 4 years ago

I tried to clone a complete website and noticed that the PDF files were skipped. This is the code I currently use:

from pywebcopy import config, Crawler  # exact import path may differ across pywebcopy versions

# URL, ProjectFolder and ProjectName are defined elsewhere in my script
config.setup_config(
    project_url=URL,
    project_folder=ProjectFolder,
    project_name=ProjectName,
    bypass_robots=True,
)

crawler = Crawler()
crawler.crawl()

But the following

from pywebcopy import save_website

save_website(
    url='http://example-site.com/index.html',
    project_folder='path/to/downloads',
    **kwargs
)

produced the same result.

One of the URLs I tested was https://www.akkufit-berger.de/kataloge/#akkus. As far as I can see, the .pdf extension is part of “safe_file_exts”, which is the default option.

Even if I point the URL directly to the PDF file, it just downloads an HTML file that has a different file size than the original PDF and cannot be opened with a browser or a PDF viewer.
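
For reference, here is a quick check of what the server actually returns for the direct PDF URL, using plain requests rather than pywebcopy (a minimal diagnostic sketch; the URL is the catalogue PDF mentioned below):

import requests

# Minimal diagnostic sketch: ask the server what it reports for the direct PDF URL.
url = "https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf"
response = requests.head(url, allow_redirects=True)
print(response.status_code, response.headers.get("Content-Type"))
# If this prints application/pdf, the server side is fine and the .html output
# comes from how the crawler handles or rewrites the resource.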

rajatomar788 commented 4 years ago

The PDFs are not downloaded because they are not on the same domain as the project URL, so the process marks them as external and skips them entirely.

chstrehlow commented 4 years ago

But the links point to the same domain: https://www.akkufit-berger.de/kataloge/#akkus and https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf. The IP address is also the same.

I already noticed this “domain behavior” on a different site. There was a link pointing to the same server but missing the “www” (http://example.com/file.ext instead of http://www.example.com/file.ext), and it seems it was treated as an external link. Is there a way to whitelist external domains or to use placeholders?
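
For illustration, a plain hostname comparison (presumably roughly what the crawler does, though I have not checked the source) would indeed treat the www and non-www variants as different hosts:

from urllib.parse import urlsplit

# Illustration only: a strict netloc comparison, not pywebcopy's actual check.
project = urlsplit("http://www.example.com/index.html")
link = urlsplit("http://example.com/file.ext")
print(project.netloc, link.netloc)     # www.example.com example.com
print(project.netloc == link.netloc)   # False -> the link looks "external"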

rajatomar788 commented 4 years ago

I just checked the project again. No, it doesn't allow PDF downloading as of now, to avoid bandwidth issues. It could be available in future versions.


rajatomar788 commented 4 years ago

Whitelisting is not available in the current version.

But there is a hack I built for making URLs absolute, so that you can download any of the PDFs manually by just clicking on them.

https://drive.google.com/file/d/0B6XyXxdVDjXIQTYwSVpmaF9ETldTcnNQeXVKZ0VKNUFBQVhN/view?usp=sharing
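
The general idea behind that hack (not the linked file itself) is to rewrite relative links in the saved HTML to absolute URLs, so PDF links still point at the live server. With lxml it looks roughly like this; the file path and base URL below are placeholders:

import lxml.html

# Rough sketch of the idea, not the linked file: rewrite relative links in a
# saved page to absolute URLs so PDF links stay clickable in the offline copy.
saved_page = "path/to/downloads/index.html"   # placeholder path
base_url = "https://www.akkufit-berger.de/"

doc = lxml.html.parse(saved_page).getroot()
doc.make_links_absolute(base_url)

with open(saved_page, "wb") as fh:
    fh.write(lxml.html.tostring(doc))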


BradKML commented 1 year ago

Checking on the status of this: are there automatic buffers/throttling so that websites would not flag the crawler's traffic, or maybe distributed crawlers to help out?

rajatomar788 commented 1 year ago

OK, so in the new pywebcopy 7 you can just create a new GenericResource which could download the PDFs after checking the content type of the response. You would have to read the elements.py file to do it manually.
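
As a rough standalone sketch of that idea (checking the Content-Type before saving) without touching pywebcopy's elements.py — the function name, URL and paths below are placeholders, not part of the pywebcopy API:

import os
import requests

# Standalone sketch, not the actual GenericResource API from pywebcopy.elements:
# fetch a linked file and keep it only if the server says it really is a PDF.
def download_if_pdf(url, dest_folder):
    response = requests.get(url, stream=True, timeout=30)
    content_type = response.headers.get("Content-Type", "")
    if "application/pdf" not in content_type:
        return None  # not a PDF; leave it to the normal crawler handling
    filename = os.path.basename(url.split("?")[0]) or "download.pdf"
    dest = os.path.join(dest_folder, filename)
    with open(dest, "wb") as fh:
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
    return dest

# Hypothetical usage with the catalogue PDF from this thread:
# download_if_pdf("https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf", "path/to/downloads")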