The PDFs are not downloaded because they do not appear to be on the same domain server, so the process marks them as external and skips them entirely.
But the links point to the same domain:
https://www.akkufit-berger.de/kataloge/#akkus
https://www.akkufit-berger.de/wp-content/uploads/2018/10/EndkundenKatalog-Back-Up-Akkus.pdf
The IP address is also the same?
I already noticed this “domain behavior” on a different site. There was a link pointing to the same server, but it was missing the “www” (http://example.com/file.ext instead of http://www.example.com/file.ext), and it seems it was treated as an external link. Is there a way to whitelist external domains or to use placeholders?
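For illustration, here is what a strict hostname comparison does with such a pair of URLs (a minimal sketch of the presumed check, not a reading of pywebcopy's actual code):

```python
# A strict netloc comparison sees the bare host and the "www." host
# as two different sites, so the file ends up classified as external.
from urllib.parse import urlparse

page = "http://www.example.com/index.html"
link = "http://example.com/file.ext"  # same server, missing "www"

print(urlparse(page).netloc)                           # www.example.com
print(urlparse(link).netloc)                           # example.com
print(urlparse(page).netloc == urlparse(link).netloc)  # False -> "external"
```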
I just checked the project again. No, it doesn't allow PDF downloading as of now, to avoid bandwidth issues. It could become available in future versions.
Whitelisting is not available in the current version.
But there is a hack I built for making URLs absolute, so that you can download any of the PDFs manually by just clicking on them.
https://drive.google.com/file/d/0B6XyXxdVDjXIQTYwSVpmaF9ETldTcnNQeXVKZ0VKNUFBQVhN/view?usp=sharing
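For reference, the core of the idea is rewriting every link in the saved page to an absolute URL. A minimal standalone sketch (not the linked file itself; BeautifulSoup is only one convenient way to do the rewriting, and the file names here are hypothetical):

```python
# Rewrite href/src attributes in a saved page to absolute URLs so the
# PDF links stay clickable even though the files were not downloaded.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.akkufit-berger.de/kataloge/"
in_path = "kataloge.html"             # hypothetical local copy of the page
out_path = "kataloge_absolute.html"

with open(in_path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for tag in soup.find_all(["a", "link", "img", "script"]):
    attr = "href" if tag.name in ("a", "link") else "src"
    if tag.get(attr):
        tag[attr] = urljoin(base_url, tag[attr])

with open(out_path, "w", encoding="utf-8") as f:
    f.write(str(soup))
```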
Checking on the status of auto-buffering, so that websites would not flag the network as a crawler (or maybe distributed crawlers could help out).
So, in the new pywebcopy 7 you can just create a new GenericResource which could download the PDFs after checking the content type of the response. You would have to read the elements.py file to do it manually.
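As a standalone sketch of that idea, assuming nothing about GenericResource's actual interface (elements.py is the reference for that), the content-type check amounts to:

```python
# Fetch a URL and save it as a PDF only if the server actually
# returned one, instead of trusting the file extension in the link.
import requests

def save_if_pdf(url, filename):
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    if "application/pdf" not in resp.headers.get("Content-Type", ""):
        return False  # not a PDF: an error page, a redirect target, etc.
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return True

save_if_pdf(
    "https://www.akkufit-berger.de/wp-content/uploads/2018/10/"
    "EndkundenKatalog-Back-Up-Akkus.pdf",
    "EndkundenKatalog-Back-Up-Akkus.pdf",
)
```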
I tried to clone a complete website and noticed that the PDF files were skipped. This is the code I currently use:
But
Produced the same result.
One of the URLs I tested was: https://www.akkufit-berger.de/kataloge/#akkus
As far as I can see, the PDF extension is part of the “safe_file_exts”, which is the default option.
Even if I point the URL directly at the PDF file, it just downloads an HTML file which has a different file size than the original PDF and cannot be opened with the browser or the PDF viewer.
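One quick way to narrow this down is to check what the server actually returns for the direct link, independent of pywebcopy (plain requests, nothing assumed about the library):

```python
# If Content-Type comes back as text/html, the server itself served an
# HTML page; if it is application/pdf, the mangling happens client-side.
import requests

url = ("https://www.akkufit-berger.de/wp-content/uploads/2018/10/"
       "EndkundenKatalog-Back-Up-Akkus.pdf")
resp = requests.head(url, allow_redirects=True)
print(resp.status_code, resp.headers.get("Content-Type"))
```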