Closed KuroiKuro closed 4 years ago
Try the save_website API method to see if it works that way.
```python
from pywebcopy import save_website
```
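For reference, a minimal sketch of such a call. The URL and folder are the ones used later in this thread; the argument names are an assumption based on the `setup_config` keywords shown below, so check your pywebcopy version's `save_website` signature before relying on them:

```python
# Sketch only -- argument names are assumptions based on setup_config's
# keywords; verify against your installed pywebcopy's save_website signature.
site_kwargs = {
    'project_url': 'http://localhost:8000/',
    'project_folder': '/tmp/savefiles',
    'bypass_robots': True,  # the local test server returns 404 for robots.txt
}

# then:
#   from pywebcopy import save_website
#   save_website(**site_kwargs)
```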
Hi, the save_website api method gives me the same result. Here is the output:
```
pywebcopy.configs - INFO - Got response 404 from http://localhost:8000/robots.txt
pywebcopy.configs - INFO - Got response 404 from http://localhost:8000/robots.txt
/home/user/script/venv/lib64/python3.6/site-packages/pywebcopy/webpage.py:84: UserWarning: Global Configuration is not setup. You can ignore this if you are going manual.This is just one time warning regarding some unexpected behavior.
  "Global Configuration is not setup. You can ignore this if you are going manual."
pywebcopy.configs - INFO - Got response 200 from http://localhost:8000/
parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469e20908>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff46b6fe5a0>>
webpage - INFO - Starting save_complete Action on url: ['http://localhost:8000/']
webpage - INFO - Starting save_assets Action on url: 'http://localhost:8000/'
webpage - Level 100 - Queueing download of <3> asset files.
webpage - INFO - Starting save_html Action on url: 'http://localhost:8000/'
pywebcopy.configs - INFO - Got response 200 from http://localhost:8000/1.html
parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469bcdac8>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff469e26178>>
webpage - INFO - WebPage saved successfully to /tmp/savefiles/localhost/localhost/index.html
pywebcopy.configs - INFO - Got response 200 from http://localhost:8000/2.html
parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469e13cf8>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff46b6fe5a0>>
pywebcopy.configs - INFO - Got response 200 from http://localhost:8000/folder/folder.html
parsers - INFO - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469be7048>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff469e26210>>
core - INFO - Saved the Project as ZIP archive at /tmp/savefiles/localhost.zip
core - INFO - Downloaded Contents Size :: 1 KB's
```
It looks like it is seeing my other pages correctly, but not downloading them for some reason.
It is weird. Can you try a different site? Here is a demo site http://demo.cyotek.com
If it still doesn't work then you can
Hi, it works for the demo website, so I think it is my site that is causing the issue.
I also want to add a delay between each request. I saw from issue #35 that I am supposed to override the get method of the SESSION object, but I'm not too sure how to do that. From the code it looks like SESSION is an instance of AccessAwareSession, so do I have to create a new class that inherits from AccessAwareSession, override the get method, and then change SESSION to be an instance of my new class?
Something like this:

```python
from time import sleep

# AccessAwareSession and AccessError are defined inside pywebcopy
# (configs.py in 6.x); import them from wherever your version keeps them.

class Example(AccessAwareSession):
    def get(self, url, **kwargs):
        if self._parser_ready and not self._can_access(url):
            raise AccessError("Access is not allowed by the site of url %s" % url)
        sleep(1)  # delay 1 second between requests
        return super(Example, self).get(url, **kwargs)
```

And then in configs.py:

```python
SESSION = Example()
```
Is this the right way to implement a delay?
Yes, it's okay for now. Native support for a delay will be available in the next major version.
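Until then, the same effect can also be had without editing configs.py, by wrapping the session's get callable. A generic stdlib-only sketch (`ThrottledGetter` is a hypothetical helper, not part of pywebcopy):

```python
import time

class ThrottledGetter:
    """Wrap any callable so consecutive calls are at least `delay` seconds apart."""

    def __init__(self, get, delay=1.0):
        self._get = get
        self._delay = delay
        self._last = float('-inf')  # no throttling before the first call

    def __call__(self, url, **kwargs):
        wait = self._delay - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
        return self._get(url, **kwargs)

# Applied to pywebcopy it would be a monkey-patch, e.g.:
#   SESSION.get = ThrottledGetter(SESSION.get, delay=1.0)
```

This avoids subclassing library internals, at the cost of patching an instance attribute at runtime.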
Alright, thank you so much for the help.
Hi, I am having the same issue; it only crawls the same page...
This is the code I am using:

```python
from pywebcopy import Crawler, config, SESSION


class Downloader:
    # Class variables
    USERAGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0"

    def download_website(self, url, folder):
        kwargs = {
            'project_url': url,
            'project_folder': folder,
            'debug': False,
            'over_write': True,
            'bypass_robots': True,
            # 'allowed_file_ext': safe_file_exts,
            # 'http_headers': safe_http_headers,
            'load_css': False,
            'load_javascript': False,
            'load_images': False,
            'join_timeout': None,
        }
        config.setup_config(**kwargs)

        payload = {'name': '12345', 'form_id': 'user_login', 'pass': 'password', 'op': 'Log in'}
        SESSION.headers.update(payload)
        SESSION.get("http://intranet.local:8000/user/login/admin")
        r = SESSION.post("http://intranet.local:8000/user/login/admin", data=payload)
        config["http_headers"] = SESSION.headers

        crawler = Crawler()
        # print(f"Downloading {url} to {folder}")
        crawler.crawl()


website_file_path = "/tmp/savefiles"
url = "http://intranet.local:8000/section/guidance"
downloader = Downloader()
downloader.download_website(url, website_file_path)
```
Hey,
Your implementation using a class isn't the recommended way forward. Either you should use the direct save_website (you can still configure the session to log in before the function call), or you should inherit the Crawler object from the urls.py module.
Issue: crawler.crawl() only saves the first page of the website.
Expected result: All pages of the website are downloaded.
Description: I am trying this out on my own local machine using Python's http.server. Here is the directory structure of my test website:
Index.html contains the following information:
Here is the code of my script:
Output of the code:
After this, running the ls command at /tmp/savefiles/localhost/localhost shows that only index.html was downloaded. What I hope to achieve is to download 1.html, 2.html and folder/folder.html as well.
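The directory listing and file contents above were lost from the report, but the URLs in the crawl log make the layout clear. A reconstruction of an equivalent test site (the link markup inside the pages is an assumption; only the file names come from the log):

```python
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
(root / "folder").mkdir()

# Pages named in the crawl log; index.html links to the rest so a
# crawler following anchors can discover them.
pages = {
    "index.html": ('<a href="1.html">1</a> '
                   '<a href="2.html">2</a> '
                   '<a href="folder/folder.html">folder</a>'),
    "1.html": "<p>page one</p>",
    "2.html": "<p>page two</p>",
    "folder/folder.html": "<p>nested page</p>",
}
for name, body in pages.items():
    (root / name).write_text(body)

# Serve it for the crawler with (Python 3.7+; on 3.6, cd into the
# directory first and drop --directory):
#   python3 -m http.server 8000 --directory <root>
```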
Python version: 3.6.8
pywebcopy==6.3.0