rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Crawler.crawl() only saves first page #43

Closed KuroiKuro closed 4 years ago

KuroiKuro commented 4 years ago

Issue: crawler.crawl() only saves the first page of the website.
Expected result: all pages of the website are downloaded.
Description: I am trying this out on my own local machine using Python's http.server. Here is the directory structure of my test website:

.
├── 1.html
├── 2.html
├── folder
│   └── folder.html
└── index.html

index.html contains the following:

<!DOCTYPE html>
<p style="color:yellow">
The quick brown fox jumps over the lazy dog
</p>

<a href="1.html">1.html</a>
<a href="2.html">2.html</a>
<a href="folder/folder.html">folder.html</a>

Here is the code of my script:

from pywebcopy import Crawler, config

class Downloader:

    # Class variables
    USERAGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0"

    def download_website(self, url, folder):
        config.setup_config(
            project_url=url,
            project_folder=folder,
            zip_project_folder=False,
            over_write=True,
            bypass_robots=True
        )
        headers = config.get("http_headers")
        headers["User-Agent"] = self.USERAGENT
        config["http_headers"] = headers
        crawler = Crawler()
        print(f"Downloading {url} to {folder}")
        crawler.crawl()

website_file_path = "/tmp/savefiles"
url = "http://localhost:8000"
downloader = Downloader()
downloader.download_website(url, website_file_path)

Output of the code:

/home/user/script/venv/lib64/python3.6/site-packages/pywebcopy/webpage.py:84: UserWarning: Global Configuration is not setup. You can ignore this if you are going manual.This is just one time warning regarding some unexpected behavior.
  "Global Configuration is not setup. You can ignore this if you are going manual."
Downloading http://localhost:8000 to /tmp/savefiles
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe40bc2e1d0>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7fe40be7e178>>
webpage    - INFO     - Starting save_complete Action on url: ['http://localhost:8000/']
webpage    - INFO     - Starting save_assets Action on url: 'http://localhost:8000/'
webpage    - Level 100 - Queueing download of <3> asset files.
webpage    - INFO     - Starting save_html Action on url: 'http://localhost:8000/'
webpage    - INFO     - WebPage saved successfully to /tmp/savefiles/localhost/localhost/index.html
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/2.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe40bc2e518>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7fe40be7e178>>
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/1.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe40bc2e908>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7fe40be7e210>>
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/folder/folder.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7fe40bc2ee48>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7fe40be7e2a8>>

After this, running the ls command at /tmp/savefiles/localhost/localhost shows that only index.html was downloaded. What I hope to achieve is to download 1.html, 2.html and folder/folder.html as well.

Python version: 3.6.8; pywebcopy==6.3.0

rajatomar788 commented 4 years ago

Try the save_website API method to see if it works that way.

from pywebcopy import save_website
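
For example, something along these lines (a rough sketch reusing the same options you passed to setup_config above; the keyword names follow the pywebcopy 6.x examples):

from pywebcopy import save_website

# Same options as the setup_config call in your script above.
kwargs = {
    'bypass_robots': True,
    'over_write': True,
    'zip_project_folder': False,
}

save_website(
    url='http://localhost:8000',
    project_folder='/tmp/savefiles',
    **kwargs
)
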
KuroiKuro commented 4 years ago

Hi, the save_website api method gives me the same result. Here is the output:

pywebcopy.configs - INFO     - Got response 404 from http://localhost:8000/robots.txt
pywebcopy.configs - INFO     - Got response 404 from http://localhost:8000/robots.txt
/home/user/script/venv/lib64/python3.6/site-packages/pywebcopy/webpage.py:84: UserWarning: Global Configuration is not setup. You can ignore this if you are going manual.This is just one time warning regarding some unexpected behavior.
  "Global Configuration is not setup. You can ignore this if you are going manual."
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469e20908>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff46b6fe5a0>>
webpage    - INFO     - Starting save_complete Action on url: ['http://localhost:8000/']
webpage    - INFO     - Starting save_assets Action on url: 'http://localhost:8000/'
webpage    - Level 100 - Queueing download of <3> asset files.
webpage    - INFO     - Starting save_html Action on url: 'http://localhost:8000/'
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/1.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469bcdac8>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff469e26178>>
webpage    - INFO     - WebPage saved successfully to /tmp/savefiles/localhost/localhost/index.html
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/2.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469e13cf8>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff46b6fe5a0>>
pywebcopy.configs - INFO     - Got response 200 from http://localhost:8000/folder/folder.html
parsers    - INFO     - Parsing tree with source: <<urllib3.response.HTTPResponse object at 0x7ff469be7048>> encoding <ISO-8859-1> and parser <<lxml.etree.HTMLParser object at 0x7ff469e26210>>
core       - INFO     - Saved the Project as ZIP archive at /tmp/savefiles/localhost.zip
core       - INFO     - Downloaded Contents Size :: 1 KB's

It looks like it is seeing my other pages correctly, but not downloading them for some reason.

rajatomar788 commented 4 years ago

That is weird. Can you try a different site? Here is a demo site: http://demo.cyotek.com

If it still doesn't work, then you can:

  1. Install a previous version.
  2. Try running the pywebcopy tests in your environment.
  3. Finally, you can try the beta version, pywebcopy 7, from http://github.com/rajatomar788/pywebcopy7/. It has all the same APIs but is much more versatile.

KuroiKuro commented 4 years ago

Hi, it works for the demo website, so I think it is my site that is causing the issue.

I also want to add a delay between each request. I saw from issue #35 that I am supposed to override the get method of the SESSION object, but I'm not too sure how to do that. From the code it looks like SESSION is an instance of AccessAwareSession, so do I have to create a new class that inherits from AccessAwareSession, override the get method, and then change SESSION to be an instance of my new class?

Something like this:

from time import sleep

# Note: this assumes AccessAwareSession and AccessError are in scope,
# e.g. if the class is added to pywebcopy's configs.py where both are defined.
class Example(AccessAwareSession):
    def get(self, url, **kwargs):
        if self._parser_ready and not self._can_access(url):
            raise AccessError("Access is not allowed by the site of url %s" % url)
        sleep(1)  # delay 1 second between requests
        return super(Example, self).get(url, **kwargs)

And then in configs.py

SESSION = Example()

Is this the right way to implement a delay?
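
Or, alternatively, would it be acceptable to patch SESSION.get at runtime instead of editing configs.py inside site-packages? A rough sketch of what I mean (assuming the shared SESSION object can be imported from the top-level pywebcopy package):

from time import sleep
from pywebcopy import SESSION  # the shared session used for all requests

_original_get = SESSION.get  # keep a reference to the bound method

def delayed_get(url, **kwargs):
    sleep(1)  # wait 1 second before every request
    return _original_get(url, **kwargs)

SESSION.get = delayed_get  # shadow the method on this instance only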

rajatomar788 commented 4 years ago

Yes, it's okay for now. Native support for a delay will be available in the next major version.

KuroiKuro commented 4 years ago

Alright, thank you so much for the help.

claudfernandes commented 4 years ago

Hi, I am having the same issue; it only crawls the first page...

This is the code I am using


from pywebcopy import Crawler, config, SESSION

class Downloader:
    # Class variables
    USERAGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0"

    def download_website(self, url, folder):
        kwargs = {
            'project_url': url,
            'project_folder': folder,
            'debug': False,
            'over_write': True,
            'bypass_robots': True,
            # 'allowed_file_ext': safe_file_exts,
            # 'http_headers': safe_http_headers,
            'load_css': False,
            'load_javascript': False,
            'load_images': False,
            'join_timeout': None,
        }

        config.setup_config(**kwargs)
        payload = {'name': '12345', 'form_id': 'user_login', 'pass': 'password', 'op': 'Log in'}
        SESSION.headers.update(payload)
        SESSION.get("http://intranet.local:8000/user/login/admin")
        r = SESSION.post("http://ntranet.local:8000/user/login/admin", data=payload)
        config["http_headers"] = SESSION.headers
        crawler = Crawler()
        # print("Downloading {url} to {folder}")
        crawler.crawl()

website_file_path = "/tmp/savefiles"
url = "http://intranet.local:8000/section/guidance"

downloader = Downloader()
downloader.download_website(url, website_file_path)

rajatomar788 commented 4 years ago

Hey, your implementation using a class isn't the recommended way forward. Either use the direct save_website API (you can still configure the session to log in before the function call), or inherit from the Crawler object in the urls.py module.
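
Something along these lines, as a rough sketch (it reuses the URLs and login payload from your snippet, and the keyword names follow the earlier save_website examples; adjust as needed):

from pywebcopy import save_website, SESSION

# Log in first so the shared session carries the authenticated cookies.
payload = {'name': '12345', 'form_id': 'user_login', 'pass': 'password', 'op': 'Log in'}
SESSION.get("http://intranet.local:8000/user/login/admin")
SESSION.post("http://intranet.local:8000/user/login/admin", data=payload)

# Then hand the crawl over to the high-level API.
save_website(
    url="http://intranet.local:8000/section/guidance",
    project_folder="/tmp/savefiles",
    bypass_robots=True,
    over_write=True,
)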