uniAIDevs / onion-cloner

Onion website cloner
Apache License 2.0

Sweep: Allow for full website cloning #1

Closed: uniAIDevs closed this issue 6 months ago

uniAIDevs commented 6 months ago
Checklist

- [X] Create `utils.py` ✓ https://github.com/uniAIDevs/onion-cloner/commit/f657ea4b306a0ab2f4e62679216a8e0ee174b5bd
- [X] Running GitHub Actions for `utils.py` ✓
- [X] Modify `singlepage_colner.py` ✓ https://github.com/uniAIDevs/onion-cloner/commit/9beea760bef21303aae2dabc0495c90fee185bab
- [X] Running GitHub Actions for `singlepage_colner.py` ✓
sweep-ai[bot] commented 6 months ago

🚀 Here's the PR! #3

See Sweep's progress at the progress dashboard!
💎 Sweep Pro: I'm using GPT-4. You have unlimited GPT-4 tickets. (tracking ID: d520bd8a9d)
Install Sweep Configs: Pull Request

[!TIP] I can email you next time I complete a pull request if you set up your email here!




Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If some file is missing from here, you can mention the path in the ticket description. https://github.com/uniAIDevs/onion-cloner/blob/452fd1e4029a7d658d634907fdb0ac584e283986/singlepage_colner.py#L1-L105

Step 2: ⌨️ Coding

Ran GitHub Actions for f657ea4b306a0ab2f4e62679216a8e0ee174b5bd:

--- 
+++ 
@@ -1,6 +1,7 @@
 from bs4 import BeautifulSoup
 import requests
 import os
+from utils import extract_internal_links, is_valid_link

 port = input("Socks listener port: ") # Port that Tor Socks listener working on
@@ -73,7 +74,8 @@
 def download_image(file, link):
     r = requests.get(link, stream=True, proxies=proxies)
     if r.status_code == 200:
-        with open( direactory + "\\" + file, 'wb') as f:
+        os.makedirs(os.path.dirname(direactory + "\\" + file), exist_ok=True)
+        with open(direactory + "\\" + file, 'wb') as f:
             for chunk in r.iter_content(1024):
                 f.write(chunk)

Ran GitHub Actions for 9beea760bef21303aae2dabc0495c90fee185bab:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/allow_for_full_website_cloning.


🎉 Latest improvements to Sweep:
  • New dashboard launched for real-time tracking of Sweep issues, covering all stages from search to coding.
  • Integration of OpenAI's latest Assistant API for more efficient and reliable code planning and editing, improving speed by 3x.
  • Use the GitHub issues extension for creating Sweep issues directly from your editor.

💡 To recreate the pull request, edit the issue title or description. Something wrong? Let us know.

This is an automated message generated by Sweep AI.

codeautopilot[bot] commented 6 months ago

Potential solution

To enable full website cloning, we need to modify singlepage_cloner.py to recursively follow and clone internal links, handle errors, respect robots.txt, manage assets, avoid duplicates, and maintain the website's directory structure. In utils.py, we need to improve the extract_internal_links function, add URL normalization, respect robots.txt, optimize for performance, and add error handling.

How to implement

singlepage_cloner.py

  1. Add a function to check robots.txt (a Tor-aware variant is sketched after this list):

    from urllib import robotparser
    from urllib.parse import urlparse

    def is_allowed_by_robots(url):
        rp = robotparser.RobotFileParser()
        rp.set_url(urlparse(url).scheme + '://' + urlparse(url).netloc + '/robots.txt')
        rp.read()
        return rp.can_fetch("*", url)
  2. Turn the single-page `clone_page` flow into a recursive `clone_website` function:

    def clone_website(link, visited=None):
        if visited is None:
            visited = set()

        if link in visited or not is_allowed_by_robots(link):
            return
        visited.add(link)

        try:
            html = download_page(link)
            # ... existing code to download assets ...
            internal_links = extract_internal_links(html, link)

            for internal_link in internal_links:
                if is_valid_link(internal_link) and internal_link not in visited:
                    clone_website(internal_link, visited)
        except requests.exceptions.RequestException as e:
            print(f"Error downloading {link}: {e}")
  3. Call clone_website with the user-provided link:

    clone_link = input("Enter Link to clone: ")
    clone_website(clone_link)
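One caveat with step 1 above: `RobotFileParser.read()` fetches robots.txt with urllib, which bypasses the Tor SOCKS proxy the rest of the script uses, so it will fail for `.onion` hosts. A hedged alternative is to fetch robots.txt through the same `requests` proxies and feed the text to the parser. The proxy settings below are illustrative only; the real script builds them from the user-supplied port:

```python
import requests
from urllib import robotparser
from urllib.parse import urlparse

# Illustrative Tor SOCKS proxy settings; the script normally derives the port from user input.
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

def is_allowed_by_robots(url, user_agent='*'):
    # Fetch robots.txt through the Tor proxy and ask the parser whether `url` may be fetched.
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    try:
        r = requests.get(robots_url, proxies=proxies, timeout=60)
    except requests.exceptions.RequestException:
        return True  # robots.txt unreachable: assume allowed (a policy choice, not a rule)
    if r.status_code == 200:
        rp.parse(r.text.splitlines())
        return rp.can_fetch(user_agent, url)
    return True  # no robots.txt published: assume allowed
```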

utils.py

  1. Improve the extract_internal_links function:

    def extract_internal_links(html, base_url):
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a')
        internal_urls = set()
        base_domain = urlparse(base_url).netloc
        for link in links:
            href = link.get('href')
            if href and is_valid_link(href):
                full_url = urljoin(base_url, href)
                parsed_href = urlparse(full_url)
                href_domain = parsed_href.netloc
                if href_domain == base_domain:
                    internal_urls.add(full_url)
        return list(internal_urls)
  2. Add a function to check robots.txt:

    def check_robots_txt(url):
        parsed_url = urlparse(url)
        robots_url = urljoin(parsed_url.scheme + '://' + parsed_url.netloc, 'robots.txt')
        try:
            response = requests.get(robots_url)
            # Parse the robots.txt here and return whether the URL is allowed or not
            # This is a placeholder for the actual implementation
            return True
        except requests.exceptions.RequestException:
            return False
  3. Add a function to validate links:

    def is_valid_link(url):
        parsed_url = urlparse(url)
        # Reject schemes that cannot be cloned (mailto:, javascript:, etc.);
        # relative links have an empty scheme and are considered valid.
        if parsed_url.scheme not in ['http', 'https', '']:
            return False
        # Skip fragment-only anchors such as "#section"
        if not parsed_url.path and not parsed_url.netloc:
            return False
        # Exclude URLs whose extension does not lead to an HTML page
        last_segment = parsed_url.path.rsplit('/', 1)[-1]
        if '.' in last_segment and last_segment.rsplit('.', 1)[-1].lower() not in ['html', 'htm', 'php', 'asp']:
            return False
        return True
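The potential solution above also calls for URL normalization (resolving relative links and stripping fragment identifiers), which none of the snippets implement. A minimal sketch of such a helper, with the name `normalize_url` assumed rather than taken from the repo, might look like this:

```python
from urllib.parse import urljoin, urldefrag

def normalize_url(href, base_url):
    # Resolve a possibly relative href against base_url and drop any #fragment,
    # so that '/about#team' and '/about' are treated as the same page.
    absolute = urljoin(base_url, href)
    clean, _fragment = urldefrag(absolute)
    return clean
```

For example, `normalize_url('page.html#top', 'http://example.onion/docs/')` would yield `http://example.onion/docs/page.html`.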

The above code snippets provide a framework for the necessary changes to enable full website cloning. The actual implementation may require additional error handling, path normalization, and other considerations. The check_robots_txt function in utils.py will need to be fully implemented to parse and respect the rules in robots.txt.
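On the path-normalization point, a small helper that maps a page URL onto a local file path mirroring the site's structure could tie the two files together. The helper name and the output-directory argument are assumptions for illustration, not existing repo code:

```python
import os
from urllib.parse import urlparse

def url_to_local_path(url, root_dir):
    # Map e.g. http://example.onion/blog/post.html -> <root_dir>/blog/post.html;
    # directory-style URLs become index.html so the clone stays browsable.
    path = urlparse(url).path
    if path == '' or path.endswith('/'):
        path += 'index.html'
    return os.path.join(root_dir, *path.lstrip('/').split('/'))
```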


Files used for this task:

Changes on singlepage_cloner.py

To extend `singlepage_cloner.py` to recursively follow internal links and clone entire websites, including all pages and associated assets, while also implementing error handling for broken links and respecting robots.txt rules, we can follow these steps:

  1. **Parse robots.txt**: Before starting the cloning process, we should check the `robots.txt` file of the target website to ensure we are allowed to clone it. We can use the `urllib.robotparser` module from Python's standard library to parse and check the rules.
  2. **Recursive Cloning**: Modify the `clone_page` function to not only clone the single page but also to look for internal links and recursively clone those pages as well. We will use the `extract_internal_links` function from `utils.py` to get all internal links from the page.
  3. **Error Handling**: Implement error handling to manage broken links and other possible HTTP errors that may occur during the cloning process. This can be done by checking the response status code and handling exceptions.
  4. **Asset Management**: Ensure that all assets (CSS, JS, images) are downloaded and correctly linked within the cloned pages. This is partially handled by existing functions like `get_css_files`, `get_js_files`, and `get_images`, but they may need to be adjusted to handle the recursive nature of the task.
  5. **Avoiding Duplicates**: Keep track of already cloned pages and assets to avoid downloading them multiple times.
  6. **File and Directory Structure**: Maintain a proper directory structure that mirrors the structure of the original website to ensure that all internal links work correctly after cloning (a link-rewriting sketch follows this section).

Here is a rough outline of the code changes that need to be made:

```python
from urllib import robotparser
from urllib.parse import urlparse

# ... existing code ...

def is_allowed_by_robots(url):
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse(url).scheme + '://' + urlparse(url).netloc + '/robots.txt')
    rp.read()
    return rp.can_fetch("*", url)

def clone_website(link, visited=None):
    if visited is None:
        visited = set()

    if link in visited or not is_allowed_by_robots(link):
        return
    visited.add(link)

    try:
        html = download_page(link)
        # ... existing code to download assets ...
        internal_links = extract_internal_links(html, link)

        for internal_link in internal_links:
            if is_valid_link(internal_link) and internal_link not in visited:
                clone_website(internal_link, visited)
    except requests.exceptions.RequestException as e:
        print(f"Error downloading {link}: {e}")

# ... existing code ...

clone_link = input("Enter Link to clone: ")
clone_website(clone_link)
```

This is a high-level plan, and the actual implementation may require more detailed error handling, path normalization, and other considerations. The `extract_internal_links` function from `utils.py` will be crucial in this process, and it should be optimized for performance and accuracy as it will be called multiple times during the cloning process.
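For step 6 above, saving pages alone is not enough: the stored HTML still points at the live `.onion` URLs. A hedged sketch of rewriting same-domain links to root-relative paths (assuming the clone is later served with its output directory as the web root) could look like this:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def rewrite_internal_links(html, page_url):
    # Rewrite same-domain <a href> values to root-relative paths so cloned
    # pages link to each other instead of back to the original site.
    base_domain = urlparse(page_url).netloc
    soup = BeautifulSoup(html, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        target = urljoin(page_url, anchor['href'])
        parsed = urlparse(target)
        if parsed.netloc == base_domain:
            local = parsed.path if parsed.path else '/'
            if local.endswith('/'):
                local += 'index.html'   # directory URLs map to index.html
            anchor['href'] = local
    return str(soup)
```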
Changes on utils.py

To address the task for `utils.py`, we need to update and possibly add new functions to handle the extraction and validation of all internal links for recursive downloading. Here's a practical proposal to enhance the existing code:

  1. **Improve `extract_internal_links` Function:**
     - Ensure that the function returns absolute URLs instead of relative paths.
     - Filter out links that do not lead to HTML pages (e.g., links to images, PDFs, etc.).
  2. **Add Function to Normalize URLs:**
     - Create a function to convert relative URLs to absolute URLs based on the base URL.
     - Ensure that fragment identifiers (e.g., `#section`) are removed since they do not lead to different pages.
  3. **Add Function to Respect `robots.txt`:**
     - Implement a function that checks `robots.txt` rules for the base domain and filters out disallowed URLs.
  4. **Optimize for Performance:**
     - Use sets instead of lists for storing URLs to avoid duplicates and improve lookup times.
     - Consider using a session object if multiple requests to the same domain are made, to reuse the underlying TCP connection (a session sketch follows this section).
  5. **Error Handling:**
     - Add error handling in the link extraction process to gracefully handle cases where the HTML parsing fails.

Here's how the updated `utils.py` might look:

```python
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup


def extract_internal_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a')
    internal_urls = set()
    base_domain = urlparse(base_url).netloc
    for link in links:
        href = link.get('href')
        if href and is_valid_link(href):
            full_url = urljoin(base_url, href)
            parsed_href = urlparse(full_url)
            href_domain = parsed_href.netloc
            if href_domain == base_domain:
                internal_urls.add(full_url)
    return list(internal_urls)


def is_valid_link(url):
    parsed_url = urlparse(url)
    # Reject schemes that cannot be cloned (mailto:, javascript:, etc.);
    # relative links have an empty scheme and are considered valid.
    if parsed_url.scheme not in ['http', 'https', '']:
        return False
    # Skip fragment-only anchors such as "#section"
    if not parsed_url.path and not parsed_url.netloc:
        return False
    # Exclude URLs whose extension does not lead to an HTML page
    last_segment = parsed_url.path.rsplit('/', 1)[-1]
    if '.' in last_segment and last_segment.rsplit('.', 1)[-1].lower() not in ['html', 'htm', 'php', 'asp']:
        return False
    return True


def check_robots_txt(url):
    parsed_url = urlparse(url)
    robots_url = urljoin(parsed_url.scheme + '://' + parsed_url.netloc, 'robots.txt')
    try:
        response = requests.get(robots_url)
        # Parse the robots.txt here and return whether the URL is allowed or not
        # This is a placeholder for the actual implementation
        return True
    except requests.exceptions.RequestException:
        return False

# Add any additional functions or enhancements here
```

This updated code provides a more robust foundation for the task at hand. The actual implementation of `check_robots_txt` would require parsing the `robots.txt` file and determining if the URL is allowed, which is beyond the scope of this example but should be considered in the final implementation.
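Regarding the session suggestion in step 4 above, a shared `requests.Session` also keeps the Tor SOCKS proxy configuration in one place. This is only a sketch; the port shown is a placeholder, since the script normally asks the user for it:

```python
import requests

# One session for the whole crawl: connection pooling plus a single place
# to configure the Tor SOCKS proxy (the port here is illustrative).
session = requests.Session()
session.proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050',
}

# Helpers would then call session.get(url, timeout=60)
# instead of requests.get(url, proxies=proxies).
```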