unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scrapper
Apache License 2.0
16.62k stars 1.23k forks source link

`fix: improve relative URL handling` #232

Closed yulin0629 closed 2 weeks ago

yulin0629 commented 3 weeks ago

Replaces manual URL path resolution with Python's urljoin for more robust and standardized URL normalization. This prevents edge cases and ensures compliance with RFC 3986 URL specifications.

The change simplifies maintenance while handling all relative path cases correctly, including fragments and parent directory traversal.

Fixes: #231

crelocks commented 3 weeks ago

@yulin0629 I wrote this internally for my own purposes. In my testing, it handles all use cases. Not sure if this is helpful

from urllib.parse import urljoin, urlparse, urlunparse, ParseResult

# Input URL example
input_url = "https://abc.com/news"

# Helper function to correctly construct full URLs
def get_full_url(input_url, partial_url):
    # Step 1: Parse the input URL to get the base URL and path
    parsed_input_url = urlparse(input_url)
    base_url = f"{parsed_input_url.scheme}://{parsed_input_url.netloc}"
    base_path = parsed_input_url.path  # This is /news
    # Case 1: If the partial URL starts with '?', keep the base path and add the query
    if partial_url.startswith('?'):
        parsed_partial = urlparse(partial_url)  # Parse the query string
        # Construct the new URL with the base path and query string
        return urlunparse(ParseResult(
            scheme=parsed_input_url.scheme,
            netloc=parsed_input_url.netloc,
            path=base_path,  # Keep the /news path
            params='',  # No parameters in this case
            query=parsed_partial.query,  # Query from the partial URL
            fragment=''  # No fragment
        ))

    # Case 2: For other relative or absolute paths, use urljoin
    return urljoin(base_url, partial_url)

# # Example: Find all <a> tags and get their href attributes
# for link in soup.find_all('a', href=True):
#     partial_url = link['href']
#     full_url = get_full_url(base_url, partial_url)
#     print(f"Partial: {partial_url} => Full: {full_url}")

partial_urls = ["?page=2", "/news/abc.html", "https://abc.com/news/abc.html"]
for partial_url in partial_urls:
    full_url = get_full_url(input_url, partial_url)
    print(f"Partial: {partial_url} => Full: {full_url}")
yulin0629 commented 2 weeks ago

Thank you for sharing your get_full_url implementation! I've done some thorough testing comparing it with Python's built-in urljoin, and I'd like to share my findings.

While your implementation works for many cases, I found some edge cases where it behaves differently from the standard urljoin, particularly with relative paths when the base URL ends with a slash. Here's what I discovered:

base_url = "https://example.com/news/"

# Case 1: Relative path
href = "relative/path.html?page=2"
urljoin:      "https://example.com/news/relative/path.html?page=2"
get_full_url: "https://example.com/relative/path.html?page=2"

# Case 2: Current directory path
href = "./current/path.html?page=2"
urljoin:      "https://example.com/news/current/path.html?page=2"
get_full_url: "https://example.com/current/path.html?page=2"

The main difference is that urljoin maintains the directory structure more accurately according to URL specifications. While your implementation works for your specific use cases, I would recommend using Python's built-in urljoin for the following reasons:

  1. It correctly handles all edge cases with relative paths
  2. It's part of the standard library and well-tested
  3. It follows URL specifications more strictly
  4. It's more maintainable as it's a standard solution
unclecode commented 2 weeks ago

@yulin0629 Your pull request is already merged and available in the current version. For the next version, consider using a strategy design pattern to allow end-users to pass their own normalized function, providing more flexibility. I believe it's challenging to create a universal solution, but users can easily create their own strategies, especially when considering specific details.