unclecode / crawl4ai

🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
Apache License 2.0

Bug with Internal Links Extraction (Relative Links) #224

Closed — milukyna closed this issue 2 weeks ago

milukyna commented 2 weeks ago

Hi,

First of all, thank you for your amazing work on this project! While using the tool, I found that in the newest version (0.3.72), the internal-link extraction logic does not handle relative paths correctly.

What is happening

Example code

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    test_url = "https://www.some_url.com/English/index.html"
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url=test_url, magic=True)
    print(result["links"]["internal"])

asyncio.run(main())

Problem

Suppose the page contains a relative internal link such as "blog/index.html". The code above gives

[{'href': 'https://www.some_url.com/blog/index.html',
  'text': 'Original version',
  'title': ''},

while we would expect the following:

[{'href': 'https://www.some_url.com/English/blog/index.html',
  'text': 'English version',
  'title': ''},

Bug Origin

I believe the problem arises from normalize_url in utils.py:

def normalize_url(href, base_url):
    """Normalize URLs to ensure consistent format"""
    # Extract protocol and domain from base URL
    try:
        base_parts = base_url.split('/')
        protocol = base_parts[0]
        domain = base_parts[2]
    except IndexError:
        raise ValueError(f"Invalid base URL format: {base_url}")

    # (...)

    # Handle relative URLs
    if not href.startswith(('http://', 'https://')):
        # Remove leading './' if present
        href = href.lstrip('./')
        return f"{protocol}//{domain}/{href}"  # Here is the problem

    return href.strip()

In the code above, domain corresponds to www.some_url.com, but in the particular case of relative URLs we want to resolve against the base path and keep www.some_url.com/English.
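To illustrate, here is a minimal, self-contained reproduction of the string-splitting approach (simplified from the snippet above, using the hypothetical URLs from this report). It shows the dropped path segment, and also a second subtlety: str.lstrip('./') strips a *set* of characters rather than a prefix, so a leading "../" is silently removed as well:

```python
def buggy_normalize(href, base_url):
    """Simplified sketch of the string-splitting logic quoted above."""
    parts = base_url.split('/')  # ['https:', '', 'www.some_url.com', 'English', 'index.html']
    protocol, domain = parts[0], parts[2]
    if not href.startswith(('http://', 'https://')):
        # lstrip('./') removes ANY leading '.' or '/' characters, not just "./"
        href = href.lstrip('./')
        return f"{protocol}//{domain}/{href}"
    return href.strip()

base = "https://www.some_url.com/English/index.html"

print(buggy_normalize("blog/index.html", base))
# -> https://www.some_url.com/blog/index.html  (the 'English' segment is lost)

print(buggy_normalize("../other.html", base))
# -> https://www.some_url.com/other.html  (the '..' is stripped, changing the meaning)
```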

Fix

Maybe it would be cleaner to use urllib.parse, which is specifically designed to handle such situations.

def normalize_url(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse

    # Parse base URL to get components
    parsed_base = urlparse(base_url)
    if not parsed_base.scheme or not parsed_base.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")

    # Use urljoin to handle all cases
    normalized = urljoin(base_url, href.strip())
    return normalized
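For reference, urljoin resolves relative references according to RFC 3986, so the proposed version covers the case from this issue as well as absolute-path and parent-directory references (the URLs below are the hypothetical ones from the report):

```python
from urllib.parse import urljoin

base = "https://www.some_url.com/English/index.html"

# Relative path: resolved against the base's directory, keeping /English/
print(urljoin(base, "blog/index.html"))
# -> https://www.some_url.com/English/blog/index.html

# Absolute path: replaces the whole path
print(urljoin(base, "/blog/index.html"))
# -> https://www.some_url.com/blog/index.html

# Parent-directory reference: '..' is resolved, not stripped
print(urljoin(base, "../top.html"))
# -> https://www.some_url.com/top.html

# Already-absolute URLs pass through unchanged
print(urljoin(base, "https://example.org/x"))
# -> https://example.org/x
```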
unclecode commented 2 weeks ago

@milukyna Thank you so much for the suggestion. I totally agree, and I will apply the fix. Again, I appreciate you using the library, finding the bug, and helping us fix it. If you are interested, let me know your email address and I can invite you to our Discord channel to help us. Thank you so much.