mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
17.81k stars 1.31k forks

[Bug/Investigation] Some relative paths are incomplete #821

Open rafaelsideguide opened 2 days ago

rafaelsideguide commented 2 days ago

When crawling https://docs.cleanlab.ai, some URLs in the response are incomplete. For example, the response contains https://docs.cleanlab.ai/cleanlab/token_classification/index.html, which returns a 404. The correct URL is https://docs.cleanlab.ai/stable/cleanlab/token_classification/index.html.

This seems to occur because links on the page use relative paths, such as <a class="reference internal" href="../../tutorials/token_classification.html">Token Classification (text)</a>.

rafaelsideguide commented 2 days ago

When a page redirects, Firecrawl keeps resolving the links to crawl against the originally requested base URL rather than the final URL after the redirect. We’ll need to adjust the handling for these redirect cases.
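
To illustrate the redirect problem, here is a minimal sketch using the standard WHATWG `URL` constructor. It is not Firecrawl's actual code: `resolveLinks` is a hypothetical helper, and both the redirect target `https://docs.cleanlab.ai/stable/index.html` and the relative `href` are assumptions chosen to reproduce the two URLs reported above.

```typescript
// Hypothetical helper: resolve every href against a given base URL.
function resolveLinks(hrefs: string[], baseUrl: string): string[] {
  const resolved: string[] = [];
  for (const href of hrefs) {
    try {
      // new URL(relative, base) applies standard relative-URL resolution.
      resolved.push(new URL(href, baseUrl).href);
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return resolved;
}

// URL that was requested by the crawler.
const requestedUrl = "https://docs.cleanlab.ai/";
// Assumed final URL after the redirect (not confirmed in this issue).
const finalUrl = "https://docs.cleanlab.ai/stable/index.html";
// Assumed relative link found in the fetched page.
const hrefs = ["cleanlab/token_classification/index.html"];

// Resolving against the requested URL drops the "/stable/" segment.
const fromRequested = resolveLinks(hrefs, requestedUrl);
// Resolving against the final URL keeps it.
const fromFinal = resolveLinks(hrefs, finalUrl);

console.log(fromRequested[0]);
console.log(fromFinal[0]);
```

Under these assumptions, the first result is the 404 URL from the report and the second is the working one. One possible fix along these lines is to record the final URL of each response (for example, `response.url` when using `fetch`) and pass that as the base when resolving links.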