fix(crawler): relative URL handling on non-start pages

mogery commented 2 weeks ago

Fixes #821

Went for an easier fix: this only corrects the logic for adding relative URLs found in page content to the crawl. The new URL was being resolved against the wrong base URL.
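
For context, here is a minimal sketch of the kind of mistake described above. It is not the actual crawler code: `resolveLink` and the example.com URLs are hypothetical, and the only real API used is the standard WHATWG `URL` constructor. It shows how the same relative href resolves to different absolute URLs depending on which base it is paired with.

```typescript
// Hypothetical helper for illustration; not taken from the firecrawl codebase.
function resolveLink(href: string, baseUrl: string): string {
  // The URL constructor resolves `href` relative to `baseUrl`.
  return new URL(href, baseUrl).href;
}

// Assumed example values.
const startUrl = "https://example.com/";
const pageUrl = "https://example.com/docs/guide/intro.html";
const href = "../api/index.html";

// Resolving against the crawl's start URL loses the page's directory context.
const resolvedAgainstStart = resolveLink(href, startUrl);
// -> https://example.com/api/index.html

// Resolving against the page the link was found on gives the intended target.
const resolvedAgainstPage = resolveLink(href, pageUrl);
// -> https://example.com/docs/api/index.html

console.log(resolvedAgainstStart);
console.log(resolvedAgainstPage);
```

On the start page both bases coincide, which is presumably why the problem only shows up on non-start pages.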

rafaelsideguide commented 2 weeks ago

Hey @mogery, unfortunately, this doesn't fix the bug.

For the following example:

POST http://localhost:3002/v1/crawl HTTP/1.1
Authorization: Bearer fc-redacted
content-type: application/json

{
  "url": "https://docs.cleanlab.ai",
  "allowBackwardLinks": true
}

One of the pages I was expecting to find is https://docs.cleanlab.ai/stable/cleanlab/multilabel_classification/rank.html, but in the results, the crawler only retrieved the non-redirected URL https://docs.cleanlab.ai/cleanlab/multilabel_classification/rank.html (without /stable), which leads to a 404:

[Screenshot, 2024-11-12 at 09:31:48]

The base URL https://docs.cleanlab.ai redirects to https://docs.cleanlab.ai/stable/index.html through a non-DNS-based redirect (which we only catch after the first page response). This is causing the 404s.
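
To illustrate the redirect case, a rough sketch under stated assumptions: the `PageResponse` shape, the `baseForLinks` helper, and the relative href are made up for illustration (the real link markup on the page was not checked, and this is not the crawler's code). The two resulting URLs are the ones quoted above.

```typescript
// Hypothetical shape: the URL we asked for and the URL the response came from.
interface PageResponse {
  requestedUrl: string;
  finalUrl?: string; // URL after redirects were followed, if known
}

// Prefer the post-redirect URL as the base for relative links.
function baseForLinks(page: PageResponse): string {
  return page.finalUrl ?? page.requestedUrl;
}

const page: PageResponse = {
  requestedUrl: "https://docs.cleanlab.ai",
  finalUrl: "https://docs.cleanlab.ai/stable/index.html",
};

// Assumed relative link as it might appear in the page's HTML.
const relativeHref = "cleanlab/multilabel_classification/rank.html";

// Resolved against the pre-redirect URL: the /stable segment is missing (404).
const beforeRedirect = new URL(relativeHref, page.requestedUrl).href;

// Resolved against the post-redirect URL: the /stable segment is kept.
const afterRedirect = new URL(relativeHref, baseForLinks(page)).href;

console.log(beforeRedirect);
console.log(afterRedirect);
```

The fallback to `requestedUrl` keeps the helper usable when no redirect happened or the final URL is not available.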

mogery commented 2 weeks ago

My bad, forgot about that case. Will be fixing after standup.

mogery commented 2 weeks ago

@rafaelsideguide should work now, retest pls

rafaelsideguide commented 2 weeks ago

Looking good. 1354 pages crawled for this URL now (no 404s, apparently :D). Let's merge it!

mendableai / firecrawl

fix(crawler): relative URL handling on non-start pages #893