Closed mogery closed 2 weeks ago
Hey @mogery, unfortunately, this doesn't fix the bug.
For the following example:
POST http://localhost:3002/v1/crawl HTTP/1.1
Authorization: Bearer fc-redacted
content-type: application/json
{
"url": "https://docs.cleanlab.ai",
"allowBackwardLinks": true
}
One of the pages I was expecting to find is https://docs.cleanlab.ai/stable/cleanlab/multilabel_classification/rank.html
, but in the results, the crawler only retrieved the non-redirected URL https://docs.cleanlab.ai/cleanlab/multilabel_classification/rank.html
(without /stable
), which leads to a 404:
The base URL https://docs.cleanlab.ai
redirects to https://docs.cleanlab.ai/stable/index.html
through a non-DNS-based redirect (which we only catch after the first page response). This is causing the 404s.
My bad, forgot about that case. Will be fixing after standup.
@rafaelsideguide should work now, retest pls
Looking good. 1354 pages crawled for this url now (no 404s apparently :D) let's merge it!
Fixes #821
Went for an easier fix. Just fixes the logic when adding relative URLs to the crawl from the site content. Was basing the
new URL
off of the wrong base URL.