mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0
14.41k stars 1.05k forks

discussion: How to handle base url redirects on /crawl? #311

Open nickscamara opened 2 months ago

nickscamara commented 2 months ago

In very specific cases, the base URL that the user inputs redirects to a different website with a new base URL. How should we handle that?

Example /crawl input: new.abb.com/sustainability/foundation redirects to https://global.abb/group/en/sustainability

snippet commented 2 months ago

I faced this situation today @nickscamara

I will share the specific case that I am currently working on #336

Crawling through Frameworks/Libs documentation

In this case, it's important to retrieve the content of explicit links that point to external sites. (It is not necessary to crawl deep into the external website.)

Here is an example from docs.expo.dev: [image] This part of the docs mentions the New Architecture of React Native and links to the external React Native docs. If the crawler doesn't visit that external page, the Agent may miss an important piece of content.

My solution:

Usage:

Examples:

allowExternalContentLinks disabled:

{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*"
      ]
    }
  },
  "returnValue": [
    {
      "url": "https://www.mendable.ai"
    },
    {
      "url": "https://www.mendable.ai/"
    },
    {
      "url": "https://www.mendable.ai/pricing"
    },
    {
      "url": "https://www.mendable.ai/signup"
    },
    {
      "url": "https://www.mendable.ai/usecases/sales-enablement"
    },
    {
      "url": "https://www.mendable.ai/usecases/documentation"
    },
    {
      "url": "https://www.mendable.ai/usecases/cs-enablement"
    },
    {
      "url": "https://www.mendable.ai/usecases/productcopilot"
    },
    {
      "url": "https://www.mendable.ai/security"
    },
    {
      "url": "https://www.mendable.ai/privacy-policy"
    },
    {
      "url": "https://www.mendable.ai/terms-of-conditions"
    },
    {
      "url": "https://mendable.ai"
    }
  ]
}

allowExternalContentLinks enabled, excluding content from 0x.org:

{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*",
        "0x.org"
      ],
      "allowExternalContentLinks": true
    }
  },
  "returnValue": [
    {
      "url": "https://www.mendable.ai"
    },
    {
      "url": "https://www.mendable.ai/"
    },
    {
      "url": "https://www.mendable.ai/pricing"
    },
    {
      "url": "https://www.mendable.ai/signin"
    },
    {
      "url": "https://www.mendable.ai/signup"
    },
    {
      "url": "https://www.mendable.ai/usecases/documentation"
    },
    {
      "url": "https://www.mendable.ai/usecases/cs-enablement"
    },
    {
      "url": "https://www.mendable.ai/usecases/sales-enablement"
    },
    {
      "url": "https://www.mendable.ai/usecases/productcopilot"
    },
    {
      "url": "https://mendable.wolfia.com/?ref=mendable-website"
    },
    {
      "url": "https://mendable.ai"
    },
    {
      "url": "https://docs.mendable.ai/integrations/slack"
    },
    {
      "url": "https://docs.mendable.ai/examples"
    },
    {
      "url": "https://docs.mendable.ai/tools"
    },
    {
      "url": "https://docs.mendable.ai/changelog"
    },
    {
      "url": "https://www.mendable.ai/security"
    },
    {
      "url": "https://www.mendable.ai/privacy-policy"
    },
    {
      "url": "https://www.mendable.ai/terms-of-conditions"
    },
    {
      "url": "https://www.dropbox.com/scl/fi/d6zofma4c1d9nq7sgjwhx/Mendable_Sales-Enablement_Case-Study.pdf?rlkey=819zj7zi0rjakjc0c2p255k5b&dl=0"
    }
  ]
}

rafaelsideguide commented 2 months ago

Hey @snippet I just reviewed your PR and this feature is awesome (especially when used with ignoreSitemap), but I don't think it closes this issue because the problem here is with pages that have redirects (like new.abb.com/sustainability/foundation which redirects to https://global.abb/group/en/sustainability).

snippet commented 2 months ago

> Hey @snippet I just reviewed your PR and this feature is awesome (especially when used with ignoreSitemap), but I don't think it closes this issue because the problem here is with pages that have redirects (like new.abb.com/sustainability/foundation which redirects to https://global.abb/group/en/sustainability).

Thanks!

I mentioned my approach to scraping external links in this discussion because it could also be an alternative fix for this problem. When new.abb.com/sustainability/foundation redirects to global.abb/group/en/sustainability, I think one reasonable approach is to scrape only that specific link, with the option to exclude sub-domains through an exclude parameter.
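
To make the idea concrete, here is a rough sketch of how a crawler could decide what to do with each discovered link. It is not Firecrawl's actual implementation: `classifyLink`, `matchesExclude`, `normalizeHost`, and the "scrape-only" action are hypothetical names, and the rule that a leading "www." is ignored when comparing hosts is an assumption. Only `allowExternalContentLinks` and `excludes` come from the examples above.

```typescript
// Hypothetical sketch, not Firecrawl's real code.
type LinkAction = "crawl" | "scrape-only" | "skip";

interface LinkOptions {
  excludes: string[]; // patterns such as "/blog/*" or "0x.org"
  allowExternalContentLinks: boolean;
}

// Assumption: a pattern matches anywhere in the full URL, and "*" is a wildcard.
function matchesExclude(url: string, patterns: string[]): boolean {
  return patterns.some((pattern) => {
    const escaped = pattern
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
      .replace(/\*/g, ".*"); // turn "*" into "match anything"
    return new RegExp(escaped).test(url);
  });
}

// Assumption: "www.example.com" and "example.com" count as the same site.
function normalizeHost(url: string): string {
  return new URL(url).hostname.replace(/^www\./, "");
}

function classifyLink(link: string, baseUrl: string, options: LinkOptions): LinkAction {
  // Excluded links are never visited, internal or external.
  if (matchesExclude(link, options.excludes)) return "skip";

  // Internal links are crawled recursively as usual.
  if (normalizeHost(link) === normalizeHost(baseUrl)) return "crawl";

  // External links are fetched once, without following their own links.
  return options.allowExternalContentLinks ? "scrape-only" : "skip";
}
```

With this shape, an external page that the docs link to explicitly would be scraped once, while an excluded host such as 0x.org would be skipped even when external links are allowed.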

yurilaguardia commented 2 months ago

Maybe the start method of the WebCrawler should first make an initial request to the 'candidate' initialUrl.

If the response URL is different, update initialUrl (maybe based on some new property like "followRedirects" that could be added as an option to the crawler constructor) before crawling it. Also, update robotsTxtUrl with the new initialUrl before using it for the get request.
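
A minimal sketch of that idea, assuming the initial request has already been made and its final URL is known (for example, the `url` property of a `fetch` response that followed redirects). The function name `resolveCrawlTargets` and the `CrawlTargets` shape are hypothetical and only for illustration; `followRedirects` is the proposed option, not an existing one.

```typescript
// Hypothetical sketch of the proposal, not the WebCrawler's real code.
interface CrawlTargets {
  initialUrl: string; // URL the crawl should start from
  robotsTxtUrl: string; // robots.txt location derived from initialUrl
  redirected: boolean; // whether the candidate URL was redirected
}

function resolveCrawlTargets(
  candidateUrl: string, // the URL the user passed to /crawl
  responseUrl: string, // the final URL reported by the initial request
  followRedirects: boolean // proposed crawler option
): CrawlTargets {
  const candidate = new URL(candidateUrl);
  const finalUrl = new URL(responseUrl);
  const redirected = finalUrl.href !== candidate.href;

  // Only switch to the redirect target when the option allows it.
  const effective = followRedirects && redirected ? finalUrl : candidate;

  return {
    initialUrl: effective.href,
    // robots.txt must be read from the origin that will actually be crawled.
    robotsTxtUrl: new URL("/robots.txt", effective.origin).href,
    redirected,
  };
}

// Example using the redirect from this issue (scheme added so the URL parses).
const targets = resolveCrawlTargets(
  "https://new.abb.com/sustainability/foundation",
  "https://global.abb/group/en/sustainability",
  true
);
console.log(targets.initialUrl); // https://global.abb/group/en/sustainability
console.log(targets.robotsTxtUrl); // https://global.abb/robots.txt
```

Keeping the decision in a pure function like this makes the redirect handling easy to test without network access; the initial request itself would live in the start method.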

nickscamara commented 2 months ago

@yurilaguardia Great idea!

rafaelsideguide commented 1 month ago

Closed by #389