nickscamara opened 2 months ago
I faced this situation today @nickscamara
I will share the specific case that I am currently working on (#336).
In this case, it's important to retrieve the content of explicit links to external domains (it's not necessary to crawl deep into the external website).
Here is an example from docs.expo.dev: this part of the doc mentions the New Architecture of React Native and links out to the external React Native docs. If the crawler doesn't visit this external page, the Agent may be missing an important piece of content.
My solution:
Usage:
- Add a `crawlerOptions` param called `allowExternalContentLinks` as a boolean.
- Reuse the `excludes` param to also accept external domains, such as `["/blog/*", "ycombinator.com", "0x.org"]`.
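As a rough illustration of how such a filter might behave (the function name `shouldCrawlLink` and the exact matching rules are my own assumptions for this sketch, not Firecrawl's actual implementation):

```typescript
// Hypothetical sketch: decide whether a discovered link should be scraped,
// given the base domain, the excludes list, and allowExternalContentLinks.
function shouldCrawlLink(
  link: string,
  baseDomain: string,
  excludes: string[],
  allowExternalContentLinks: boolean
): boolean {
  const url = new URL(link);
  // Simplified host check: a suffix match stands in for real domain parsing.
  const isExternal = !url.hostname.endsWith(baseDomain);
  // Patterns starting with "/" exclude internal paths (glob on the pathname);
  // anything else is treated as an external domain to exclude.
  const excluded = excludes.some((pattern) =>
    pattern.startsWith("/")
      ? new RegExp(pattern.replace(/\*/g, ".*")).test(url.pathname)
      : url.hostname.endsWith(pattern)
  );
  if (excluded) return false;
  // External pages are scraped only when the flag is on, and only the
  // linked page itself — the crawler does not go deeper from there.
  if (isExternal) return allowExternalContentLinks;
  return true;
}
```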
Examples:

`allowExternalContentLinks` disabled:

```json
{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*"
      ]
    }
  },
  "returnValue": [
    { "url": "https://www.mendable.ai" },
    { "url": "https://www.mendable.ai/" },
    { "url": "https://www.mendable.ai/pricing" },
    { "url": "https://www.mendable.ai/signup" },
    { "url": "https://www.mendable.ai/usecases/sales-enablement" },
    { "url": "https://www.mendable.ai/usecases/documentation" },
    { "url": "https://www.mendable.ai/usecases/cs-enablement" },
    { "url": "https://www.mendable.ai/usecases/productcopilot" },
    { "url": "https://www.mendable.ai/security" },
    { "url": "https://www.mendable.ai/privacy-policy" },
    { "url": "https://www.mendable.ai/terms-of-conditions" },
    { "url": "https://mendable.ai" }
  ]
}
```
`allowExternalContentLinks` enabled, excluding content from 0x.org:

```json
{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*",
        "0x.org"
      ],
      "allowExternalContentLinks": true
    }
  },
  "returnValue": [
    { "url": "https://www.mendable.ai" },
    { "url": "https://www.mendable.ai/" },
    { "url": "https://www.mendable.ai/pricing" },
    { "url": "https://www.mendable.ai/signin" },
    { "url": "https://www.mendable.ai/signup" },
    { "url": "https://www.mendable.ai/usecases/documentation" },
    { "url": "https://www.mendable.ai/usecases/cs-enablement" },
    { "url": "https://www.mendable.ai/usecases/sales-enablement" },
    { "url": "https://www.mendable.ai/usecases/productcopilot" },
    { "url": "https://mendable.wolfia.com/?ref=mendable-website" },
    { "url": "https://mendable.ai" },
    { "url": "https://docs.mendable.ai/integrations/slack" },
    { "url": "https://docs.mendable.ai/examples" },
    { "url": "https://docs.mendable.ai/tools" },
    { "url": "https://docs.mendable.ai/changelog" },
    { "url": "https://www.mendable.ai/security" },
    { "url": "https://www.mendable.ai/privacy-policy" },
    { "url": "https://www.mendable.ai/terms-of-conditions" },
    { "url": "https://www.dropbox.com/scl/fi/d6zofma4c1d9nq7sgjwhx/Mendable_Sales-Enablement_Case-Study.pdf?rlkey=819zj7zi0rjakjc0c2p255k5b&dl=0" }
  ]
}
```
Hey @snippet I just reviewed your PR and this feature is awesome (especially when used with `ignoreSitemap`), but I don't think it closes this issue, because the problem here is with pages that have redirects (like new.abb.com/sustainability/foundation, which redirects to https://global.abb/group/en/sustainability).
Thanks!
I mentioned my approach to scraping external links in this discussion because it could also be an alternative fix for the problem mentioned here. When new.abb.com/sustainability/foundation redirects to global.abb/group/en/sustainability, in my view an interesting approach is to scrape only that specific link, while allowing sub-domains to be excluded through an exclude parameter.
Maybe the `start` method of the `WebCrawler` should first make an initial request to the 'candidate' `initialUrl`. If the response URL is different, update `initialUrl` (perhaps gated behind a new property like `followRedirects` that could be added as an option to the crawler constructor) before crawling it. Also, update `robotsTxtUrl` with the new `initialUrl` before using it for the `get` request.
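A minimal sketch of that idea (the names `resolveInitialUrl`/`startCrawl` and the use of the global `fetch` are my own assumptions for illustration, not Firecrawl's actual code):

```typescript
// Sketch: resolve redirects on the candidate initialUrl before crawling.
// Assumes Node 18+ (global fetch); function and option names are hypothetical.
async function resolveInitialUrl(candidateUrl: string): Promise<string> {
  // fetch follows redirects by default; res.url holds the final URL.
  const res = await fetch(candidateUrl, { method: "HEAD", redirect: "follow" });
  return res.url;
}

async function startCrawl(
  initialUrl: string,
  followRedirects: boolean = true
): Promise<{ baseUrl: string; robotsTxtUrl: string }> {
  let baseUrl = initialUrl;
  if (followRedirects) {
    const finalUrl = await resolveInitialUrl(initialUrl);
    if (finalUrl !== initialUrl) {
      // The redirect target becomes the crawl's base URL.
      baseUrl = finalUrl;
    }
  }
  // Recompute robotsTxtUrl from the (possibly updated) base URL so
  // robots.txt is fetched from the correct host.
  const robotsTxtUrl = new URL("/robots.txt", baseUrl).toString();
  return { baseUrl, robotsTxtUrl };
}
```

With `followRedirects` off, the crawler would behave exactly as it does today; with it on, the ABB example would crawl under https://global.abb instead of dead-ending on the redirect.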
@yurilaguardia Great idea!
Closed by #389
In very specific cases, the base URL that the user inputs will redirect to a new website with a new base URL. How should we handle that?
Example input to `/crawl`: new.abb.com/sustainability/foundation -> redirects to -> https://global.abb/group/en/sustainability