nickscamara opened 2 months ago
I faced this situation today @nickscamara
I will share the specific case that I am currently working on (#336).
In this case, it's important to retrieve the content of explicit links to external domains (it's not necessary to crawl deep into the external website).
Here is an example from docs.expo.dev: this part of the doc mentions the New Architecture of React Native and links out to the external React Native docs. If the crawler doesn't visit this external page, the Agent may be missing an important piece of content.
My solution:
Usage:
- Add a `crawlerOptions` param called `allowExternalContentLinks` as a boolean.
- Reuse the `excludes` param to also accept external domains, such as `["/blog/*", "ycombinator.com", "0x.org"]`.
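As a rough illustration of how such a filter might behave (the function name `shouldCrawlLink` and the exact matching rules are my own assumptions for this sketch, not Firecrawl's actual implementation):

```typescript
// Hypothetical sketch: decide whether a discovered link should be scraped,
// given the base domain, the excludes list, and allowExternalContentLinks.
function shouldCrawlLink(
  link: string,
  baseDomain: string,
  excludes: string[],
  allowExternalContentLinks: boolean
): boolean {
  const url = new URL(link);
  // Simplified host check: a suffix match stands in for real domain parsing.
  const isExternal = !url.hostname.endsWith(baseDomain);
  // Patterns starting with "/" exclude internal paths (glob on the pathname);
  // anything else is treated as an external domain to exclude.
  const excluded = excludes.some((pattern) =>
    pattern.startsWith("/")
      ? new RegExp(pattern.replace(/\*/g, ".*")).test(url.pathname)
      : url.hostname.endsWith(pattern)
  );
  if (excluded) return false;
  // External pages are scraped only when the flag is on, and only the
  // linked page itself — the crawler does not go deeper from there.
  if (isExternal) return allowExternalContentLinks;
  return true;
}
```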
Examples:

`allowExternalContentLinks` disabled:

```json
{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*"
      ]
    }
  },
  "returnValue": [
    { "url": "https://www.mendable.ai" },
    { "url": "https://www.mendable.ai/" },
    { "url": "https://www.mendable.ai/pricing" },
    { "url": "https://www.mendable.ai/signup" },
    { "url": "https://www.mendable.ai/usecases/sales-enablement" },
    { "url": "https://www.mendable.ai/usecases/documentation" },
    { "url": "https://www.mendable.ai/usecases/cs-enablement" },
    { "url": "https://www.mendable.ai/usecases/productcopilot" },
    { "url": "https://www.mendable.ai/security" },
    { "url": "https://www.mendable.ai/privacy-policy" },
    { "url": "https://www.mendable.ai/terms-of-conditions" },
    { "url": "https://mendable.ai" }
  ]
}
```
`allowExternalContentLinks` enabled, excluding content from 0x.org:

```json
{
  "jobData": {
    "url": "https://www.mendable.ai",
    "mode": "crawl",
    "crawlerOptions": {
      "returnOnlyUrls": true,
      "ignoreSitemap": true,
      "excludes": [
        "/blog/*",
        "0x.org"
      ],
      "allowExternalContentLinks": true
    }
  },
  "returnValue": [
    { "url": "https://www.mendable.ai" },
    { "url": "https://www.mendable.ai/" },
    { "url": "https://www.mendable.ai/pricing" },
    { "url": "https://www.mendable.ai/signin" },
    { "url": "https://www.mendable.ai/signup" },
    { "url": "https://www.mendable.ai/usecases/documentation" },
    { "url": "https://www.mendable.ai/usecases/cs-enablement" },
    { "url": "https://www.mendable.ai/usecases/sales-enablement" },
    { "url": "https://www.mendable.ai/usecases/productcopilot" },
    { "url": "https://mendable.wolfia.com/?ref=mendable-website" },
    { "url": "https://mendable.ai" },
    { "url": "https://docs.mendable.ai/integrations/slack" },
    { "url": "https://docs.mendable.ai/examples" },
    { "url": "https://docs.mendable.ai/tools" },
    { "url": "https://docs.mendable.ai/changelog" },
    { "url": "https://www.mendable.ai/security" },
    { "url": "https://www.mendable.ai/privacy-policy" },
    { "url": "https://www.mendable.ai/terms-of-conditions" },
    { "url": "https://www.dropbox.com/scl/fi/d6zofma4c1d9nq7sgjwhx/Mendable_Sales-Enablement_Case-Study.pdf?rlkey=819zj7zi0rjakjc0c2p255k5b&dl=0" }
  ]
}
```
Hey @snippet I just reviewed your PR and this feature is awesome (especially when used with `ignoreSitemap`), but I don't think it closes this issue, because the problem here is with pages that have redirects (like new.abb.com/sustainability/foundation, which redirects to https://global.abb/group/en/sustainability).
Thanks!
I mentioned my approach to scraping external links in this discussion because it could also be an alternative fix for the problem mentioned here. When new.abb.com/sustainability/foundation redirects to global.abb/group/en/sustainability, in my view an interesting approach is to scrape only that specific link, while allowing sub-domains to be excluded through an exclude parameter.
Maybe the `start` method of the `WebCrawler` should first make an initial request to the 'candidate' `initialUrl`. If the response URL is different, update `initialUrl` (perhaps gated behind a new property like `followRedirects` that could be added as an option to the crawler constructor) before crawling it. Also, update `robotsTxtUrl` with the new `initialUrl` before using it for the `get` request.
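A minimal sketch of that idea (the names `resolveInitialUrl`/`startCrawl` and the use of the global `fetch` are my own assumptions for illustration, not Firecrawl's actual code):

```typescript
// Sketch: resolve redirects on the candidate initialUrl before crawling.
// Assumes Node 18+ (global fetch); function and option names are hypothetical.
async function resolveInitialUrl(candidateUrl: string): Promise<string> {
  // fetch follows redirects by default; res.url holds the final URL.
  const res = await fetch(candidateUrl, { method: "HEAD", redirect: "follow" });
  return res.url;
}

async function startCrawl(
  initialUrl: string,
  followRedirects: boolean = true
): Promise<{ baseUrl: string; robotsTxtUrl: string }> {
  let baseUrl = initialUrl;
  if (followRedirects) {
    const finalUrl = await resolveInitialUrl(initialUrl);
    if (finalUrl !== initialUrl) {
      // The redirect target becomes the crawl's base URL.
      baseUrl = finalUrl;
    }
  }
  // Recompute robotsTxtUrl from the (possibly updated) base URL so
  // robots.txt is fetched from the correct host.
  const robotsTxtUrl = new URL("/robots.txt", baseUrl).toString();
  return { baseUrl, robotsTxtUrl };
}
```

With `followRedirects` off, the crawler would behave exactly as it does today; with it on, the ABB example would crawl under https://global.abb instead of dead-ending on the redirect.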
@yurilaguardia Great idea!
Closed by #389
In very specific cases, the base URL that the user inputs will redirect to a new website with a new base URL. How should we handle that?
Example input to `/crawl`: new.abb.com/sustainability/foundation -> redirects to -> https://global.abb/group/en/sustainability