Closed: @sebs closed this issue 1 year ago
Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney serve their main domain as the root page while every link they care about points to a different DNS name on the page. That pattern makes it hard to gather all of a website's data.
@sebs this is now available in 1.42.0.
Thank you for the issue!
ah, I love this so much ;) you are solving a big problem for me. Trying to build a URL dataset of 20 million for a coding challenge ;)
I really appreciate this as it saves me a ton of time.
@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.
I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?
@sebs at the moment it is not available in the CLI. Not all features carry over 1:1; if they fit the CLI, they need to be added separately. Going to re-open this issue for the CLI.
Now available in the CLI v1.45.10. Example below to group domains:

```sh
spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o
```

The `E` flag can also be written as `external-domains`.
Maybe make it possible to add a `*` to extract all external domain links?
Background: one thing I am using the tool for is to create link maps ... aka page a links to page b.
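To make the link-map idea concrete: a minimal sketch in plain Rust of the structure I have in mind (the `build_link_map` helper and the sample data are my own illustration, not part of spider's API):

```rust
use std::collections::HashMap;

/// Build a link map (page -> pages it links to) from crawl results.
fn build_link_map<'a>(pages: &[(&'a str, Vec<&'a str>)]) -> HashMap<&'a str, Vec<&'a str>> {
    let mut map: HashMap<&'a str, Vec<&'a str>> = HashMap::new();
    for (source, links) in pages {
        // Accumulate all outgoing links per source page.
        map.entry(*source).or_default().extend(links.iter().copied());
    }
    map
}

fn main() {
    // Hypothetical crawl results: (source page, links found on it).
    let pages = vec![
        ("https://example.com/a", vec!["https://example.com/b", "https://other.org/"]),
        ("https://example.com/b", vec!["https://example.com/a"]),
    ];
    let map = build_link_map(&pages);
    for (page, targets) in &map {
        println!("{} -> {:?}", page, targets);
    }
}
```

The external-domain wildcard matters here because without it the crawler never surfaces the cross-domain edges of the map.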
@sebs done via 1.46.0. Thank you!
<3
Hi, perhaps I'm using this incorrectly, but when I try the following command, using spider_cli 1.80.78:

```sh
spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape
```
I never see any URLs from the external domain, even though one of the crawled pages, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:
Download our one-pager for more information
The output from the scrape command looks like this for that page:

```json
{
  "html": "",
  "links": [],
  "url": "https://www.theconsortium.cloud/application-consulting-services-page"
},
```
Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.
How do I extract the URLs pointing to other domains, using the crate, not the CLI? Trying to make a crawler with self-discovery of new sites from one seed.
Use `website.external_domains` to add domains into the group for discovery.
I mean to catch all the websites that aren't under the same domain, not just the ones I specify; like using `-E *`, a catchall.
Set `website.external_domains` to a wildcard. If this isn't a thing yet, I can add it in later.
I don't think it is a thing.
https://github.com/spider-rs/spider/issues/135#issuecomment-1733947549 looks like it was done. Use `website.with_external_domains`.
asks me to provide an argument
Correct, follow the type for the function. Set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.
I used this:

```rust
.with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))
```

and this:

```rust
.with_external_domains(Some(std::iter::once("*".to_string())));
```

This compiles just fine but doesn't give me any external links from the site.
```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");
    website
        .with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));
    website.crawl().await;

    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    for link in links {
        println!("{:?}", link.as_ref());
    }
}
```
I don't think I understand what the wildcard for this is.
I only get 'internal links'.
Is there a way to get external links too?
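While the wildcard behavior is unclear, one workaround independent of the crate is to collect everything you can and post-filter links by host yourself. A naive sketch in plain Rust (`host_of` and `partition_links` are my own helpers, not spider's API, and the host parsing assumes simple `scheme://host/...` URLs):

```rust
/// Naive host extraction: everything between "://" and the next '/'.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split("://").nth(1)?;
    Some(rest.split('/').next().unwrap_or(rest))
}

/// Split links into (internal, external) relative to a root URL,
/// treating subdomains of the root host as internal.
fn partition_links<'a>(root: &str, links: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let root_host = host_of(root).unwrap_or(root);
    links.iter().copied().partition(|l| {
        host_of(l)
            .map(|h| h == root_host || h.ends_with(&format!(".{}", root_host)))
            .unwrap_or(false)
    })
}

fn main() {
    let links = [
        "https://carboxi.de/about",
        "https://blog.carboxi.de/post",
        "https://other.example/",
    ];
    let (internal, external) = partition_links("https://carboxi.de", &links);
    println!("internal: {:?}", internal);
    println!("external: {:?}", external);
}
```

For a real crawler you would want proper URL parsing (e.g. the `url` crate) instead of the string splitting above, but the partitioning idea is the same.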