spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

Also extract URLs that point to other domains? [CLI] #135

Closed: sebs closed this issue 1 year ago

sebs commented 1 year ago

I only get 'internal links'.

Is there a way to get external links too?

j-mendez commented 1 year ago

Hi @sebs, not at the moment. It would be a nice feature to have. Some companies like Disney have their main domain as the root page, while every link they care about on that page is treated as a different DNS name. That pattern makes it hard to gather all of a website's data.

j-mendez commented 1 year ago

@sebs this is now available in 1.42.0.

Crawling multiple domains as one now works, e.g. for the URLs https://rsseau.fr and https://loto.rsseau.fr.

Thank you for the issue!
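
For crate users, a minimal sketch of the same grouping, assuming the with_external_domains builder and the iterator-of-strings argument shape that appear in the snippets further down this thread; the exact signature may differ between versions.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Crawl rsseau.fr and treat loto.rsseau.fr as part of the same crawl group.
    let mut website: Website = Website::new("https://rsseau.fr");

    website.with_external_domains(Some(
        vec!["https://loto.rsseau.fr/"]
            .into_iter()
            .map(|s| s.to_string()),
    ));

    website.crawl().await;

    // Links collected from both domains land in the same set.
    for link in website.get_links() {
        println!("{:?}", link.as_ref());
    }
}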

sebs commented 1 year ago

Ah, I love this so much ;) You are solving a big problem for me. I am trying to build a URL dataset of 20 million for a coding challenge ;)

I really appreciate this as it saves me a ton of time.

j-mendez commented 1 year ago

@sebs no worries at all, feel free to keep reporting issues even if it is just a simple question! This feature is something I have wanted for a while too, since this project is the main engine for collecting data across a couple of things I use.

sebs commented 1 year ago

I did not find the option for external domains in the CLI version of spider. Maybe the change did not make it through?

j-mendez commented 1 year ago

@sebs it is not available in the CLI at the moment. Not all features carry over 1:1; if they fit the CLI, they also need to be added separately. Going to re-open this issue for the CLI.

j-mendez commented 1 year ago

Now available in the CLI v1.45.10. Example below to group domains.

spider --domain https://rsseau.fr -E https://loto.rsseau.fr/ crawl -o.

The -E flag can also be written as --external-domains.

sebs commented 1 year ago

Maybe make it possible to pass a * to extract all external domain links?

Background: one thing I am using the tool for is to create link maps, i.e. page A links to page B.

j-mendez commented 1 year ago

@sebs done via 1.46.0. Thank you!

sebs commented 1 year ago

<3

apsaltis commented 8 months ago

Hi, perhaps I'm using this incorrectly, but when I try the following command with spider_cli 1.80.78:

spider -t -v --url https://www.theconsortium.cloud/ --depth 10 -s -E https://39834791.fs1.hubspotusercontent-na1.net/ scrape

I never see any URLs from the external domain, even though one of the crawled pages, https://www.theconsortium.cloud/application-consulting-services-page, has a button that links to a PDF on HubSpot. The HTML looks like this:

Download our one-pager for more information

The output from the scrape command looks like this for that page:

{
  "html": "",
  "links": [],
  "url": "https://www.theconsortium.cloud/application-consulting-services-page"
},

Is there a way, either programmatically or via the CLI, to have spider detect all of the links on a page? Thanks in advance.

scientiac commented 5 months ago

How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I am trying to make a crawler that discovers new sites on its own from one seed.

j-mendez commented 5 months ago

> How do I extract the URLs pointing to other domains, using the crate rather than the CLI? I am trying to make a crawler that discovers new sites on its own from one seed.

Use website.external_domains to add domains into the group for discovery.

scientiac commented 5 months ago

I mean catching all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

j-mendez commented 5 months ago

> I mean catching all the websites that aren't under the same domain, not just the ones I specify; like using -E *, a catch-all.

Set website.external_domains to a wildcard. If this isn't a thing yet, I can add it in later.

scientiac commented 5 months ago

I don't think it is a thing.

j-mendez commented 5 months ago

> I don't think it is a thing.

CASELESS_WILD_CARD external domains handling

https://github.com/spider-rs/spider/issues/135#issuecomment-1733947549 looks like it was done. Use website.with_external_domains.
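
For reference, a minimal sketch of the wildcard call being pointed to here, using a placeholder start URL (https://example.com) and the same iterator-style argument seen in the snippets below; whether "*" actually surfaces off-domain links in a given release is what the rest of this thread is trying to confirm.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder seed URL for illustration only.
    let mut website: Website = Website::new("https://example.com");

    // "*" is the wildcard referred to above; the Option<iterator-of-String>
    // argument shape mirrors the snippets later in this thread.
    website.with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    for link in website.get_links() {
        println!("{:?}", link.as_ref());
    }
}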

scientiac commented 5 months ago

image

asks me to provide an argument

j-mendez commented 5 months ago

> image
>
> asks me to provide an argument

Correct, follow the type of the function and set the value to a wildcard. Not sure what IDE that is; rust-analyzer is almost a must when using any crate.

scientiac commented 5 months ago

I used this:

        .with_external_domains(Some(vec!["*"].into_iter().map(|s| s.to_string())))

and this:

        .with_external_domains(Some(std::iter::once("*".to_string())));

This compiles just fine, but doesn't give me any external links from the site:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://carboxi.de");

    website.with_respect_robots_txt(true)
        .with_subdomains(true)
        .with_external_domains(Some(std::iter::once("*".to_string())));

    website.crawl().await;

    let links = website.get_links();
    let url = website.get_url().inner();
    let status = website.get_status();

    println!("URL: {:?}", url);
    println!("Status: {:?}\n", status);

    for link in links {
        println!("{:?}", link.as_ref());
    }
}

I don't think I understand what the wildcard for this is.