@mgifford Adding the git clone intro into the example section is a good addition. For the Rust examples, extracting the data at runtime requires using Rust code, e.g.: https://github.com/spider-rs/spider/blob/main/examples/example.rs#L21.
The CLI comes with spider --help to print the available commands. By default, if the crawl subcommand is declared without a transport type, it will use https.
I realized that the examples do not highlight the callback options upfront in the central spider Rust repo.
Here is an example of getting data from the crawler at runtime:
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let url = "https://choosealicense.com";
    let mut website: Website = Website::new(&url);

    website.on_link_find_callback = |s| {
        // Callback to run on each link found: do custom logic here
        println!("link target: {}", s);
        s
    };

    website.crawl().await;

    for page in website.get_pages() {
        println!("- {}", page.get_html());
    }
}
Adding custom logic to on_link_find_callback is the main way to get the data in real time. See the crate docs for details.
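For example, a rough self-contained sketch of putting custom logic inside the callback, using the same fn signature as the example above (the "/licenses/" filter is just a placeholder for whatever logic you need):

    extern crate spider;

    use spider::website::Website;
    use spider::tokio;

    #[tokio::main]
    async fn main() {
        let mut website: Website = Website::new("https://choosealicense.com");

        // Hypothetical custom logic: only log links that look like license pages,
        // but always return the link unchanged so the crawl continues as normal.
        website.on_link_find_callback = |link| {
            if link.contains("/licenses/") {
                println!("license page found: {}", link);
            }
            link
        };

        website.crawl().await;
    }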
Examples improved https://github.com/spider-rs/spider/commit/dfec8fe896c3d0d3410da8a6611334e51e45094b, thanks!
Looking at the example, I wonder if more inline docs would be useful. This is just from ChatGPT, and it looks like overkill to me, but sometimes it is easier to read the comments than to scan through the variable names for clues.
extern crate spider; // Bring the `spider` crate into scope

use spider::tokio; // Bring the `tokio` module from the `spider` crate into scope
use spider::website::Website; // Bring the `Website` struct from the `website` module of the `spider` crate into scope

#[tokio::main] // Indicate that this function is the entry point for a tokio-based Rust program
async fn main() {
    // Create a new `Website` struct and store it in the `website` variable
    let mut website: Website = Website::new("https://rsseau.fr");
    // Add a URL to the blacklist
    website.configuration.blacklist_url.push("https://rsseau.fr/resume".to_string());
    // Set whether to respect the `robots.txt` file
    website.configuration.respect_robots_txt = true;
    // Set whether to crawl subdomains
    website.configuration.subdomains = false;
    // Set the delay between requests (defaults to 250 ms)
    website.configuration.delay = 15;
    // Set the user agent string (defaults to "spider/x.y.z", where x.y.z is the library version)
    website.configuration.user_agent = "SpiderBot".into();
    // Crawl the website
    website.crawl().await;
    // Print out the URLs of all the pages that were crawled
    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
I'm having more primitive "Hello World" issues (again), and I think ChatGPT might be useful there too:
To execute this spider from the command line, you will need to use the cargo command, which is the package manager for Rust. You can run the spider by using the following command:
cargo run --example example
_This will build and run the spider, using the example example specified in the Cargo.toml file.
If you want to build the spider without running it, you can use the cargo build command instead of cargo run.
Alternatively, if you want to build the spider in release mode (which can result in faster execution times), you can use the cargo build --release command.
Keep in mind that you will need to have the spider crate installed in your Rust project in order to be able to build and run the spider. You can install the spider crate by adding it as a dependency in your Cargo.toml file, or by running the following command:_
cargo install spider
I still can't seem to run spider --help, so I again looked to see if ChatGPT could fill in some of the gaps in my knowledge.
_The spider command is not a standalone command that you can run from the command line. Instead, it is a crate (library) that you can use in your own Rust projects.
To use the spider crate in your own Rust project, you will need to add it as a dependency in your Cargo.toml file, like this:_
[dependencies]
spider = "0.7"
Then, you can include the spider crate in your Rust code using the extern crate directive, like this:
extern crate spider;
_Once you have done this, you can use the functions and types provided by the spider crate in your Rust code.
If you want to see the documentation for the spider crate, you can use the cargo doc command to generate it, or you can view it online at https://docs.rs/spider._
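Putting that together, here is my rough sketch of a src/main.rs for a new project, assuming the same Website API as the examples above (Website::new, crawl, get_pages, get_url, get_html), so treat it as a sketch rather than a tested program:

    extern crate spider;

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Crawl a site and print each URL that was visited, plus the size of its HTML.
        let mut website: Website = Website::new("https://choosealicense.com");
        website.crawl().await;

        for page in website.get_pages() {
            println!("{} ({} bytes of HTML)", page.get_url(), page.get_html().len());
        }
    }

With spider listed under [dependencies], cargo run should build and run that.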
That all sounded good, but I'm still not getting results.
% spider --help
zsh: command not found: spider
@mgifford Hi, for the spider CLI installation it is actually cargo install spider_cli.
Ahh.. I think this is where you know too much & I know too little. Thanks for your support on this.
And to crawl a site, the command has to be in a format like this: % spider -d https://example.com -r crawl
That's the argument order that is needed; I had first tried % spider crawl -d https://example.com -r -v, which doesn't work.
That would be useful to note in the documentation.
And I'd need to use % spider -d https://example.com -r crawl > example.com.txt
to dump the results to a file that could then be imported into a spreadsheet or whatever.
This project has no way to do this, or to modify the output.
No problem, the output option is the -o flag, e.g.:
spider --domain https://choosealicense.com crawl -o > spider_choosealicense.json
The CLI is set up for stdout as the primary output at the moment.
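If you are using the library rather than the CLI, a rough sketch of writing the results to a file yourself (assuming the same get_pages/get_url API as the examples above, plus std::fs for the file handling) could look like:

    extern crate spider;

    use std::fs::File;
    use std::io::Write;

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        let mut website: Website = Website::new("https://choosealicense.com");
        website.crawl().await;

        // One URL per line, so the file can be imported into a spreadsheet afterwards.
        let mut out = File::create("choosealicense.txt").expect("could not create output file");
        for page in website.get_pages() {
            writeln!(out, "{}", page.get_url()).expect("could not write to output file");
        }
    }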
It is useful to tell folks what they need to do, probably starting with:
It is easy to get some results with:
cargo run --example example
It is harder to get this working:
spider [OPTIONS] --domain <DOMAIN> [SUBCOMMAND]
Where is the --help that gives me the [OPTIONS] I'm looking for? How about a [SUBCOMMAND]? Is the domain given with or without the https://? Does it matter?
Having a general INSTALL.txt file is always helpful.
When you are able to get a spider to work, where does the data go?
I can get
cargo run --example example
to scan https://rsseau.fr as configured in the example.rs file, but I'm not sure how to customize that. I should be able to just copy the example.rs file and run something that points to that config, but I'm not sure what that is. This is all good info to put in an INSTALL.txt file.
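My guess at how to customize it, assuming cargo's default example auto-discovery: copy examples/example.rs to a new file such as examples/mysite.rs (mysite and the URL below are just placeholders), change the configuration in it, and run cargo run --example mysite. A rough sketch of what the copied file might look like:

    extern crate spider;

    use spider::tokio;
    use spider::website::Website;

    #[tokio::main]
    async fn main() {
        // Point the crawler at the site you want to scan instead of the one
        // hard-coded in example.rs.
        let mut website: Website = Website::new("https://example.com");
        website.configuration.respect_robots_txt = true;

        website.crawl().await;

        for page in website.get_pages() {
            println!("- {}", page.get_url());
        }
    }

If that is the intended workflow, spelling it out in the INSTALL.txt file would help.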