spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License
1.16k stars, 101 forks

Prebuilt binaries for Linux, macOS #183

Closed Zabrane closed 5 months ago

Zabrane commented 6 months ago

Hi guys,

This project is simply awesome. I've tried a bunch of spiders in Go and NodeJS but they are slow. However, I'm getting an error when trying to compile it on my macOS Intel:

$ rustc --version
rustc 1.78.0 (9b00956e5 2024-04-29) (Homebrew)

$ cargo --version
cargo 1.78.0

$ cargo install spider_cli
...
   Compiling lock_api v0.4.12
   Compiling bitflags v2.5.0
   Compiling fnv v1.0.7
   Compiling memchr v2.7.2
   Compiling slab v0.4.9
error: failed to run custom build command for `libc v0.2.155`

Caused by:
  process didn't exit successfully: `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe/release/build/libc-112416c9683a52c9/build-script-build` (signal: 9, SIGKILL: kill)
warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for `proc-macro2 v1.0.84`

Caused by:
  process didn't exit successfully: `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe/release/build/proc-macro2-be1b5d01b784b9e6/build-script-build` (signal: 6, SIGABRT: process abort signal)
  --- stderr
  dyld[21698]: section __got overflows indirect symbol table
error: failed to run custom build command for `serde v1.0.203`

Caused by:
  process didn't exit successfully: `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe/release/build/serde-a3697e463548b6b8/build-script-build` (signal: 9, SIGKILL: kill)
error: failed to run custom build command for `libc v0.2.155`

Caused by:
  process didn't exit successfully: `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe/release/build/libc-960b6c8ecff84b0a/build-script-build` (signal: 9, SIGKILL: kill)
error: failed to run custom build command for `parking_lot_core v0.9.10`

Caused by:
  process didn't exit successfully: `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe/release/build/parking_lot_core-9ad02810495e83a2/build-script-build` (signal: 9, SIGKILL: kill)
error: failed to compile `spider_cli v1.95.14`, intermediate artifacts can be found at `/var/folders/vs/bzcnt6t54q36jqxt9_ym3yrm0000gn/T/cargo-installxHswwe`.
To reuse those artifacts with a future compilation, set the environment variable `CARGO_TARGET_DIR` to that path.

Would it be possible to prepare pre-built binaries to simplify its usage?

I succeeded in building it from source on Ubuntu 22.04, but it doesn't seem to work:

$ spider --version
spider_cli 1.95.14

$ spider -a --verbose --url "https://rsseau.fr"
$

Many thanks

j-mendez commented 6 months ago

@Zabrane could you try v1.95.19 and see if the issues persist? Pre-built binaries are possible. This crate has a lot of features, so a pre-built binary would require enabling every feature and making sure there is a call for each feature's workflow differences across runs. This would need a lot of documentation. Feel free to push a PR.

j-mendez commented 6 months ago

https://github.com/axodotdev/cargo-dist should be able to handle the reqs for the deployment.

Zabrane commented 6 months ago

@j-mendez testing now.

Zabrane commented 6 months ago

@j-mendez could you help, please?

❯ git checkout v1.95.19
HEAD is now at 4b8a604 chore(deps): bump chromiumoxide@0.6.0

❯ cargo install spider_cli
    Updating crates.io index
     Ignored package `spider_cli v1.95.22` is already installed, use --force to override

❯ ~/.cargo/bin/spider --version
spider_cli 1.95.22

❯ ~/.cargo/bin/spider -a --verbose --url "https://rsseau.fr"
❯

Why is a checkout of v1.95.19 building a newer version, 1.95.22? As you can see, it still doesn't work.

j-mendez commented 6 months ago

You need a command after the URL. Here is the list, which can be found using spider --help:

The fastest web crawler CLI written in Rust.

Usage: spider [OPTIONS] --url <URL> [COMMAND]

Commands:
  crawl     Crawl the website extracting links
  scrape    Scrape the website extracting html and links
  download  Download html markup to destination
  help      Print this message or the help of the given subcommand(s)

[Image: example of the spider CLI running]

j-mendez commented 6 months ago

Accidental close.

Zabrane commented 6 months ago

@j-mendez why isn't --limit <LIMIT> respected? If, for example, I run a crawl with --limit 5, it keeps running forever.

Similarly, --depth 1 gets me the first page, but --depth 2 runs forever.
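
For anyone reading along, here is a minimal sketch of the behavior these two flags are expected to have. It is not spider's implementation: it walks a small in-memory link graph instead of fetching pages, and `crawl_bounded`, `demo_site`, and the URLs are hypothetical names made up for illustration. It assumes the start page counts as depth 1, which matches the observation above that --depth 1 returns only the first page.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first crawl over an in-memory link graph (a stand-in for HTTP fetches).
/// `limit` caps the number of pages visited; `depth` caps the link distance,
/// counting the start page as depth 1. `None` means unbounded.
fn crawl_bounded(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    limit: Option<usize>,
    depth: Option<usize>,
) -> Vec<String> {
    let mut visited: Vec<String> = Vec::new();
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();

    seen.insert(start.to_string());
    queue.push_back((start.to_string(), 1));

    while let Some((url, level)) = queue.pop_front() {
        // Stop as soon as the page budget is spent.
        if limit.map_or(false, |max| visited.len() >= max) {
            break;
        }
        visited.push(url.clone());

        // Pages at the depth bound are visited, but their links are not followed.
        if depth.map_or(false, |max| level >= max) {
            continue;
        }
        if let Some(children) = links.get(url.as_str()) {
            for child in children {
                // `insert` returns false for URLs already queued, which breaks cycles.
                if seen.insert(child.to_string()) {
                    queue.push_back((child.to_string(), level + 1));
                }
            }
        }
    }
    visited
}

/// A tiny hypothetical site that contains link cycles.
fn demo_site() -> HashMap<&'static str, Vec<&'static str>> {
    HashMap::from([
        ("/", vec!["/about", "/licenses"]),
        ("/about", vec!["/"]),
        ("/licenses", vec!["/licenses/mit", "/licenses/isc"]),
        ("/licenses/mit", vec!["/licenses"]),
    ])
}
```

Calling `crawl_bounded(&demo_site(), "/", Some(3), None)` visits three pages and stops, and `depth = Some(1)` visits only the start page. A crawl that "runs forever" with these flags set would mean either the budget check or the depth check is not being applied when new links are queued.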

j-mendez commented 6 months ago

[Screenshot: the spider CLI respecting the crawl limits]

Fixed in v1.95.23. Thanks again.

Zabrane commented 6 months ago

@j-mendez I don't know which version you're using, but neither --limit nor --depth is working as expected :-/

> spider --version
spider_cli 1.95.22

> spider --verbose --limit 3 --url "https://choosealicense.com/" crawl
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/terms-of-service/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/about/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/mit/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/non-software/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/community/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/no-permission/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/unlicense/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/isc/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/appendix/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/isc
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/appendix
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/unlicense
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/mit-0
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/zlib
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/ms-pl
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-2-clause/
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-3-clause-clear
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/ncsa
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-2-clause
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/mit
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-2-clause-patent
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-4-clause
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/vim
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/ms-rl
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/postgresql
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-3-clause
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/wtfpl
[2024-05-30T05:19:09Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/0bsd
[2024-05-30T05:19:10Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/ms-pl/
[2024-05-30T05:19:10Z INFO  spider::utils] fetch - https://choosealicense.com/licenses/bsd-3-clause/
[...] continues forever !!!!

Zabrane commented 5 months ago

Working now

$ spider --version
spider_cli 1.95.23
$ spider --verbose --limit 3 --url "https://choosealicense.com/" crawl
[2024-05-30T07:46:59Z INFO  spider::utils] fetch - https://choosealicense.com/
[2024-05-30T07:46:59Z INFO  spider::utils] fetch - https://choosealicense.com/community/
[2024-05-30T07:46:59Z INFO  spider::utils] fetch - https://choosealicense.com/non-software/

j-mendez commented 5 months ago

Some artifacts are available at https://github.com/spider-rs/spider/releases/tag/v1.95.27. The next release will have musl support.