spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

support file:// urls #197

Closed: jmikedupont2 closed this issue 2 months ago

jmikedupont2 commented 3 months ago

I would like to start a spider crawl from a local file of URLs and then have it follow them. I found multiple hard-coded parts of the program that check for https, which looks like duplicated code. Please refactor and allow for more protocols.
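As a sketch of the refactor being asked for, the duplicated scheme checks could collapse into one shared helper. The `ALLOWED_SCHEMES` list and `is_allowed_url` name below are illustrative only, not spider's actual internals; the example assumes the `url` crate.

```rust
use url::Url;

// Hypothetical allow-list; spider's real internals differ.
const ALLOWED_SCHEMES: &[&str] = &["http", "https", "file"];

/// One shared guard instead of hard-coded `https` checks scattered
/// across the codebase; adding a protocol becomes a one-line change.
fn is_allowed_url(raw: &str) -> bool {
    match Url::parse(raw) {
        Ok(u) => ALLOWED_SCHEMES.contains(&u.scheme()),
        Err(_) => false,
    }
}

fn main() {
    assert!(is_allowed_url("https://spider.cloud"));
    assert!(is_allowed_url("file:///home/mdupont/links.html"));
    assert!(!is_allowed_url("javascript:alert(1)"));
}
```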

j-mendez commented 3 months ago

> I would like to start a spider crawl from a local file of URLs and then have it follow them. I found multiple hard-coded parts of the program that check for https, which looks like duplicated code. Please refactor and allow for more protocols.

Hi, can you show an example of this not working? That way whoever picks this up does not have to figure out what the refactoring should look like.

jmikedupont2 commented 3 months ago

Using this file https://github.com/meta-introspector/time-grants/blob/main/2024/08/01/links.html

RUST_LOG=debug ./target/debug/spider --verbose --url file:///home/mdupont/2024/08/01/time-grants/2024/08/01/links.html crawl
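Until file:// is supported, one possible workaround is to read the file yourself and seed one crawl per extracted link. This is a minimal sketch assuming spider's documented `Website::new` / `crawl` API and a tokio runtime; the naive href extraction is for illustration only (a real HTML parser such as the scraper crate would be safer).

```rust
use spider::website::Website;
use std::fs;

#[tokio::main]
async fn main() {
    // Read the local HTML file that file:// crawling would have fetched.
    let html = fs::read_to_string("/home/mdupont/2024/08/01/time-grants/2024/08/01/links.html")
        .expect("failed to read local link file");

    // Naive href extraction; assumes double-quoted attributes.
    for chunk in html.split("href=\"").skip(1) {
        if let Some(link) = chunk.split('"').next() {
            if link.starts_with("http") {
                // Seed a separate crawl from each remote link in the file.
                let mut website = Website::new(link);
                website.crawl().await;
                println!("crawled {} pages from {}", website.get_links().len(), link);
            }
        }
    }
}
```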

j-mendez commented 3 months ago

> Using this file https://github.com/meta-introspector/time-grants/blob/main/2024/08/01/links.html
>
> RUST_LOG=debug ./target/debug/spider --verbose --url file:///home/mdupont/2024/08/01/time-grants/2024/08/01/links.html crawl

Whoever handles this needs to make it a config option. Usually we want to ignore local files unless we know beforehand that we need to crawl files on the machine's disk.
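As a sketch of that config idea, a hypothetical opt-in flag could gate the file:// scheme so it stays off by default; `CrawlConfig` and `allow_file_protocol` are invented names here, not existing spider configuration fields.

```rust
// Hypothetical configuration shape; spider's real Configuration differs.
#[derive(Default)]
struct CrawlConfig {
    /// Opt-in: file:// URLs are ignored unless this is set.
    allow_file_protocol: bool,
}

fn should_follow(cfg: &CrawlConfig, scheme: &str) -> bool {
    match scheme {
        "http" | "https" => true,
        // Local files stay off by default so remote crawls never touch disk.
        "file" => cfg.allow_file_protocol,
        _ => false,
    }
}

fn main() {
    let cfg = CrawlConfig { allow_file_protocol: true };
    assert!(should_follow(&cfg, "file"));
    assert!(!should_follow(&CrawlConfig::default(), "file"));
}
```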

j-mendez commented 2 months ago

The selectors should already handle the links appropriately. The issue is that you cannot fetch local files when crawling remotely.