spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

Delay and with_delay totally break crawl #232

Open Revertron opened 2 hours ago

Revertron commented 2 hours ago

It seems that website.crawl().await does not work at all when a delay such as 1000 (one second) is set. This may also be the cause of #225.
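A minimal reproduction, roughly following the builder example from the README (the target URL here is just a placeholder):

```rust
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Placeholder target; the delay is in milliseconds.
    let mut website: Website = Website::new("https://choosealicense.com")
        .with_delay(1000)
        .build()
        .unwrap();

    website.crawl().await;

    // With the bug described above, this loop prints only a link or two
    // even on a site with many internal pages.
    for link in website.get_links() {
        println!("- {}", link.as_ref());
    }
}
```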

The culprit appears to be the select!() on this line: https://github.com/spider-rs/spider/blob/main/spider/src/website.rs#L2227.

The stream is throttled, so the select!() always picks the other branches and never fetches any links.
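To illustrate the dynamic, here is a contrived sketch (not spider's actual crawl loop, assuming plain tokio plus tokio-stream with its time feature): a branch that is always ready wins nearly every select!() iteration, while the throttled stream only completes once its delay elapses. In this standalone form the items do still trickle through, but it shows how heavily the other branch dominates:

```rust
use std::time::Duration;
use tokio_stream::StreamExt; // tokio-stream with the "time" feature

#[tokio::main]
async fn main() {
    // Three "links", throttled to one every 500 ms.
    let links = tokio_stream::iter(["a", "b", "c"]).throttle(Duration::from_millis(500));
    tokio::pin!(links);

    let mut other_wins = 0u32;
    loop {
        tokio::select! {
            // This branch is ready almost immediately on every iteration.
            _ = tokio::time::sleep(Duration::from_millis(1)) => {
                other_wins += 1;
            }
            // The throttled branch only wins after its delay elapses; because
            // `links` is pinned outside the loop, its timer state survives the
            // cancelled polls, so each item is eventually yielded.
            maybe_link = links.next() => match maybe_link {
                Some(link) => println!("fetched {link}"),
                None => break,
            },
        }
    }
    println!("the always-ready branch won {other_wins} times in between");
}
```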

j-mendez commented 2 hours ago

Hi, what version of spider are you using? The select will run both tasks at the same time, concurrently and independent of each other. There was a bug in an older version; make sure to use v2.13.76 or above (currently v2.13.79).

j-mendez commented 2 hours ago

Tested it locally and the delay works across all features. The issue you were facing was most likely from the example target redirecting to example.com. I updated the default target URL to fix this.

Revertron commented 1 hour ago

I've tried the latest version, from master, and I didn't use the URLs from the examples.

Why did you close the issue?

From the docs of tokio:

> The tokio::select! macro allows waiting on multiple async computations and returns when a single computation completes.

Did you try a big delay, like 1000?

j-mendez commented 59 minutes ago

Yes, tested this with a delay of one second. The select is cancel-safe because it uses join_next from a JoinSet.
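A minimal sketch of that pattern with plain tokio (not the actual website.rs code): JoinSet::join_next is documented as cancel safe, so a task that completes while the other branch wins an iteration is not lost; it is returned by a later call.

```rust
use std::time::Duration;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let mut set: JoinSet<u32> = JoinSet::new();
    for i in 0..3u32 {
        set.spawn(async move { i * 2 });
    }

    let mut interval = tokio::time::interval(Duration::from_millis(10));
    loop {
        tokio::select! {
            // Periodic branch; winning an iteration here does not drop
            // any completed task, because join_next is cancel safe.
            _ = interval.tick() => {}
            res = set.join_next() => match res {
                Some(Ok(v)) => println!("task finished: {v}"),
                Some(Err(e)) => eprintln!("task failed: {e}"),
                None => break, // all tasks joined; the set is empty
            },
        }
    }
}
```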

j-mendez commented 43 minutes ago

Ran it again to make sure. Going to look into why the first run only found 2 links; the second run respected the delay. Thanks for the issue, I'll take a look in the morning. The delay disables the concurrency, so this should be straightforward to fix.

https://github.com/user-attachments/assets/9d14bd92-4ef4-46e5-bd0f-66abde69f622