spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

Broadcast never ends when scraping with limit #210

Closed DimitriTimoz closed 2 months ago

DimitriTimoz commented 2 months ago

The following code never ends when spider is in scraping mode: no new pages arrive, only a lock on .recv().

  let mut website: Website = Website::new("https://rsseau.fr");
  let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
      website.subscribe(0).unwrap();
  website.with_limit(1);
  let join_handle = tokio::spawn(async move {
      while let Ok(res) = rx2.recv().await {
          println!("page");
      }
  });
  website.scrape().await; // Ends when crawl() is used here instead
  website.unsubscribe();
  join_handle.await.unwrap();
j-mendez commented 2 months ago

Awaiting the join_handle while the subscription is still active is incorrect; the receiver needs to drop on its own. We have a subscription guard for the Chrome feature as needed. Take a look at the examples repo to learn more.
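
For reference, a minimal sketch of that pattern (assuming the `sync` feature and the re-exported `spider::tokio`; this is not copied from the examples repo): the receiver task is left to finish on its own once `unsubscribe()` drops the sender, instead of being awaited while the subscription is held.

  use spider::tokio;
  use spider::website::Website;

  #[tokio::main]
  async fn main() {
      let mut website: Website = Website::new("https://rsseau.fr");
      website.with_limit(1);
      let mut rx2 = website.subscribe(0).unwrap();

      // Do not keep the JoinHandle; let the task end on its own.
      tokio::spawn(async move {
          while let Ok(page) = rx2.recv().await {
              println!("{:?}", page.get_url());
          }
      });

      website.crawl().await;
      // Dropping the sender closes the broadcast channel, so the
      // spawned receiver loop exits by itself.
      website.unsubscribe();
  }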

j-mendez commented 2 months ago

The subscription should mainly be used with crawling. When scraping, you are already holding the content in memory, so it would be beneficial to pick one, not both.
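
A sketch of the "pick one" alternative, assuming scraped pages are read back through `get_pages()` (check the accessor for the version you are on): scrape without a subscription and iterate over the content that `scrape()` already keeps in memory.

  use spider::tokio;
  use spider::website::Website;

  #[tokio::main]
  async fn main() {
      let mut website: Website = Website::new("https://rsseau.fr");
      website.with_limit(1);

      // No subscribe(): scrape() already retains every page in memory.
      website.scrape().await;

      if let Some(pages) = website.get_pages() {
          for page in pages.iter() {
              println!("{:?}", page.get_url());
          }
      }
  }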