spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

Help wanted: Reduce memory footprint #204

Closed · Falumpaset closed this issue 2 months ago

Falumpaset commented 2 months ago

Hey,

I'm crawling several sites in parallel, and these sites are very large. The crawler's memory consumption keeps increasing over time. I'm countering that by increasing the swap size, but that is only a temporary workaround.

Is there a way to not have it store the visited pages in memory? I don't need them, because I'm subscribing to the crawler and processing the visited pages on the fly. Any ideas?

See my implementation below.

Help is very much appreciated!

Kind regards.

pub async fn crawl(&self) -> anyhow::Result<()> {
        log::info!("Running");

        let ua = ua_generator::ua::spoof_ua();

        let config = Configuration::new()
            .with_user_agent(Some(ua))
            .with_respect_robots_txt(false)
            .with_delay(60)
            .build();

        let mut handles = Vec::with_capacity((self.crawl_list.len() * 2) + 1);
        for website_url in &self.crawl_list {
            match Website::new(website_url)
                .with_config(config.to_owned())
                .build()
            {
                Ok(mut website) => {
                    log::info!("Crawling site {:?}", website_url);
                    // Subscribe before starting the crawl so every visited page is
                    // broadcast to this receiver as it is fetched.
                    let mut rx2: tokio::sync::broadcast::Receiver<spider::page::Page> =
                        website.subscribe(128).unwrap();
                    let page_sender = self.link_service.clone();
                    let shutdown_sender = self.link_service.clone();

                    // Task 1: drain the broadcast channel and hand each page off
                    // to the processing service without keeping it around.
                    let channel_handle = tokio::spawn(async move {
                        while let Ok(res) = rx2.recv().await {
                            let thread_tx = page_sender.clone();
                            tokio::spawn(async move {
                                let command = LeakPageServiceCommand::ProcessPage(res);
                                log::info!("Sending page to processing");
                                let _ = thread_tx.send(command).await;
                            });
                        }
                        log::error!("Channel closed")
                    });
                    handles.push(channel_handle);

                    // Task 2: run the crawl itself, then tell the consumer to shut down.
                    let website_handle = tokio::spawn(async move {
                        website.crawl().await;
                        log::info!("Sending shutdown signal");

                        let _ = shutdown_sender
                            .send(LeakPageServiceCommand::Terminate)
                            .await;
                        // Free the crawl state (visited links/pages) and close the
                        // subscription channel once the crawl is done.
                        website.clear();
                        website.unsubscribe();
                    });
                    handles.push(website_handle);
                }
                Err(e) => log::error!("Fatal error {:?}", e),
            }
        }
        for handle in handles {
            let _ = handle.await;
        }

        Ok(())
    }
j-mendez commented 2 months ago

Use jemalloc for the memory pressure. We need to keep the visited links to prevent re-crawling the same URL within a run.

Falumpaset commented 2 months ago

Is there any documentation on how to use spider with jemalloc? I see that there's a jemalloc feature flag. Could you please provide an example? Very much appreciated!

j-mendez commented 2 months ago

> Is there any documentation on how to use spider with jemalloc? I see that there's a jemalloc feature flag. Could you please provide an example? Very much appreciated!

No, it just swaps the memory backend. You usually want to do this manually yourself at the top of your entry point; spider can also handle it for you with the feature flag you mentioned.
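
For reference, a minimal sketch of wiring this up manually at the entry point, assuming the tikv-jemallocator crate (the crate choice and version here are only an example, not something spider mandates):

// Cargo.toml (assumed):
// [dependencies]
// tikv-jemallocator = "0.5"

use tikv_jemallocator::Jemalloc;

// Route every heap allocation in this binary through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // ... start the tokio runtime and run the crawl as usual ...
}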

j-mendez commented 2 months ago

You can also use the “fs” feature flag to stream the response to disk and retrieve it asynchronously after the crawl finishes. This avoids holding all of the content in memory at once.
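
For example, opting into the feature flags mentioned above could look like this in Cargo.toml (the version is a placeholder, check the current release):

[dependencies]
spider = { version = "1", features = ["fs", "jemalloc"] }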

j-mendez commented 1 month ago

@Falumpaset we now use string interning for links visited. This should help out too!
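
For anyone curious about the idea: string interning stores each unique URL once and hands out a small integer handle, so repeated links don't duplicate the full string. A generic illustration of the technique (a sketch only, not spider's actual implementation):

use std::collections::HashMap;

/// Minimal interner sketch: each unique string is stored once and callers
/// keep a compact u32 handle instead of another String allocation.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    fn resolve(&self, id: u32) -> Option<&str> {
        self.strings.get(id as usize).map(String::as_str)
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("https://spider.cloud/");
    let b = interner.intern("https://spider.cloud/");
    assert_eq!(a, b); // the duplicate URL shares the single stored copy
    assert_eq!(interner.resolve(a), Some("https://spider.cloud/"));
}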