spider-rs / spider

A web crawler and scraper for Rust
https://spider.cloud
MIT License

Already crawled URL attempted as % encoded #172

Closed: apsaltis closed this issue 8 months ago

apsaltis commented 8 months ago

Hi, I have the following code:


// Build the crawler and subscribe to pages as they are fetched.
let mut website: Website = Website::new(webpage)
    .with_wait_for_idle_network(Some(WaitForIdleNetwork::new(Some(Duration::from_secs(30)))))
    .build()
    .unwrap();
let mut rx2 = website.subscribe(16).unwrap();
let start = Instant::now();
// Log each page as it is received.
tokio::spawn(async move {
    while let Ok(page) = rx2.recv().await {
        println!(
            "found {:?}, size: {}, is_some:{}, status:{:?}, {:?}",
            page.get_url(),
            page.get_bytes().map(|b| b.len()).unwrap_or_default(),
            page.get_bytes().is_some(),
            page.status_code,
            start.elapsed()
        );
    }
});
// Start scraping the site.
website.scrape().await;

After this has been running for a while, I start to see log output like the following:
**First from spider::utils**
[2024-03-18T15:48:12Z INFO  spider::utils] fetch - https://www.cprime.com/%22https:////www.cprime.com//resources//blog//how-to-develop-a-hospital-management-system///%22

Then, from the `println!`
found "https://www.cprime.com/%22https:////www.cprime.com//resources//blog//how-to-develop-a-hospital-management-system///%22", size: 0, is_some:false, status:404, 4028.537054042s

This pattern occurs for quite a few URLs that don't exist. I can confirm that the URL appended to the base `https://www.cprime.com/` has already been crawled. So I'm not missing pages, but there seems to be a lot of redundancy, and 404s are being generated.

This happens for various sites that I have tested it on.

Any thoughts on how to track this down?
j-mendez commented 8 months ago

Hi, the crawler only crawls URLs that exist; there is no decoding done on the URLs. If you have a lot of operations going, I would not use `println!`; switch to a locked stdout writer instead. I would also avoid using `scrape` on a website as large as cprime.com, since scrape stores the HTML content for every page throughout the crawl. I think the issue could have been due to memory constraints if the content does not exist on the website.
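A minimal sketch of those two suggestions, assuming the same `subscribe` API as the snippet above and spider's tokio re-export as used in the crate's examples (the chrome-specific `with_wait_for_idle_network` option is omitted here): the subscriber writes through a buffered stdout handle rather than calling `println!` per page, and `crawl` is used instead of `scrape` so page HTML is not accumulated for the entire run.

use std::io::{BufWriter, Write};
use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://www.cprime.com")
        .build()
        .unwrap();
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        // Buffer writes so stdout is locked and flushed in larger chunks
        // rather than once per println! call.
        let mut out = BufWriter::new(std::io::stdout());
        while let Ok(page) = rx2.recv().await {
            let _ = writeln!(
                out,
                "found {:?}, status: {:?}",
                page.get_url(),
                page.status_code
            );
        }
        let _ = out.flush();
    });

    // crawl() walks the site without retaining each page's HTML for the
    // whole run the way scrape() does, per the comment above.
    website.crawl().await;
}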

There was actually an issue with the semaphore on the scrape calls with Chrome that led to memory issues. A fix is coming out.

j-mendez commented 8 months ago

Should be fixed in 1.85.4, thanks for the issue!

apsaltis commented 8 months ago

Thanks for the info and also the pointer on `println!`, greatly appreciated.