spider-rs / spider-py

Spider ported to Python
https://spider-rs.github.io/spider-py/
MIT License

Inconsistent Crawling Behavior with Specified Depth in Spider Scraper #9

Open HarshJa1n opened 1 week ago

HarshJa1n commented 1 week ago

I am developing a web scraper using the spider_py library and am encountering issues with the crawling depth functionality. The crawling depth behavior appears inconsistent across different sites.

Issue:

For one site, I set the depth to 4, and the results were as follows:

  • Found 1 URL, crawled with depth = 1 & 2
  • Found 152 URLs, crawled with depth = 3
  • Found 165 URLs, crawled with depth = 4

For another site with the same depth setting, the results were different:

  • Found 1 URL, crawled with depth = 1, 2, 3
  • Found 36 URLs, crawled with depth = 4
  • Found 210 URLs, crawled with depth = 5

Expected Behavior:

  • Depth 1: Crawl only the current page
  • Depth 2: Crawl the current page and every page it links to
  • Depth 3: Crawl the links found on those pages, and so on

However, the actual crawling behavior doesn't align with this definition of depth, sometimes crawling more or fewer levels than specified.
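
For reference, a minimal sketch of the depth semantics described above, using a toy in-memory link graph instead of real HTTP fetches (all URLs and the graph are hypothetical): depth = 1 visits only the start page, and each additional level expands one more layer of links.

from collections import deque

# Toy link graph standing in for real pages (hypothetical data).
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/x"],
    "https://example.com/b": [],
    "https://example.com/a/x": [],
}

def crawl_to_depth(start, depth):
    """Breadth-first crawl where depth = 1 visits only the start page."""
    seen = {start}
    frontier = deque([(start, 1)])
    visited = []
    while frontier:
        url, level = frontier.popleft()
        visited.append(url)
        if level >= depth:
            continue  # don't expand links beyond the requested depth
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, level + 1))
    return visited

print(crawl_to_depth("https://example.com/", 1))  # only the start page
print(crawl_to_depth("https://example.com/", 2))  # start page plus its direct links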

Steps to Reproduce:

  1. Set the crawling depth to 4 on different websites.
  2. Observe the number of URLs found at each depth level.
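
A reproduction along these lines should show the inconsistency; the call signature is copied from the code shared later in this thread, and headless rendering is assumed to be off:

from spider_rs import Website

# Same depth setting applied to the two sites mentioned in this thread.
for url in ["https://brev.dev", "https://promptfoo.dev"]:
    website = Website(url)
    website.with_depth(4)
    # Arguments mirror the snippet below; the third one toggles headless rendering.
    website.crawl(None, None, False)
    print(url, "->", len(website.get_links()), "URLs found")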

Request:

Clarification on how depth is being calculated or a potential fix to make the crawling depth behave consistently across different websites.

j-mendez commented 1 week ago

Hi, can you share example URLs and the settings used? Thanks!

HarshJa1n commented 3 days ago

Sure. For example, when crawling the site brev.dev, the spider_rs library returns only 52 URLs with a depth of 4, while at depths less than 4 the crawl returns just one URL. Similarly, when crawling the site promptfoo.dev, the library only starts returning more than one URL at a depth of 5 or greater.

You can infer the settings from this code. Current code:

from spider_rs import Website

def crawl_site(url, config):
    # Build the crawler and cap how many link levels deep it may go.
    website = Website(url)
    website.with_depth(config['depth'])
    # Crawl synchronously; the third argument toggles headless browser rendering.
    website.crawl(None, None, config['use_headless'])
    links = website.get_links()
    return links
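
To compare results across depth settings, a small harness like this could print the link count at each depth; the depth range and site URL are illustrative, and crawl_site is the helper above:

for depth in range(1, 6):
    config = {'depth': depth, 'use_headless': False}
    links = crawl_site("https://brev.dev", config)
    print(f"depth={depth}: found {len(links)} URLs")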