sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Best way to do page-by-page crawling #224

Closed okolobaxa closed 3 years ago

okolobaxa commented 3 years ago

Hello,

My source site's structure looks like this: http://example.com/page/1, http://example.com/page/2, and so on. However, some pages may not contain what I'm looking for (images).

What is the most efficient way to crawl such a site with Abot? For now, I use my own implementation of HyperLinkParser, which queues the next page by incrementing the page number in the URL of the page that was just crawled. Is there a more efficient way? I'm considering writing my own Scheduler implementation that is seeded with pre-calculated URLs.
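The two approaches mentioned above (incrementing the current URL versus pre-calculating the whole list) can be sketched as plain URL arithmetic. This is only a minimal illustration and deliberately uses no Abot types: `PageUrls`, `Next`, and `Range` are hypothetical names, and the sketch assumes the page number is always the last path segment.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of Abot: it only does URL arithmetic.
public static class PageUrls
{
    // Returns the URL of the next page by incrementing the trailing numeric
    // path segment, or null if the last segment is not a plain number.
    public static Uri? Next(Uri current)
    {
        string path = current.AbsolutePath.TrimEnd('/');
        int slash = path.LastIndexOf('/');
        string last = path.Substring(slash + 1);

        // NumberStyles.None accepts digits only (no sign, no whitespace).
        if (!int.TryParse(last, NumberStyles.None, CultureInfo.InvariantCulture, out int page))
            return null;

        // Keep scheme, host, and query; replace only the last path segment.
        var builder = new UriBuilder(current)
        {
            Path = path.Substring(0, slash + 1) + (page + 1)
        };
        return builder.Uri;
    }

    // Pre-calculates the URLs for pages first..last (inclusive).
    public static IEnumerable<Uri> Range(string baseUrl, int first, int last)
    {
        for (int page = first; page <= last; page++)
            yield return new Uri($"{baseUrl.TrimEnd('/')}/{page}");
    }
}
```

For example, `PageUrls.Next(new Uri("http://example.com/page/1"))` gives `http://example.com/page/2`, and `PageUrls.Range("http://example.com/page", 1, 100)` yields the full list up front. How either result is handed to the crawler (a custom link parser or a custom scheduler) depends on the Abot version and is not shown here.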

Thanks!

sjdirect commented 3 years ago

Are you asking how to crawl a site that has a predictable URL structure but whose pages are not discoverable by following the links Abot crawls, or are you asking how to avoid crawling pages that do not have images?

(In reply to Anton Kheystver's message of Thu, Dec 3, 2020, 2:39 AM.)

sjdirect commented 3 years ago

Closing the issue and taking the conversation to the forum.