scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

it is not documented how HBaseBackend prioritizes requests #334

Closed kmike closed 6 years ago

kmike commented 6 years ago

The backends documentation explains how prioritization works for some backends.

For the revisiting backend it says "no prioritization". What does that mean? Are seeds scheduled FIFO? Are requests scheduled for recrawling processed in FIFO order as well?

For HBaseBackend, prioritization is not explained. Is it FIFO or something similar? Do partitions affect it somehow?

sibiryakov commented 6 years ago

PR https://github.com/scrapinghub/frontera/pull/331 is going to introduce a major change: backends will no longer be responsible for prioritisation, and this responsibility will be transferred to the crawling strategy (CS). The CS will have to schedule each request with a score ranging from 0.0 to 1.0. Probably only HBaseBackend and RedisBackend support this at the moment. The higher the score, the higher the priority of the request.
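
To illustrate the "higher score, higher priority" rule, here is a minimal sketch of a score-ordered queue in plain Python. It is not Frontera's API: `ScoredQueue`, `schedule` and `next_request` are hypothetical names, and breaking ties in insertion (FIFO) order is an assumption made for the example, not something this thread confirms about any backend.

```python
import heapq
import itertools


class ScoredQueue:
    """Toy queue (hypothetical, not a Frontera class): highest score is dequeued first."""

    def __init__(self):
        self._heap = []
        # Insertion counter; used only to break ties between equal scores (assumed FIFO).
        self._counter = itertools.count()

    def schedule(self, url, score):
        # Scores are expected in the range 0.0 to 1.0, as described above.
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")
        # heapq is a min-heap, so the score is negated to pop the highest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def next_request(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


queue = ScoredQueue()
queue.schedule("http://example.com/low", 0.1)
queue.schedule("http://example.com/high", 0.9)
queue.schedule("http://example.com/mid", 0.5)

# Drain the queue; URLs come out ordered by descending score.
order = [queue.next_request() for _ in range(len(queue))]
print(order)
```

With a queue like this, the crawling strategy alone decides the crawl order by choosing scores, and the storage layer only has to honour them.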

kmike commented 6 years ago

Sounds great! I'm closing this issue then.