scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

HBaseQueue losing urls when using BC_MAX_REQUESTS_PER_HOST #393

Closed · a-shkarupin closed this issue 4 years ago

a-shkarupin commented 4 years ago

Hi, the BC_MAX_REQUESTS_PER_HOST description states:

Don’t include (if possible) requests for a specific host in a batch if there are already more than the specified maximum number of requests for that host. This is a suggestion for the broad crawling queue get algorithm.

However, in practice the URLs exceeding the specified maximum number of requests per host are dropped for a given row key. I would expect such URLs to be included in later batches, not dropped. This is the part of the code in question: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/hbase/__init__.py#L249

Is this expected behavior?

If it is, this should be stated explicitly in the documentation to avoid possible misuse of the option.

Best regards

sibiryakov commented 4 years ago

Hi, these requests are only skipped for the batch of results prepared in a specific iteration. They stay in the HBase table and are available for subsequent runs. Only requests added to trash_can are removed completely.
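
For illustration, here is a minimal sketch of that behaviour. This is not the actual HBaseQueue code; the function `build_batch` and its arguments are hypothetical, but it shows the distinction being described: requests over the per-host cap are merely skipped for the current batch (and stay in the table), while only the row keys collected in `trash_can` are deleted afterwards.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_batch(candidates, max_n_requests, max_requests_per_host):
    """Select up to max_n_requests, capping each host at max_requests_per_host.

    candidates: iterable of (row_key, url) tuples read from the queue table.
    Returns (batch, trash_can): the requests to hand out, and the row keys
    to remove from the table once the batch is delivered.
    """
    requests_per_host = defaultdict(int)
    batch, trash_can = [], []

    for row_key, url in candidates:
        host = urlparse(url).netloc
        if requests_per_host[host] >= max_requests_per_host:
            # Over the per-host cap: skip for this batch only. The row stays
            # in the table and can be picked up in a later iteration.
            continue
        requests_per_host[host] += 1
        batch.append(url)
        trash_can.append(row_key)  # only these rows are deleted later
        if len(batch) >= max_n_requests:
            break

    return batch, trash_can


# With max_requests_per_host=2, the third example.com URL is skipped from this
# batch but remains queued for a subsequent call.
candidates = [
    (b"k1", "http://example.com/a"),
    (b"k2", "http://example.com/b"),
    (b"k3", "http://example.com/c"),
    (b"k4", "http://other.org/x"),
]
batch, trash_can = build_batch(candidates, max_n_requests=10, max_requests_per_host=2)
print(batch)      # ['http://example.com/a', 'http://example.com/b', 'http://other.org/x']
print(trash_can)  # [b'k1', b'k2', b'k4']
```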