scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

[WIP] Add max_items #129

Open serhii73 opened 3 years ago

serhii73 commented 3 years ago

Closes #107

pawelmhm commented 3 years ago

Thank you for this contribution @serhii73. I really appreciate it!

The idea behind this feature is to limit the number of items in the output. With the current implementation it doesn't really do the trick. You can test it by running curl:

> curl "http://localhost:9080/crawl.json?url=http://quotes.toscrape.com&spider_name=toscrape-css&max_items=1" -v | jq '.items'

I passed max_items=1 but I got 10 items.

This is because the limit_items function is not called when an item is scraped. To actually call it when an item is scraped, you need to plug it into the item_scraped signal. There is already one handler registered for this signal: the "get_item" method. If you move your item-counting code there, it should do the trick.
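To make the suggestion concrete, here is a minimal, self-contained sketch of the counting logic that would live in the item_scraped handler. The class and attribute names (`CrawlManagerSketch`, `closed_reason`) are illustrative stand-ins, not scrapyrt's actual implementation; in scrapyrt the close step would go through the crawler engine rather than setting a flag.

```python
class CrawlManagerSketch:
    """Illustrative stand-in (NOT scrapyrt's real CrawlManager) showing
    how to count items inside the item_scraped handler and stop once
    max_items is reached."""

    def __init__(self, max_items=None):
        self.max_items = max_items
        self.items = []
        self.closed_reason = None

    def get_item(self, item):
        # Called once per scraped item (in scrapyrt this handler is
        # connected to the item_scraped signal).
        if self.closed_reason is not None:
            return  # crawl already closed; ignore stragglers
        self.items.append(item)
        if self.max_items and len(self.items) >= self.max_items:
            # In scrapyrt this would ask the engine to close the spider,
            # e.g. via crawler.engine.close_spider(...).
            self.closed_reason = "max_items_reached"


mgr = CrawlManagerSketch(max_items=1)
for quote in ["q1", "q2", "q3"]:
    mgr.get_item(quote)
print(len(mgr.items), mgr.closed_reason)  # → 1 max_items_reached
```

The key point is that the limit check runs once per item, at the moment the item is scraped, instead of after the crawl has already produced all of its output.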

:+1: for adding a test. The test is ok, but it would also be good to add an integration test in test_resource_crawl.TestCrawlResourceIntegration that passes the max_items parameter to the API and checks that it actually works as expected. It is possible that an argument works fine in the crawl manager but is filtered out earlier in the HTTP resource handling. This is why adding an integration test is recommended.
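The shape of such an integration test might look like the sketch below. This is a hypothetical outline, not the actual test suite code: the real test would issue an HTTP request against a running scrapyrt service, while here `fetch_json` is injected as a stand-in so the sketch is self-contained.

```python
import json


def check_max_items(fetch_json, max_items=1):
    """Assert that a crawl.json request honours the max_items parameter.

    `fetch_json` is a placeholder for whatever issues the HTTP request in
    the real integration test and returns the raw JSON response body.
    """
    payload = fetch_json(
        "/crawl.json?url=http://quotes.toscrape.com"
        f"&spider_name=toscrape-css&max_items={max_items}"
    )
    items = json.loads(payload)["items"]
    assert len(items) <= max_items, (
        f"expected at most {max_items} items, got {len(items)}"
    )
    return len(items)


# Stand-in for a service that handles max_items=1 correctly:
fake_service = lambda qs: '{"status": "ok", "items": [{"text": "a quote"}]}'
print(check_max_items(fake_service, max_items=1))  # → 1
```

Exercising the parameter end-to-end like this catches the failure mode described above, where the crawl manager supports an argument but the HTTP resource layer drops it before it ever arrives.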

Finally, when we merge this we'll also need to update the documentation with this parameter so that people know how to use it.