scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Ability to control consumption of start_requests from spider #3237

Open whalebot-helmsman opened 6 years ago

whalebot-helmsman commented 6 years ago

Direct implementation of https://github.com/scrapy/scrapy/pull/1051#issuecomment-89024196, related https://github.com/scrapy/scrapy/issues/456

Add a new method start_requests_with_control to the spider interface. I can change the method name to any proposed one. By default the new method uses the old one and yields WaitUntilQueueEmpty after every request, so the changes are ~almost~ backward compatible. ~The incompatibility comes from the way Python handles exceptions in functions used as generators.~
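
To illustrate the intended usage (a sketch only: the URLs are placeholders, and the method and marker names are the ones proposed in this PR, so they may still change), a spider could override the new method to schedule its start requests in batches instead of one at a time:

    import scrapy
    from scrapy.spiders import WaitUntilQueueEmpty  # marker location as proposed in this PR


    class BatchedStartSpider(scrapy.Spider):
        # Schedules start requests in batches of 10, letting the scheduler queue
        # drain between batches; dropping the marker entirely would let the engine
        # consume all start requests up front.
        name = 'batched'
        start_urls = ['https://example.com/page/%d' % n for n in range(100)]

        def start_requests_with_control(self):
            for i, url in enumerate(self.start_urls, start=1):
                yield scrapy.Request(url, dont_filter=True)
                if i % 10 == 0:
                    # Tell the engine to stop consuming start requests for now.
                    yield WaitUntilQueueEmpty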

Obsolete info (preserved for historical reasons)

~Old behaviour: an exception raised in start_requests was raised during crawler creation (https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/crawler.py#L81).~ ~New behaviour: the same exception is raised while obtaining the next request (https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/core/engine.py#L127).~

~That is the reason why I had to switch from the old method to the new one here: https://github.com/whalebot-helmsman/scrapy/blob/6b4303c764dcafd744621110871f394981431d8f/tests/test_crawl.py#L243.~

~To reproduce the old behaviour we can additionally check the spider status here: https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/core/engine.py#L131~

codecov[bot] commented 6 years ago

Codecov Report

Merging #3237 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #3237   +/-   ##
=======================================
  Coverage   84.79%   84.79%           
=======================================
  Files         163      163           
  Lines        9990     9999    +9     
  Branches     1488     1490    +2     
=======================================
+ Hits         8471     8479    +8     
- Misses       1255     1256    +1     
  Partials      264      264           
Impacted Files               Coverage Δ
scrapy/core/engine.py        86.75% <100.00%> (-0.70%) ↓
scrapy/crawler.py            88.26% <100.00%> (ø)
scrapy/spiders/__init__.py   100.00% <100.00%> (ø)
scrapy/utils/trackref.py     85.71% <0.00%> (+2.85%) ↑
lopuhin commented 6 years ago

@whalebot-helmsman I'm trying to understand how changing where the exception is caught could cause backwards compatibility issues in practice - do you have any examples in mind where behavior of user code will change?

whalebot-helmsman commented 6 years ago

do you have any examples in mind where behavior of user code will change?

I don't know, but there is a special test case for it

whalebot-helmsman commented 6 years ago

@lopuhin a little bit of generator magic and now it is fully backward compatible

lopuhin commented 6 years ago

@whalebot-helmsman sorry that I didn't have time to review all the changes, but what I wanted to ask was: is it possible to avoid adding a new method and support this feature in the old start_requests?

whalebot-helmsman commented 6 years ago

@lopuhin It is possible to support this feature in start_requests. In that case existing spiders would have to be rewritten to adapt to the new behaviour, e.g. spiders consuming URLs from an external queue.

kmike commented 6 years ago

@lopuhin

is it possible to avoid adding a new method and support this feature in the old start_requests?

It is not possible: currently spiders send "WaitUntilQueueEmpty" implicitly after each request in start_requests; making it explicit is not backwards-compatible.

@whalebot-helmsman

My initial idea was to support this in async def start_requests, where we don't have to be backwards compatible. But by creating a separate method we can make it work in older Python versions, though it makes the Scrapy API more complex. As it is a pretty big API change, I'd like to hear more proposals for the method name (//cc @dangra @redapple @eliasdorneles). Ideas:

whalebot-helmsman commented 5 years ago

Can we revive this PR, as https://github.com/scrapy/scrapy/pull/3362 is not merged?

Gallaecio commented 4 years ago

@kmike Would it be OK to get this merged at this point, or would you prefer to delay it until Scrapy 2.0?

If it’s OK to merge, I’m OK with any of the names you proposed (init_requests, seed_requests). We could also name it feed_requests, for example, in line with https://github.com/ejulio/spider-feeder. In any case, I assume we would still drop the function in Scrapy 2.0, so to me the name we choose for the function is not that important in this case.

kmike commented 4 years ago

@Gallaecio I think we shouldn't merge it. @wRAR is going to work on async def start_requests support, which would fix this. That said, it's a good point that we may want to change the behavior for regular start_requests in 2.0.

whalebot-helmsman commented 4 years ago

I think it may be better to go for a more readable implementation of what is start_requests_with_control in the current change set, and favor readability even if it takes more lines.

At first a very readable implementation was provided. That implementation broke the exception contract enforced in a test. To keep the contract, the implementation was changed in https://github.com/scrapy/scrapy/pull/3237/commits/ea613bb86e7854d3aa073291ab78b5d21cf11ad4. There are more details in the "Obsolete info (preserved for historical reasons)" section. If we aren't interested in preserving the exception contract, we can revert to the first implementation.

Should we consider moving WaitUntilQueueEmpty somewhere else?

At first it was located in signals.py (https://github.com/scrapy/scrapy/pull/3237/commits/75d5aaeb41db66e1062eb060e3650b9fc9c2dda2). At @kmike's request (https://github.com/scrapy/scrapy/pull/3237#discussion_r184750157) I put it into Spider. Can we agree on the proper place?

Gallaecio commented 4 years ago

At first it was located in signals.py (75d5aae). At @kmike's request (#3237 (comment)) I put it into Spider. Can we agree on the proper place?

I believe @kmike meant declaring it as scrapy.spiders.WaitUntilQueueEmpty. That works for me.
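
For reference, I am picturing nothing more than a bare marker declared in scrapy/spiders/__init__.py, roughly like this sketch (the PR may of course implement it differently):

    class WaitUntilQueueEmpty:
        """Marker yielded from start_requests_with_control to tell the engine
        to stop consuming start requests until the scheduler queue drains."""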

Gallaecio commented 4 years ago

At first a very readable implementation was provided. That implementation broke the exception contract enforced in a test. To keep the contract, the implementation was changed in ea613bb. There are more details in the "Obsolete info (preserved for historical reasons)" section. If we aren't interested in preserving the exception contract, we can revert to the first implementation.

I think the new implementation can be made more readable nonetheless. I’m thinking, for example, that you could define gen_ within the method itself, as a generator function, something along the lines of:

    def start_requests_with_control(self):
        def gen_():
            pass  # TODO: readable code

        return __flatten(gen_())

whalebot-helmsman commented 4 years ago

@Gallaecio I tried it here: https://github.com/whalebot-helmsman/scrapy/commit/67146b20b67c3da2a83214c252b9d42df6941111 (in https://github.com/whalebot-helmsman/scrapy/tree/eagerness-fail). tests/test_crawl.py fails with such an implementation. Do you have something concrete in mind?

whalebot-helmsman commented 4 years ago

I believe @kmike meant declaring it as scrapy.spiders.WaitUntilQueueEmpty. That works for me.

done in https://github.com/scrapy/scrapy/pull/3237/commits/6ab3114d74978f814f4e4dd32fd693183b2f4b11

Gallaecio commented 4 years ago

This works:

diff --git a/scrapy/crawler.py b/scrapy/crawler.py
index 54fba4ac..6243cf12 100644
--- a/scrapy/crawler.py
+++ b/scrapy/crawler.py
@@ -87,7 +87,7 @@ class Crawler:
         try:
             self.spider = self._create_spider(*args, **kwargs)
             self.engine = self._create_engine()
-            start_requests = iter(self.spider.start_requests_with_control())
+            start_requests = self.spider.start_requests_with_control()
             yield self.engine.open_spider(self.spider, start_requests)
             yield defer.maybeDeferred(self.engine.start)
         except Exception:
diff --git a/scrapy/spiders/__init__.py b/scrapy/spiders/__init__.py
index 51d9f216..0be75a57 100644
--- a/scrapy/spiders/__init__.py
+++ b/scrapy/spiders/__init__.py
@@ -82,9 +82,15 @@ class Spider(object_ref):
                 yield Request(url, dont_filter=True)

     def start_requests_with_control(self):
-        sig = WaitUntilQueueEmpty
-        gen_ = ((r, sig) if r != sig else (r,) for r in self.start_requests())
-        return iflatten(gen_)
+        _start_requests = self.start_requests()
+
+        def generator():
+            for request in _start_requests:
+                yield request
+                if request != WaitUntilQueueEmpty:
+                    yield WaitUntilQueueEmpty
+
+        return generator()

     def make_requests_from_url(self, url):
         """ This method is deprecated. """

It turns out the issue is that the current code, with a generator expression, evaluates self.start_requests before the actual generator items are iterated. And that seems to be expected by that one failing test.

If self.start_requests does not get evaluated first and fails during iteration instead, the error is raised at _next_request in the engine, which catches the exception and logs an error. So the crawler exits normally.

I’m not sure what the right behavior is here. And I wonder if the current behavior won’t be an issue when adding coroutine syntax support for start_requests (cc @wRAR).

I think it may make sense to remove that test (we already have test_start_requests_bug_before_yield and test_start_requests_bug_yielding), and use yield directly in start_requests_with_control, no need even for a nested generator function there.

test_graceful_crawl_error_handling tests that Crawler.crawl fails right away when an exception is raised from start_requests, but I think it makes sense for such errors to be caught after the spider opens, during the actual crawl.

Also, that test will fail if the exception happens after the first yield anyway. You can see this by adding yield {} right before raise TestError in the test.
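
This is plain generator semantics in Python, nothing Scrapy-specific; a minimal sketch of the behavior in question:

    class TestError(Exception):
        pass

    def start_requests():
        yield {}
        raise TestError  # only surfaces once iteration reaches this point

    requests = start_requests()  # no exception yet: the generator body has not run
    next(requests)               # the first item is produced normally
    next(requests)               # TestError is raised here, during iteration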

wRAR commented 4 years ago

Note that Crawler.start_requests is first passed to the process_start_requests methods of the spider middlewares, but that shouldn't be a problem as they are documented to take an iterable.

Regarding coroutines, this shouldn't matter to ones that return a list (or a generator), but for async generator support start_requests_with_control() would need to be changed to return an async generator that uses async for request in _start_requests.
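
Roughly along these lines (a sketch only, mirroring the current sync implementation; the exact shape will depend on how the engine ends up consuming async iterators in #4467):

    def start_requests_with_control(self):
        # Variant for an async def start_requests(): it returns an async
        # generator, so we drive it with `async for` and return an async
        # generator ourselves.
        _start_requests = self.start_requests()

        async def agenerator():
            async for request in _start_requests:
                yield request
                if request != WaitUntilQueueEmpty:
                    yield WaitUntilQueueEmpty

        return agenerator()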

wRAR commented 3 years ago

CrawlTestCase.test_start_requests_lazyness fails with these changes (it was re-enabled in a later PR than this one).

If I understand the test correctly, a number in seedsseen corresponds to a callback for a Request from start_requests, while None corresponds to a callback for a Request from parse. With the current code the order is random; the first None on my machine occurs after the number 26. With these changes all numbers occur first, in order (so all 100 requests from start_requests are processed in order, and then the requests from the callback are processed). Is this behavior change desired?

And if we remove start_requests_with_control() but keep the _next_request() change, to check how the "correct" handling would work, the numbers occur in reverse order, with the first None occurring after 68. This fails the test (which can be fixed), but it also shows that the start_requests processing order has changed for some reason, and I don't think that is desired.

wRAR commented 3 years ago

Oh, it looks like it was already discussed above :-/

wRAR commented 3 years ago

Though it seems that the behavior without start_requests_with_control wasn't covered.

As I see it, we want to use start_requests_with_control for normal start_requests and use the user-provided start_requests directly if it's async def. That means we need to support both behaviors and also test them both.

whalebot-helmsman commented 3 years ago

I started this PR more than 2 years ago, before asyncio support in Scrapy. I wanted to add one feature: consume all requests from start_requests before scheduling them for the downloader. Is this feature available now by using asyncio only in start_requests?

wRAR commented 3 years ago

@whalebot-helmsman no, the only relation to asyncio is that we want the old behavior for def start_requests() and the new correct behavior for async def start_requests(), to not break the existing code. So I'm working on integrating the changes of this PR on top of #4467. I intended to apply the _next_request change, then call start_requests_with_control for def start_requests() but call async def start_requests() directly.

elacuesta commented 3 years ago

Just to be 100% clear, asyncio (the asyncio reactor, I mean) should not be needed for async def start_requests, right?

wRAR commented 3 years ago

@elacuesta correct, the reactor (and the module) are unrelated, only the async def is used as a marker.

wRAR commented 3 years ago

I'm currently trying to write and run tests for this feature. I'm using a spider with CONCURRENT_REQUESTS=1 and yielding some requests from start_requests and some requests from parse, stopping the spider using CLOSESPIDER_PAGECOUNT=10. With the current Scrapy code I see that only 3 requests were from start_requests, which looks like the current behavior to me. With the _next_request changes from this PR all 10 requests are from start_requests, both when start_requests_with_control is used and when it isn't. I understand this as a behavior change, but maybe I'm missing something. I've published this test added on top of this branch here.
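
The shape of that test spider is roughly the following (a sketch only; the spider name, URLs and bookkeeping are placeholders, the actual test lives in the branch linked above):

    import scrapy


    class OrderSpider(scrapy.Spider):
        name = 'order'
        custom_settings = {
            'CONCURRENT_REQUESTS': 1,
            'CLOSESPIDER_PAGECOUNT': 10,
        }

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Records whether each downloaded page came from start_requests or parse.
            self.origins = []

        def start_requests(self):
            for n in range(10):
                yield scrapy.Request('https://example.com/seed/%d' % n,
                                     dont_filter=True,
                                     cb_kwargs={'origin': 'start_requests'})

        def parse(self, response, origin):
            self.origins.append(origin)
            yield scrapy.Request(response.urljoin('/follow/%d' % len(self.origins)),
                                 dont_filter=True,
                                 cb_kwargs={'origin': 'parse'})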

wRAR commented 3 years ago

Here is my understanding of the old and the new behavior based on an old conversation with @whalebot-helmsman.

The old code (slightly simplified):

# Move requests from the scheduler to the downloader until it's full.
while not self._needs_backout(spider):
    if not self._next_request_from_scheduler(spider):
        break

# If the downloader is not full, send one request from start_requests to the scheduler.
# This doesn't send requests to the downloader.
if slot.start_requests and not self._needs_backout(spider):
    request = next(slot.start_requests)
    self.crawl(request, spider)

The new code (slightly simplified in the same way):

# If the downloader is not full, send all requests from start_requests to the scheduler,
# until WaitUntilQueueEmpty (so just one request, if WaitUntilQueueEmpty is added after each).
# This doesn't send requests to the downloader.
while not self._needs_backout(spider) and slot.start_requests:
    request = next(slot.start_requests)
    if request == WaitUntilQueueEmpty:
        break
    self.crawl(request, spider)

# Move requests from the scheduler to the downloader until it's full.
while not self._needs_backout(spider):
    if not self._next_request_from_scheduler(spider):
        break

I'm currently studying how these behavior changes affect my test cases.

wRAR commented 3 years ago

If I compare the behavior of the old code and of the new code in "compatibility mode" (with WaitUntilQueueEmpty emitted) with CONCURRENT_REQUESTS=1 and the queues changed from LIFO to FIFO, then the behavior is the same, and so we can test this in autotests.
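
For the record, "changed from LIFO to FIFO" here means settings along these lines (the queue classes are the standard ones from scrapy.squeues):

    # settings.py (or custom_settings on the test spider)
    CONCURRENT_REQUESTS = 1
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'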

If I use the real-world settings, with LIFO and CONCURRENT_REQUESTS=5, the new code always uses the requests from start_requests first, even with WaitUntilQueueEmpty. I think it's because the new code schedules a request from start_requests and immediately sends it to the downloader (because the queue is LIFO).

On the other hand, if we don't emit WaitUntilQueueEmpty, the behavior is similar to the old one, which is unexpected. This is because the requests from start_requests are moved to the scheduler right after the spider starts and are never reached because of LIFO.

@whalebot-helmsman what do you think? Am I missing something? Is this intended to be used with FIFO? Or maybe it only makes sense when the priorities are different?

whalebot-helmsman commented 3 years ago

@wRAR I am not sure, but FIFO or LIFO (and DEPTH_PRIORITY) shouldn't have an effect here. It is about timing: whether parse finishes before https://github.com/whalebot-helmsman/scrapy/blob/6ab3114d74978f814f4e4dd32fd693183b2f4b11/scrapy/core/engine.py#L137-L139 and we already have new URLs scheduled. I am thinking about how to improve the current situation.

whalebot-helmsman commented 3 years ago

Now I understand why LIFO or FIFO matters. Nothing is wrong with the engine; we just collect information in the wrong place (after a request was crawled). We should record the moment a request is passed to the scheduler, not the moment after it is crawled. tests/test_crawl.py::CrawlTestCase::test_start_requests_lazyness fails for the same reason in this branch. Let me fix it and provide a way to check for the right kind of events.
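
One possible way to capture that moment in a test is to listen to the existing request_scheduled signal; a sketch (the helper name and bookkeeping are placeholders):

    from scrapy import signals


    class ScheduleOrderRecorder:
        # Extension that records the order in which requests reach the scheduler.

        def __init__(self, crawler):
            self.scheduled_urls = []
            crawler.signals.connect(self.request_scheduled,
                                    signal=signals.request_scheduled)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def request_scheduled(self, request, spider):
            self.scheduled_urls.append(request.url)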

whalebot-helmsman commented 3 years ago

@wRAR I was able to come up with tests for both eagerness and laziness. They work with the default settings (CONCURRENT_REQUESTS = 16 and SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'). To do this I had to change two things:

wRAR commented 3 years ago

Looks like WaitUntilQueueEmpty will be seen in process_start_requests middleware methods, and will probably break them if they assume all items are Requests. The docs say that the method receives and returns an iterable of Requests. How should we handle this?

whalebot-helmsman commented 3 years ago

@wRAR I couldn't find a good solution here. It is possible to write something like flatten(map(lambda r: process_start_requests((r,)) if isinstance(r, Request) else (r,), requests)) - something completely unreadable that also requires the ability to run process_start_requests several times consecutively. Calling process_start_requests only once is stated nowhere in the documentation, but it is assumed. I propose changing the contract for process_start_requests and requiring it to handle the WaitUntilQueueEmpty object. There are 30k search results for https://github.com/search?p=1&q=%22def+process_start_requests%28%22&type=Code, most of them generated code, like

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

some of which already check whether the item is a request

https://github.com/alexmason528/torforum_crawler/blob/70964905d79818501c9fc32d3e6417dcc6eaa828/scrapyprj/middlewares/start_requests_cookie_middleware.py#L12-L18

out of a habit created by the other middleware methods.

Based on this, I think there is not a lot of affected code.
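
Under the changed contract, a middleware that touches the items would only need to pass non-Request markers through; a sketch (the middleware name and meta key are made up for illustration):

    from scrapy import Request


    class ExampleStartRequestsMiddleware:
        def process_start_requests(self, start_requests, spider):
            for r in start_requests:
                if not isinstance(r, Request):
                    # Pass markers such as WaitUntilQueueEmpty through untouched.
                    yield r
                    continue
                r.meta['from_start_requests'] = True
                yield r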

wRAR commented 3 years ago

Changing the contract would be the easiest way from our side, definitely. We don't like to break compatibility though.

Cc: @kmike

wRAR commented 3 years ago

As it was decided some time ago that the old behavior should be kept for def start_requests but the new behavior should be the default for async def start_requests, and I don't see a way to clearly separate this into two PRs, I've added the code to #4467. It merges this PR and adds the following changes on top (commit 94ce930):

We hope that in the future we will be able to change the architecture to keep the behavior in simpler cases but add a way for users to wait on some "queue is empty" event (ideas and prototypes are welcome!).

psdon commented 3 years ago

any news on this PR guys?

wRAR commented 3 years ago

@psdon this feature was integrated into #4467, which unfortunately is currently frozen, as we still need to discuss and decide on uncertainties in the interactions between these two features.