whalebot-helmsman opened 6 years ago
Merging #3237 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##           master    #3237   +/-  ##
=======================================
  Coverage   84.79%   84.79%
=======================================
  Files         163      163
  Lines        9990     9999    +9
  Branches     1488     1490    +2
=======================================
+ Hits         8471     8479    +8
- Misses       1255     1256    +1
  Partials      264      264
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| scrapy/core/engine.py | 86.75% <100.00%> (-0.70%) | :arrow_down: |
| scrapy/crawler.py | 88.26% <100.00%> (ø) | |
| scrapy/spiders/__init__.py | 100.00% <100.00%> (ø) | |
| scrapy/utils/trackref.py | 85.71% <0.00%> (+2.85%) | :arrow_up: |
@whalebot-helmsman I'm trying to understand how changing where the exception is caught could cause backwards compatibility issues in practice. Do you have any examples in mind where the behavior of user code would change?
I don't know, but there is a special test case for it
@lopuhin a little bit of generator magic and now it is fully backward compatible
@whalebot-helmsman sorry that I didn't have time to review all the changes, but what I wanted to ask was: is it possible to avoid adding a new method, and support this feature in the old `start_requests`?
@lopuhin It is possible to support this feature in `start_requests`. In that case, existing spiders would need to be rewritten to adapt to the new behaviour, e.g. spiders consuming URLs from an external queue.
@lopuhin

> is it possible to avoid adding a new method, and support this feature in an old `start_requests`?

It is not possible: currently spiders send `WaitUntilQueueEmpty` implicitly after each request in `start_requests`; making it explicit is not backwards-compatible.
@whalebot-helmsman
My initial idea was to support this in `async def start_requests`, where we don't have to be backwards compatible. But by creating a separate method we can make it work in older Python versions, though it makes the Scrapy API more complex. As it is a pretty big API change, I'd like to hear more proposals for the method name (//cc @dangra @redapple @eliasdorneles). Ideas:
Can we revive this PR, as https://github.com/scrapy/scrapy/pull/3362 is not merged?
@kmike Would it be OK to get this merged at this point, or would you prefer to delay it until Scrapy 2.0?
If it's OK to merge, I'm OK with any of the names you proposed (`init_requests`, `seed_requests`). We could also name it `feed_requests`, for example, in line with https://github.com/ejulio/spider-feeder. In any case, I assume we would still drop the function in Scrapy 2.0, so to me the name we choose for the function is not that important in this case.
@Gallaecio I think we shouldn't merge it. @wRAR is going to work on `async def start_requests` support, which would fix this. That said, it's a good point that we may want to change the behavior for regular `start_requests` in 2.0.
I think it may be better to go for a more readable implementation of what is `start_requests_with_control` in the current changeset, and favor readability even if it takes more lines.
At first a very readable implementation was provided, but it broke the exception contract enforced in a test. To keep the contract, the implementation was changed in https://github.com/scrapy/scrapy/pull/3237/commits/ea613bb86e7854d3aa073291ab78b5d21cf11ad4. There are more details in the "Obsolete info (preserved for historical reasons)" section. If we aren't interested in preserving the exception contract, we can revert to the first implementation.
Should we consider moving `WaitUntilQueueEmpty` somewhere else?
At first it was located in `signals.py` (https://github.com/scrapy/scrapy/pull/3237/commits/75d5aaeb41db66e1062eb060e3650b9fc9c2dda2). At @kmike's request (https://github.com/scrapy/scrapy/pull/3237#discussion_r184750157) I put it into `Spider`. Can we agree on the proper place?
> At first it was located in `signals.py` (75d5aae). At @kmike's request (#3237 (comment)) I put it into `Spider`. Can we agree on the proper place?
I believe @kmike meant declaring it as `scrapy.spiders.WaitUntilQueueEmpty`. That works for me.
> At first a very readable implementation was provided. This implementation breaks the exception contract enforced in a test. To keep the contract, the implementation was changed in ea613bb. There are more details in the "Obsolete info (preserved for historical reasons)" section. If we aren't interested in preserving the exception contract, we can revert to the first implementation.
I think the new implementation can be made more readable nonetheless. I'm thinking, for example, that you could define `gen_` within the method itself, as a generator function, something along the lines of:

```python
def start_requests_with_control(self):
    def gen_():
        pass  # TODO: readable code
    return iflatten(gen_())
```
@Gallaecio tried it here https://github.com/whalebot-helmsman/scrapy/commit/67146b20b67c3da2a83214c252b9d42df6941111 (in https://github.com/whalebot-helmsman/scrapy/tree/eagerness-fail). `tests/test_crawl.py` fails with that implementation. Do you have something concrete in mind?
> I believe @kmike meant declaring it as `scrapy.spiders.WaitUntilQueueEmpty`. That works for me.
done in https://github.com/scrapy/scrapy/pull/3237/commits/6ab3114d74978f814f4e4dd32fd693183b2f4b11
This works:
```diff
diff --git a/scrapy/crawler.py b/scrapy/crawler.py
index 54fba4ac..6243cf12 100644
--- a/scrapy/crawler.py
+++ b/scrapy/crawler.py
@@ -87,7 +87,7 @@ class Crawler:
         try:
             self.spider = self._create_spider(*args, **kwargs)
             self.engine = self._create_engine()
-            start_requests = iter(self.spider.start_requests_with_control())
+            start_requests = self.spider.start_requests_with_control()
             yield self.engine.open_spider(self.spider, start_requests)
             yield defer.maybeDeferred(self.engine.start)
         except Exception:
diff --git a/scrapy/spiders/__init__.py b/scrapy/spiders/__init__.py
index 51d9f216..0be75a57 100644
--- a/scrapy/spiders/__init__.py
+++ b/scrapy/spiders/__init__.py
@@ -82,9 +82,15 @@ class Spider(object_ref):
             yield Request(url, dont_filter=True)
 
     def start_requests_with_control(self):
-        sig = WaitUntilQueueEmpty
-        gen_ = ((r, sig) if r != sig else (r,) for r in self.start_requests())
-        return iflatten(gen_)
+        _start_requests = self.start_requests()
+
+        def generator():
+            for request in _start_requests:
+                yield request
+                if request != WaitUntilQueueEmpty:
+                    yield WaitUntilQueueEmpty
+
+        return generator()
 
     def make_requests_from_url(self, url):
         """ This method is deprecated. """
```
It turns out the issue is that the current code, with a generator expression, gets `self.start_requests` evaluated before the actual generator items are iterated. And that seems to be expected by that one failing test.
If `self.start_requests` does not get evaluated first, and fails during iteration, the error is raised at `_next_request` in the engine, which catches the exception and logs an error. So the crawler exits normally.
I'm not sure what the right behavior is here. And I wonder if the current behavior won't be an issue when adding coroutine syntax support for `start_requests` (cc @wRAR).
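The evaluation-timing difference described above can be shown outside Scrapy. This is a minimal sketch (not the PR's code) of why an exception raised before the first `yield` of a generator function surfaces only when the generator is iterated, not when the function is called:

```python
# A plain function raises immediately on call; a generator function's
# body does not run until the generator is iterated.
def eager():
    raise ValueError("boom")  # raises on call

def lazy():
    raise ValueError("boom")  # body not run on call
    yield  # the yield makes this a generator function

try:
    eager()
except ValueError:
    print("eager: raised on call")

gen = lazy()   # no exception here; the body has not started
try:
    next(gen)  # the body runs now, so the exception surfaces here
except ValueError:
    print("lazy: raised on first next()")
```

This is why wrapping `start_requests` in a generator moves the exception from crawler creation to `_next_request` on the engine.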
I think it may make sense to remove that test (we already have `test_start_requests_bug_before_yield` and `test_start_requests_bug_yielding`), and use `yield` directly in `start_requests_with_control`; no need even for a nested generator function there.
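A sketch of that suggestion (names taken from the PR; the `Spider` stand-in and sample requests are assumptions). Yielding directly makes `start_requests_with_control` itself a generator function, so no nested generator is needed:

```python
# Simplified stand-ins for Scrapy's Spider and the WaitUntilQueueEmpty marker.
WaitUntilQueueEmpty = object()

class Spider:
    def start_requests(self):
        yield "request-1"
        yield "request-2"

    def start_requests_with_control(self):
        # yield directly instead of returning a nested generator
        for request in self.start_requests():
            yield request
            if request is not WaitUntilQueueEmpty:
                yield WaitUntilQueueEmpty

items = list(Spider().start_requests_with_control())
print(len(items))  # 4: each request is followed by the marker
```

The trade-off, as discussed above, is that this makes the whole method lazy, so exceptions from `start_requests` only surface on iteration.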
`test_graceful_crawl_error_handling` tests that `Crawler.crawl` fails right away when an exception is raised from `start_requests`, but I think it makes sense for such errors to be caught after the spider opens, during the actual crawl.
Also, that test will fail if the exception happens after the first yield anyway. You can see this by adding `yield {}` right before `raise TestError` in the test.
Note that `Crawler.start_requests` is first passed to the `process_start_requests` methods of spider middlewares, but that shouldn't be a problem as they are documented to take an iterable.
Regarding coroutines, this shouldn't matter for ones that return a list (or a generator), but for async generator support `start_requests_with_control()` would need to be changed to return an async generator that uses `async for request in _start_requests`.
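As a rough sketch of what that async variant could look like (the function names and sample requests here are assumptions, not the PR's code), the wrapper re-yields each item and emits the marker after every real request, using `async for`:

```python
import asyncio

WaitUntilQueueEmpty = object()  # stand-in for the real marker

async def start_requests():
    # stand-in for an async def start_requests yielding two requests
    yield "request-1"
    yield "request-2"

async def with_control(async_requests):
    # async-generator version of start_requests_with_control
    async for request in async_requests:
        yield request
        if request is not WaitUntilQueueEmpty:
            yield WaitUntilQueueEmpty

async def main():
    return [item async for item in with_control(start_requests())]

print(len(asyncio.run(main())))  # 4: two requests, each followed by the marker
```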
`CrawlTestCase.test_start_requests_lazyness` fails with these changes (it was re-enabled in a later PR than this one).
If I understand the test correctly, a number in `seedsseen` corresponds to a callback for a Request from `start_requests`, while `None` corresponds to a callback for a Request from `parse`. With the current code the order is random; on my machine the first `None` occurs after the number 26. With these changes all numbers occur first, in order (so all 100 requests from `start_requests` are processed in order, and then requests from the callback are processed). Is this behavior change desired?
And if we remove `start_requests_with_control()` but keep the `_next_request()` change, to check how the "correct" handling would work, the numbers occur in reverse order, with the first `None` occurring after 68. This fails the test (which can be fixed), but it also shows that the `start_requests` processing order has changed for some reason, and I don't think that is desired.
Oh, it looks like it was already discussed above :-/
Though it seems that the behavior without `start_requests_with_control` wasn't covered.
As I see it, we want to use `start_requests_with_control` for a normal `start_requests`, and use the user-provided `start_requests` directly if it's `async def`. That means we need to support both behaviors and also test them both.
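One possible shape for that dispatch (a sketch with assumed helper names; the real integration lives in #4467) is to check whether `start_requests` is an async generator function and only wrap the plain-`def` case:

```python
import inspect

WaitUntilQueueEmpty = object()  # stand-in for the real marker

def with_control(requests):
    # old behavior: emit the marker after each request
    for r in requests:
        yield r
        if r is not WaitUntilQueueEmpty:
            yield WaitUntilQueueEmpty

def get_start_requests(spider):
    # async def start_requests: use it directly (new behavior)
    if inspect.isasyncgenfunction(spider.start_requests):
        return spider.start_requests()
    # plain def: wrap it for backward compatibility (old behavior)
    return with_control(spider.start_requests())

class OldSpider:
    def start_requests(self):
        yield "r1"

class NewSpider:
    async def start_requests(self):
        yield "r1"

print(len(list(get_start_requests(OldSpider()))))  # 2: request + marker
print(hasattr(get_start_requests(NewSpider()), "__anext__"))  # True
```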
I started this PR more than 2 years ago, before asyncio in Scrapy. I wanted to present one feature: consume all requests from `start_requests` before scheduling them for the downloader. Is this feature available now using only asyncio in `start_requests`?
@whalebot-helmsman no, the only relation to asyncio is that we want the old behavior for `def start_requests()` and the new, correct behavior for `async def start_requests()`, so as not to break existing code. So I'm working on integrating the changes of this PR on top of #4467. I intend to apply the `_next_request` change, then call `start_requests_with_control` for `def start_requests()` but call `async def start_requests()` directly.
Just to be 100% clear, `asyncio` (the asyncio reactor, I mean) should not be needed for `async def start_requests`, right?
@elacuesta correct, the reactor (and the module) are unrelated; only the `async def` is used as a marker.
I'm currently trying to write and run tests for this feature. I'm using a spider with `CONCURRENT_REQUESTS=1`, yielding some requests from `start_requests` and some requests from `parse`, and stopping the spider with `CLOSESPIDER_PAGECOUNT=10`. With the current Scrapy code I see that only 3 requests came from `start_requests`, which looks like the current behavior to me. With the `_next_request` changes from this PR all 10 requests come from `start_requests`, both when `start_requests_with_control` is used and when it isn't. I understand this as a behavior change, but maybe I'm missing something. I've published this test, added on top of this branch, here.
Here is my understanding of the old and the new behavior, based on an old conversation with @whalebot-helmsman.
The old code (slightly simplified):

```python
# Move requests from the scheduler to the downloader until it's full.
while not self._needs_backout(spider):
    if not self._next_request_from_scheduler(spider):
        break

# If the downloader is not full, send one request from start_requests to the scheduler.
# This doesn't send requests to the downloader.
if slot.start_requests and not self._needs_backout(spider):
    request = next(slot.start_requests)
    self.crawl(request, spider)
```

The new code (slightly simplified in the same way):

```python
# If the downloader is not full, send all requests from start_requests to the scheduler,
# until WaitUntilQueueEmpty (so just one request, if WaitUntilQueueEmpty is added after each).
# This doesn't send requests to the downloader.
while not self._needs_backout(spider) and slot.start_requests:
    request = next(slot.start_requests)
    if request == WaitUntilQueueEmpty:
        break
    self.crawl(request, spider)

# Move requests from the scheduler to the downloader until it's full.
while not self._needs_backout(spider):
    if not self._next_request_from_scheduler(spider):
        break
```
I'm currently studying how these behavior changes affect my test cases.
If I compare the behavior of the old code and the new code in "compatibility mode" (emitting `WaitUntilQueueEmpty`) with `CONCURRENT_REQUESTS=1` and the queues changed from LIFO to FIFO, then the behavior is the same, and so we can test this in autotests.
If I use real-world settings, with LIFO and `CONCURRENT_REQUESTS=5`, the new code always uses the requests from `start_requests` first, even with `WaitUntilQueueEmpty`. I think this is because the new code schedules a request from `start_requests` and immediately sends it to the downloader (because the queue is LIFO).
On the other hand, if we don't emit `WaitUntilQueueEmpty`, the behavior is similar to the old one, which is unexpected. This is because the requests from `start_requests` are moved to the scheduler right after the spider starts and are never reached, because of LIFO.
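The LIFO effect described above can be reduced to a toy model (this is an assumption-level sketch using a plain deque, not Scrapy's scheduler): the request that was just scheduled sits on top of the stack, so it is handed straight back to the downloader.

```python
from collections import deque

# Toy model of the scheduler queue: one earlier request is waiting,
# then the new engine loop schedules a fresh start request.
scheduler = deque(["earlier-request"])
scheduler.append("fresh-start-request")

# LIFO hands back the request that was just scheduled.
print(scheduler.pop())  # fresh-start-request

# With FIFO, the earlier request would be downloaded first.
scheduler = deque(["earlier-request", "fresh-start-request"])
print(scheduler.popleft())  # earlier-request
```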
@whalebot-helmsman what do you think? Am I missing something? Is this intended to be used with FIFO? Or maybe it only makes sense when the priorities are different?
@wRAR I am not sure, but FIFO or LIFO (and `DEPTH_PRIORITY`) shouldn't have an effect here. It is about timing: whether `parse` finishes before https://github.com/whalebot-helmsman/scrapy/blob/6ab3114d74978f814f4e4dd32fd693183b2f4b11/scrapy/core/engine.py#L137-L139 and we have new URLs scheduled.
I am thinking about how to improve the current situation.
Now I understand why LIFO or FIFO matters. There is nothing wrong with the engine; we just collect information in the wrong place (after the request was crawled). We should record the moment a request is passed to the scheduler, not the moment after crawling.
`tests/test_crawl.py::CrawlTestCase::test_start_requests_lazyness` fails for the same reason in this branch. Let me fix it and provide a way to check for the right kind of events.
@wRAR I was able to come up with tests for both eagerness and laziness. It works with the default settings (`CONCURRENT_REQUESTS = 16` and `SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'`). To do this I had to change two things:
- `number_of_start_requests = 100` (50 also works; using only 25 breaks `test_start_requests_lazyness`), so we have time to get a response for the first scheduled request

Looks like `WaitUntilQueueEmpty` will be seen in `process_start_requests` middleware methods, and will probably break them if they assume all items are Requests. The docs say that method receives and returns iterables of Requests. How should we handle this?
@wRAR I couldn't find a good solution here.
It is possible to write something like `flatten(process_start_requests((r,)) if isinstance(r, Request) else (r,) for r in requests)`, which is completely unreadable and requires the ability to run `process_start_requests` several times consecutively. Calling `process_start_requests` only once is stated nowhere in the documentation, but it is assumed.
I propose to change the contract for `process_start_requests` and require handling of the `WaitUntilQueueEmpty` object. There are 30k search results for https://github.com/search?p=1&q=%22def+process_start_requests%28%22&type=Code; most of them are generated code, like

```python
def process_start_requests(start_requests, spider):
    for r in start_requests:
        yield r
```

and some already check for each item being a request, out of a habit created by other middleware methods.
Based on this, I think there is not a lot of code affected.
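Under the changed contract, a middleware would only need a one-line guard to pass the marker through. A sketch (the `Request` class and marker here are stand-ins, not Scrapy imports):

```python
class Request:  # stand-in for scrapy.Request
    def __init__(self, url):
        self.url = url

WaitUntilQueueEmpty = object()  # stand-in for the real marker

def process_start_requests(start_requests, spider=None):
    for r in start_requests:
        if not isinstance(r, Request):
            yield r  # pass non-Request markers through untouched
            continue
        # real per-request processing would go here
        yield r

items = list(process_start_requests(
    [Request("https://example.com"), WaitUntilQueueEmpty]))
print(len(items))  # 2: the request and the marker both pass through
```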
Changing the contract would be the easiest way from our side, definitely. We don't like to break compatibility though.
Cc: @kmike
As it was decided some time ago that the old behavior should be kept for `def start_requests` but the new behavior should be the default for `async def start_requests`, and I don't see a way to clearly separate this into two PRs, I've added the code to #4467. It merges this PR and adds the following changes on top (commit 94ce930):

- `start_requests_with_control` is removed, as the user can change the behavior by moving to `async def start_requests`
- the kind of `start_requests` function is determined, and either the old or the new code in `_next_request` is executed; to reduce code duplication I've extracted two functions from it

We hope that in the future we will be able to change the architecture to keep the behavior in simpler cases but add a way for users to wait on some "queue is empty" event (ideas and prototypes are welcome!).
any news on this PR guys?
@psdon this feature was integrated into #4467 which unfortunately is currently frozen as we still need to discuss and decide on uncertainties in interactions between these two features.
Direct implementation of https://github.com/scrapy/scrapy/pull/1051#issuecomment-89024196; related: https://github.com/scrapy/scrapy/issues/456
Adds a new method, `start_requests_with_control`, to the spider interface. I can change the method name to any proposed one. By default the new method uses the old one and yields `WaitUntilQueueEmpty` after every request, so the changes are ~almost~ backward compatible. ~The incompatibility comes from the way Python handles exceptions in functions used as generators.~

**Obsolete info (preserved for historical reasons)**

~Old behaviour: an exception raised in `start_requests` was raised during crawler creation (https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/crawler.py#L81).~
~New behaviour: the same exception is raised while obtaining the next request (https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/core/engine.py#L127).~
~That is the reason why I had to change the method from the old one to the new one here: https://github.com/whalebot-helmsman/scrapy/blob/6b4303c764dcafd744621110871f394981431d8f/tests/test_crawl.py#L243.~
~To reproduce the old behaviour we can additionally check for the spider status here: https://github.com/whalebot-helmsman/scrapy/blob/da1256a9c8452703213c96ebc32fae1524034442/scrapy/core/engine.py#L131~