Merging #4978 (8659bf4) into master (737f396) will increase coverage by 0.14%. The diff coverage is 98.44%.

:exclamation: Current head 8659bf4 differs from pull request most recent head c7b90c6. Consider uploading reports for the commit c7b90c6 to get more accurate results.
```diff
@@            Coverage Diff             @@
##           master    #4978      +/-   ##
==========================================
+ Coverage   88.70%   88.85%   +0.14%
==========================================
  Files         162      162
  Lines       10787    10962     +175
  Branches     1851     1894      +43
==========================================
+ Hits         9569     9740     +171
  Misses        942      942
- Partials      276      280       +4
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| scrapy/spidermiddlewares/urllength.py | 88.00% <90.90%> (-2.48%) | :arrow_down: |
| scrapy/spidermiddlewares/offsite.py | 98.41% <94.11%> (-1.59%) | :arrow_down: |
| scrapy/spidermiddlewares/depth.py | 97.67% <95.45%> (-2.33%) | :arrow_down: |
| scrapy/core/spidermw.py | 99.44% <99.08%> (-0.56%) | :arrow_down: |
| scrapy/core/scraper.py | 85.79% <100.00%> (+0.24%) | :arrow_up: |
| scrapy/middleware.py | 92.85% <100.00%> (ø) | |
| scrapy/spidermiddlewares/referer.py | 93.24% <100.00%> (+0.13%) | :arrow_up: |
| scrapy/utils/asyncgen.py | 100.00% <100.00%> (ø) | |
| scrapy/utils/defer.py | 97.29% <100.00%> (+1.33%) | :arrow_up: |
| scrapy/utils/python.py | 88.58% <100.00%> (+0.93%) | :arrow_up: |
| ... and 2 more | | |
Not sure why some related tests fail on 3.6 asyncio-pinned, I cannot reproduce this locally at this time.
Now I can reproduce them on a freshly built 3.6.12 while it worked on my old 3.6.9, both from pyenv.
It's Twisted 17.9.0-specific again.
pytest-twisted installs the asyncio reactor with `eventloop=None`, so on Twisted 17.9.0 it uses `new_event_loop()`. We added a workaround for this in Scrapy itself, in #4841 / #4872, but it's much harder to do for pytest-twisted, unless they add some support for that (or we drop 17.9.0). It's probably possible to install the reactor in `conftest.py` though.
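If we go the `conftest.py` route, a minimal sketch of what that could look like (an untested assumption on my side, not something from this PR): install the asyncio reactor ourselves with an explicit event loop before pytest-twisted does.

```python
# conftest.py
import asyncio

from twisted.internet import asyncioreactor

# Passing an explicit loop avoids the Twisted 17.9.0 code path that
# calls new_event_loop() when eventloop is None.
asyncioreactor.install(eventloop=asyncio.get_event_loop())
```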
`_AsyncCooperatorAdapter` coverage is quite good, but it does not seem complete yet. 2 out of 3 uncovered paths seem like error scenarios that should not usually happen, but I wonder why `__iter__` is never called; maybe it is not needed?
I see there are also a few uncovered paths in the new `_process_iterable_universal` decorator. Similarly, all seem like error scenarios which should not usually happen.

If testing those error scenarios is complicated, I'm OK with not covering them for now. But if it's trivial, it would be best to go for it.
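For context, a rough sketch of the dispatch idea behind a "universal" iterable processor (this is not the actual `_process_iterable_universal` code, just an illustration):

```python
import inspect

def process_iterable_universal(process_async, process_sync):
    # Return a wrapper that picks the right implementation based on
    # whether it received a sync or an async iterable.
    def process(iterable, *args):
        if inspect.isasyncgen(iterable):
            return process_async(iterable, *args)
        return process_sync(iterable, *args)
    return process
```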
Yeah, I checked the coverage for `_AsyncCooperatorAdapter` when writing tests. These two exceptions are just checks for things that should never happen, so I don't know how to test them. Initially I had many more asserts in that class but removed most of them; I kept these two because I was afraid that these problems wouldn't be noticed without an explicit check. On the other hand, `d.callback()` already checks for the same thing, which makes 1.5 of these 2 checks redundant.

As for `def __iter__`, that's just the iterable protocol, but as `Cooperator.coiterate` says it takes an iterator, we can drop it.
I'll look at `_process_iterable_universal` coverage.
:tada:
Who’s the next brave reviewer who dares to have a look at this patch?
One other problem I see is that treq is not visible in the logs, which seems like a drawback compared with inline requests. Is there some way to integrate the approach from this PR with Scrapy logging?
Ideally I would really love to see something like this in Scrapy someday:

```python
async def parse(self, response):
    res = await Request(some_url)
    # res.body is available here
```

Native integration, no treq, so that everything from my spider settings, like proxy settings and headers, is applied to the Request, and it goes smoothly, integrated with Scrapy like a charm.
Yeah, I'd say using this to run HTTP requests is more of a side effect; one day we indeed want some replacement for inline-requests (there was an old ticket about integrating it).
@pawelmhm I think the goal of this PR is to make it possible to have `async def parse` methods, and enable future improvements like having a built-in way to await Scrapy Requests. In its current form it allows you to use asyncio and Twisted libraries in callbacks much more easily, which is already pretty cool :)
Moving the middleware discussion to the top level.
These seem to be set in stone:
Now the discussion points:
For 7, keeping them sync is not an option, as that would cause downgrades for 6. Making them async, as mentioned above, would cause upgrades and downgrades too, at least if a sync middleware is present, which breaks 1 as it would produce warnings; the code paths would also be more complicated, and we don't want surprises for large projects that just upgrade the Scrapy version. So it looks like we want to make them universal. This doesn't seem to cause any problems except somewhat complicating the middleware code and requiring actual support for universal middlewares in Scrapy.
For 8, we'll need to enumerate cases to discuss them separately:
For 9, we'll need to think some more on it but so far it looks like the move will be smooth, with dropping sync support in various places.
For 10, there is one problem: Scrapy cannot know before calling a method whether it's universal (takes both sync and async iterables) or not (takes async iterables only, which is the case if `inspect.isasyncgenfunction()` returns true), and the only solution I see is adding some flag to the function object (which in practice requires a decorator) or maybe renaming the function (see the sketch after this list).
For 11, as mentioned earlier we cannot ship any helpers in Scrapy as the 3rd party middleware code needs to work with old Scrapy too, but we can provide snippets in docs.
For 12, we need to research what benefits making a new method name will give us and whether it will cause problems in the future when dropping sync support.
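A minimal sketch of the flag idea from point 10; the decorator and attribute names here are hypothetical, not from this PR:

```python
import inspect

def universal(func):
    # Mark a middleware method as accepting both sync and async iterables.
    func._universal = True  # hypothetical marker attribute
    return func

def accepts_async_iterables(method) -> bool:
    # What Scrapy could check before calling the method:
    return getattr(method, "_universal", False) or inspect.isasyncgenfunction(method)
```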
> For 10, there is one problem: Scrapy cannot know before calling a method whether it's universal (takes both sync and async iterables) or not (takes async iterables only, which is the case if `inspect.isasyncgenfunction()` returns true), and the only solution I see is adding some flag to the function object (which in practice requires a decorator) or maybe renaming the function. […]
>
> For 12, we need to research what benefits making a new method name will give us and whether it will cause problems in the future when dropping sync support.
What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with `_sync`? With this approach, we would be able to tell before run time whether the code is universal. And we would not be switching names going forward; only universal implementations would use the alternative name, to be deprecated once we drop support for the sync way (in Scrapy 3?).
> The middleware doesn't require async but wants to benefit from it if it's available - this is interesting as there can be non-obvious solutions here (like switching to async code only if it's detected as supported etc.)
I assume here we would go for the path of least downgrades/upgrades. For example, if the chain of middlewares contains one or more universal middlewares between 2 sync middlewares, the sync version of those universal middlewares would be used, to save on conversions. However, in any other scenario, the async version would be used. Does that make sense?
Whatever logic we decide on, I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.
Thinking about that, I’m now wondering if we should allow half-implemented middlewares: middlewares that provide universal code for 1 method and sync or async for another, for example. I’m thinking this is something we may want to prevent (i.e. halt execution if found), to keep our downgrade/upgrade logic as simple as possible.
> What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with `_sync`?
That's definitely better than adding an `_async`-suffixed method; I'll think about the details of this.
> For example, if the chain of middlewares contains one or more universal middlewares between 2 sync middlewares, the sync version of those universal middlewares would be used, to save on conversions. However, in any other scenario, the async version would be used.
Well yeah, my current implementation passes anything as-is to the universal middlewares, and they don't do any conversion themselves. But I'm not sure that what you wrote applies to the case you quoted, as I meant that this case could use something other than the code we call "universal" (dispatching on the type of the variable passed). But maybe this just complicates things and we should not discuss this case separately.
> I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.
This is an interesting point, and doing it unfortunately adds complexity, as you need to store how you processed the first methods.
> I’m thinking this is something we may want to prevent (i.e. halt execution if found), to keep our downgrade/upgrade logic as simple as possible.
This should indeed be much simpler.
> What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with `_sync`?
>
> That's definitely better than adding an `_async`-suffixed method; I'll think about the details of this.
On second thought, for old Scrapy versions, `_async` needs to be the way (old Scrapy can ignore that, but it cannot use `_sync`). But it does not change much: `_async` would only be used for universal middlewares; new-way-only code would use `async def <regular name>`.
Then, if it's implemented, there will be the following options available:

1. `def process_foo`. Just an old-style middleware, including ones that aren't updated. The argument will be downgraded if needed.
2. `async def process_foo`. Doesn't work with old Scrapy (I should check what exactly happens, then we'll put that into the docs). The argument will be upgraded if needed.
3. `def process_foo` + `async def process_foo_async`. A temporary measure; after the migration the second method will be deprecated and will then result in an error. Which method will be called depends on the argument. Old Scrapy will ignore the second method.

This looks like it will work, though I'm not sure how we should store the method pairs in the `MiddlewareManager.methods` dict.
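A minimal sketch of option 3, a universal spider middleware under this naming scheme (the middleware itself is made up for illustration):

```python
class UniversalMiddleware:
    # Sync variant: used for sync iterables; the only method old Scrapy sees.
    def process_spider_output(self, response, result, spider):
        for item in result:
            yield item

    # Async variant: picked when the argument is an async iterable;
    # old Scrapy versions simply ignore the extra method.
    async def process_spider_output_async(self, response, result, spider):
        async for item in result:
            yield item
```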
> I'm not sure how we should store the method pairs in the `MiddlewareManager.methods` dict.
Maybe we determine which method should be used in the case of universal middlewares, and then place only that method? At run time, the code can check whether the method is async or not, and upgrade/downgrade as needed.

I don’t think the middleware manager classes are exposed through settings, so I think we have quite some freedom here. So, for example, instead of checking every time at run time with `inspect` or similar, we could do this when filling `methods`, and make `methods` a `List[Tuple[callable, bool]]` instead of `List[callable]`.
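A rough sketch of that idea, assuming a simplified stand-in for the real `methods` structure:

```python
import inspect
from typing import Callable, List, Tuple

methods: List[Tuple[Callable, bool]] = []

def add_method(method: Callable) -> None:
    # Decide once, at registration time, whether the method is async,
    # instead of re-inspecting it on every call.
    is_async = inspect.iscoroutinefunction(method) or inspect.isasyncgenfunction(method)
    methods.append((method, is_async))
```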
> Maybe we determine which method should be used in the case of universal middlewares

I don't think this is possible. If you mean examining the type of the previous method in the stack, then for the very first method you need to examine the iterable you get from the callback.
> Whatever logic we decide on, I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.

Actually, downloader middlewares are out of scope (they could always return Deferreds, so they can already be `async def`), and spider middlewares don't have paired methods: `_process_spider_input` doesn't take an iterable.
I have a problem with implementing downgrading. Currently all that code is synchronous, and without downgrading it can be kept synchronous (we don't unroll iterables there, just wrap them in other iterables; the actual unrolling, including waiting on async ones, happens later). Downgrading is essentially waiting until an async generator returns all its values, which needs to happen here somewhere. As the entry point for this machinery (`scrape_response()`) returns a Deferred, this should technically be possible, but it will require changing all of those methods in that file to be async (most likely with `@inlineCallbacks`) as all of them call each other. All of this would also have to be rolled back when removing downgrading support.
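For illustration, a minimal sketch of what downgrading amounts to (the helper name is mine, not from the PR): exhaust the async iterable and deliver the collected values through a Deferred.

```python
from twisted.internet.defer import Deferred, ensureDeferred

def downgrade_async_iterable(aiterable) -> Deferred:
    # Fires with a plain (sync-iterable) list once the async iterable
    # is exhausted — exactly the "waiting" that has to happen somewhere.
    async def collect() -> list:
        return [item async for item in aiterable]
    return ensureDeferred(collect())
```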
As the problem with this is `_process_spider_output` and `_process_spider_exception` calling each other (because of the feature of returning iterables from `process_spider_exception`), and other things calling `_process_spider_exception`, the cycle seems to be easily broken by not waiting on `_process_spider_output` in `_process_spider_exception`, and either taking the result if it's available immediately or raising an exception. This, if I understand correctly, corresponds to something like ~~"if some `process_spider_exception` method returns an async iterable, the next middleware needs to be async-aware"~~ "async iterables returned from callbacks and `process_spider_output` can be downgraded but ones returned from `process_spider_exception` can't", which I think is fine when documented. Now I need to make sure this indeed works.
Shouldn't we also update `CrawlSpider._parse_response`, `CrawlSpider._requests_to_follow`, `LxmlParserLinkExtractor._extract_links`, etc. to support coroutines and async generators?
@auxsvr On one hand, we haven't thought about this, and doing it shouldn't be necessary to merge this feature, as existing CrawlSpider-based spiders should continue to work and this support can be added separately (if you suspect this is not true, please voice your suspicions). On the other hand, it should be possible to convert any CrawlSpider-based spider to a normal one if one wants to use async callbacks (see the sketch below).
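For the second point, a minimal sketch of such a conversion (a made-up spider): a plain Spider that follows links in an async callback instead of relying on CrawlSpider rules.

```python
import scrapy

class ConvertedSpider(scrapy.Spider):
    name = "converted"
    start_urls = ["https://example.com"]

    async def parse(self, response):
        # Follow links manually, as CrawlSpider rules would have done.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```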
A fix for the typing issues: https://github.com/wRAR/scrapy/pull/2 (and https://github.com/wRAR/scrapy/pull/2/commits/56e2eeac1066cd2d3c710b76a3d6210b3ac257f5 specifically)
Tremendous work, thanks @wRAR!
This PR adds support for defining `async def parse_*` and using both `yield` and `await` inside, with or without the asyncio reactor. There is one caveat: if the asyncio reactor is enabled, you cannot await on Deferreds directly and need to wrap them. I've added this to the docs.

This change was tested on two real-world projects; for one of them I also replaced using `inline_requests` to get HTTP responses inside a callback with `treq` and, alternatively, `aiohttp`. I didn't test the performance, but as far as I can see the existing non-async callbacks shouldn't have any significant code path changes, so there should be no regressions, and if there are any perf problems with async callbacks we can solve them later. This doesn't require or support `async def start_requests`.

My main concern with this is the large and non-intuitive `scrapy.utils.defer._AsyncCooperatorAdapter`. I would like to replace it with something better if that's possible, but for now it seems to work correctly. I've tried to document it as comprehensively as possible.

I've split it into several commits with different features in case this helps with review.
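For the Deferred caveat above, a minimal sketch of the wrapping, using Twisted's `Deferred.asFuture()` (the callback body is made up for illustration):

```python
import asyncio

from twisted.internet import reactor
from twisted.internet.task import deferLater

async def parse(self, response):
    # A Deferred standing in for real asynchronous work:
    d = deferLater(reactor, 1.0, lambda: "done")
    # Under the asyncio reactor you cannot `await d` directly;
    # wrap it into an asyncio Future first:
    result = await d.asFuture(asyncio.get_event_loop())
```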