scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Support for async callbacks #4978

Closed: wRAR closed this 1 year ago

wRAR commented 3 years ago

This PR adds support for defining async def parse_* and using both yield and await inside, with or without the asyncio reactor. There is one caveat: if the asyncio reactor is enabled, you cannot await on Deferreds directly and need to wrap them. I've added this to the docs.
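
To illustrate, a minimal sketch of such a callback, assuming the asyncio reactor is enabled (TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor") and that a wrapping helper like scrapy.utils.defer.deferred_to_future is available; the URLs and the deferLater call are purely illustrative:

    import asyncio

    import scrapy
    from scrapy.utils.defer import deferred_to_future  # assumed helper for wrapping Deferreds


    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com"]  # illustrative URL

        async def parse(self, response):
            # Plain asyncio awaits work inside the coroutine callback.
            await asyncio.sleep(0.1)
            # With the asyncio reactor, a Deferred cannot be awaited directly
            # and has to be wrapped first.
            from twisted.internet import reactor, task
            d = task.deferLater(reactor, 0.1, lambda: "done")
            result = await deferred_to_future(d)
            # yield still works in the same callback, for items and requests.
            yield {"url": response.url, "result": result}
            yield scrapy.Request(response.urljoin("/next"), callback=self.parse)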

This change was tested on two real-world projects; for one of them I also replaced inline_requests (used to get HTTP responses inside a callback) with treq and, alternatively, aiohttp. I didn't test the performance, but as far as I can see the existing non-async callbacks shouldn't have any significant code path changes, so there should be no regressions; if there are any performance problems with async callbacks we can solve them later.
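
Roughly, the aiohttp variant of such a callback looks like this (a sketch with illustrative names and URLs, not the actual project code; it needs the asyncio reactor):

    import aiohttp
    import scrapy


    class ApiEnrichSpider(scrapy.Spider):
        name = "api_enrich"  # illustrative name
        start_urls = ["https://example.com"]  # illustrative URL

        async def parse(self, response):
            # The kind of extra HTTP request inline_requests used to be needed
            # for, done directly inside the callback instead.
            async with aiohttp.ClientSession() as session:
                async with session.get("https://example.com/api") as api_response:
                    data = await api_response.json()
            yield {"page": response.url, "api": data}

Note that such requests bypass Scrapy's downloader (middlewares, throttling, logging), which is the drawback raised further below.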

This doesn't require or support async def start_requests.

My main concern with this is the large and non-intuitive scrapy.utils.defer._AsyncCooperatorAdapter. I would like to replace it with something better if that's possible, but for now it seems to work correctly. I've tried to document it as comprehensively as possible.

I've split it into several commits with different features, in case this helps someone with the review.

codecov[bot] commented 3 years ago

Codecov Report

Merging #4978 (8659bf4) into master (737f396) will increase coverage by 0.14%. The diff coverage is 98.44%.

:exclamation: Current head 8659bf4 differs from pull request most recent head c7b90c6. Consider uploading reports for the commit c7b90c6 to get more accurate results

@@            Coverage Diff             @@
##           master    #4978      +/-   ##
==========================================
+ Coverage   88.70%   88.85%   +0.14%     
==========================================
  Files         162      162              
  Lines       10787    10962     +175     
  Branches     1851     1894      +43     
==========================================
+ Hits         9569     9740     +171     
  Misses        942      942              
- Partials      276      280       +4     
Impacted Files Coverage Δ
scrapy/spidermiddlewares/urllength.py 88.00% <90.90%> (-2.48%) :arrow_down:
scrapy/spidermiddlewares/offsite.py 98.41% <94.11%> (-1.59%) :arrow_down:
scrapy/spidermiddlewares/depth.py 97.67% <95.45%> (-2.33%) :arrow_down:
scrapy/core/spidermw.py 99.44% <99.08%> (-0.56%) :arrow_down:
scrapy/core/scraper.py 85.79% <100.00%> (+0.24%) :arrow_up:
scrapy/middleware.py 92.85% <100.00%> (ø)
scrapy/spidermiddlewares/referer.py 93.24% <100.00%> (+0.13%) :arrow_up:
scrapy/utils/asyncgen.py 100.00% <100.00%> (ø)
scrapy/utils/defer.py 97.29% <100.00%> (+1.33%) :arrow_up:
scrapy/utils/python.py 88.58% <100.00%> (+0.93%) :arrow_up:
... and 2 more
wRAR commented 3 years ago

Not sure why some related tests fail on 3.6 asyncio-pinned; I cannot reproduce this locally at this time.

wRAR commented 3 years ago

Now I can reproduce them on a freshly built 3.6.12 while it worked on my old 3.6.9, both from pyenv.

wRAR commented 3 years ago

It's Twisted 17.9.0-specific again.

wRAR commented 3 years ago

pytest-twisted installs the asyncio reactor with eventloop=None, so on Twisted 17.9.0 it uses new_event_loop(). We added a workaround for this in Scrapy itself, in #4841 / #4872, but it's much harder to do for pytest-twisted, unless they add some support for that (or we drop 17.9.0). It's probably possible to install the reactor in conftest.py though.
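
Something along these lines in conftest.py should work (a rough sketch, untested):

    # conftest.py
    import asyncio

    from twisted.internet import asyncioreactor

    # Install the asyncio reactor with an explicit event loop before anything
    # imports twisted.internet.reactor, so that Twisted 17.9.0 doesn't fall
    # back to new_event_loop() as described above.
    asyncioreactor.install(eventloop=asyncio.get_event_loop())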

Gallaecio commented 3 years ago

_AsyncCooperatorAdapter coverage is quite good, but it does not seem complete yet. 2 out of 3 seem like error scenarios that should not usually happen, but I wonder why __iter__ is never called; maybe it is not needed?

I see there are also a few uncovered paths in the new _process_iterable_universal decorator. Similarly, all seem like error scenarios which should not usually happen.

If testing those error scenarios is complicated, I’m OK with not covering them for now. But if it’s trivial it would be best to go for it.

wRAR commented 3 years ago

Yeah, I checked the coverage for _AsyncCooperatorAdapter when writing tests. These two exceptions are just checks for things that should never happen, so I don't know how to test them. Initially I had many more asserts in that class but removed most of them; I kept these two because I was afraid that these things wouldn't be noticed without an explicit check. On the other hand, d.callback() already checks for the same thing, which makes 1.5 of these 2 checks redundant.

As for def __iter__, that's just the iterable protocol, but as Cooperator.coiterate says it takes an iterator, we can drop it.

I'll look at _process_iterable_universal coverage.

Gallaecio commented 3 years ago

:tada:

Who’s the next brave reviewer who dares have a look at this patch?

pawelmhm commented 3 years ago

One other problem I see is that treq requests are not visible in the logs; this seems like a drawback compared with inline_requests. Is there some way to integrate the approach from this PR with Scrapy logging?

pawelmhm commented 3 years ago

Ideally I would really love to see something like this in Scrapy someday:

    async def parse(self, response):
        res = await Request(some_url)
        # res.body available here

i.e. native integration, no treq, so that everything from my spider settings, like proxy settings and headers, is applied to the Request, and it goes smoothly, integrated with Scrapy like a charm.

wRAR commented 3 years ago

Yeah, I'd say using this to run HTTP requests is more of a side effect; one day we indeed want some replacement for inline_requests (there was an old ticket about integrating it).

kmike commented 3 years ago

@pawelmhm I think the goal of this PR is to make it possible to have async def parse methods, and enable future improvements like having a built-in way to await Scrapy Requests. In the current form it allows you to use asyncio and Twisted libraries in callbacks much more easily, which is already pretty cool :)

wRAR commented 2 years ago

Moving the middleware discussion to the top level.

These seem to be set in stone:

  1. We want user projects with all-sync code to still work, and not even produce warnings, and ideally use the same code paths as before.
  2. Similar for old 3rd-party middlewares (so all-sync too).
  3. If the user project switches some callback to async we are fine with requiring modification of user middlewares in that project and/or emitting warnings for code in that project.
  4. Whatever the user does, we shouldn't mandate things that are incompatible with each other or in some other way cause a user project to not be able to convert.
  5. "Simple" 3rd-party middlewares (ones that don't use async features and maybe ones that can optionally use them for things like performance or optional features, so a majority of them) should be able to maintain code that works with both old Scrapy and with new Scrapy+async user code.
  6. If a user project has all code as async and no "legacy" (whatever that means) 3rd-party middlewares it should work without things like implicit downgrades.

Now the discussion points (numbered continuing from the list above):

  7. What to do with builtin middlewares: keep them sync, make them async, make them universal?
  8. What to do with 3rd party middlewares in various cases? There may be several separate things here depending on the middleware.
  9. Is there a path to the future where all code is async or does it require breaking so much that we don't need to think about it now?
  10. If we want some universal middlewares, how to handle them in Scrapy?
  11. How to help people with writing universal middlewares (user or 3rd party)?
  12. Is it possible to make some of this easier by e.g. providing a separate async method instead of trying to use the same name for both kinds?

For 7, keeping them sync is not an option, as that will cause downgrades in the scenario from 6. Making them async, as mentioned above, will cause upgrades and downgrades too, at least if a sync middleware is present, which breaks 1 as it will produce warnings; also the code paths will be more complicated, and we don't want surprises for large projects that just upgrade the Scrapy version. So it looks like we want to make them universal. This doesn't seem to cause any problems except somewhat complicating the middleware code and requiring actual support for universal middlewares in Scrapy.

For 8, we'll need to enumerate the cases to discuss them separately.

For 9, we'll need to think some more about it, but so far it looks like the move will be smooth, with sync support being dropped in various places.

For 10, there is one problem: Scrapy cannot know before calling a method whether it's universal (takes both sync and async iterables) or not (a non-universal method takes async iterables only if it's an async generator function per inspect.isasyncgenfunction), and the only solution I see is adding some flag to the function object (which in practice requires a decorator), or maybe renaming the function.
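
To illustrate the flag idea (names here are made up for the sketch, nothing from the actual PR):

    import inspect


    def universal(func):
        # Hypothetical decorator: marks a middleware method as accepting both
        # sync and async iterables.
        func._universal = True
        return func


    def method_kind(method):
        # Without such a flag, the only thing Scrapy can tell before calling
        # the method is whether it is an async generator function.
        if getattr(method, "_universal", False):
            return "universal"
        if inspect.isasyncgenfunction(method):
            return "async-only"
        return "sync-only"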

For 11, as mentioned earlier we cannot ship any helpers in Scrapy, as the 3rd party middleware code needs to work with old Scrapy too, but we can provide snippets in the docs.

For 12, we need to research what benefits making a new method name will give us, and whether it will cause problems in the future when dropping sync support.

Gallaecio commented 2 years ago

For 10, there is one problem: Scrapy cannot know before calling a method whether it's universal (takes both sync and async iterables) or not (a non-universal method takes async iterables only if it's an async generator function per inspect.isasyncgenfunction), and the only solution I see is adding some flag to the function object (which in practice requires a decorator), or maybe renaming the function.

[…]

For 12, we need to research what benefits making a new method name will give us, and whether it will cause problems in the future when dropping sync support.

What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with _sync? With this approach, we would be able to tell before run time if the code is universal. And we would not be switching names going forward; only the universal implementation would use the alternative name, to be deprecated once we drop support for the sync way (in Scrapy 3?).

Gallaecio commented 2 years ago

The middleware doesn't require async but wants to benefit from it if it's available - this is interesting as there can be non-obvious solutions here (like switching to async code only if it's detected as supported etc.)

I assume here we would go for the path of least downgrades/upgrades. For example, if the chain of middlewares contains one or more universal middlewares between 2 sync middlewares, the sync version of those universal middlewares would be used, to save on conversions. However, in any other scenario, the async version would be used. Does that make sense?

Whatever logic we decide on, I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.

Thinking about that, I’m now wondering if we should allow half-implemented middlewares: middlewares that provide universal code for 1 method and sync or async for another, for example. I’m thinking this is something we may want to prevent (i.e. halt execution if found), to keep our downgrade/upgrade logic as simple as possible.

wRAR commented 2 years ago

What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with _sync?

That's definitely better than adding an _async-suffixed method; I'll think about the details of this.

For example, if the chain of middlewares contains one or more universal middlewares between 2 sync middlewares, the sync version of those universal middlewares would be used, to save on conversions. However, in any other scenario, the async version would be used.

Well yeah, my current implementation passes anything as-is to the universal middlewares, and they don't do any conversion themselves. But I'm not sure if what you wrote applies to the case you quoted, as I meant that this case could use something other than the code we call "universal" (dispatching on the type of the variable passed). But maybe this just complicates things and we should not discuss this case separately.

I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.

This is an interesting point, and doing this unfortunately adds complexity, as you need to store how you processed the earlier methods.

I’m thinking this is something we may want to prevent (i.e. halt execution if found), to keep our downgrade/upgrade logic as simple as possible.

This should indeed be much simpler.

Gallaecio commented 2 years ago

What if we allow defining the method as sync (old way), as async (new way), and for universal we go with defining the method as async but also defining a sync counterpart method with a different name, e.g. suffixed with _sync?

That's definitely better than adding an _async-suffixed method; I'll think about the details of this.

On second thought, for old Scrapy versions, _async needs to be the way (old Scrapy can ignore that, but it cannot use _sync).

But it does not change much. _async would only be used for universal code; new-way-only code would use async def <regular name>.

wRAR commented 2 years ago

Then, if it's implemented, there will be the following options available:

  1. Only def process_foo. Just an old-style middleware, including ones that aren't updated. The argument will be downgraded if needed.
  2. Only async def process_foo. Doesn't work with old Scrapy (I should check what exactly happens, then we'll put that into the docs). The argument will be upgraded if needed.
  3. def process_foo + async def process_foo_async. A temporary measure; after the migration the second method will be deprecated and then result in an error. Which method will be called depends on the argument. Old Scrapy will ignore the second method. (A sketch follows below.)
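
A sketch of what option 3 could look like for a spider middleware, with trivial placeholder bodies, just to illustrate the naming scheme:

    class UniversalSpiderMiddleware:
        # Sync variant: used when the previous step produced a regular
        # iterable, and the only method old Scrapy versions would call.
        def process_spider_output(self, response, result, spider):
            for o in result:
                yield o

        # Async variant with the suffixed name: used when the previous step
        # produced an async iterable; ignored by old Scrapy versions.
        async def process_spider_output_async(self, response, result, spider):
            async for o in result:
                yield o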

This looks like it will work, though I'm not sure how we should store the method pairs in the MiddlewareManager.methods dict.

Gallaecio commented 2 years ago

I'm not sure how should we store the method pairs in the MiddlewareManager.methods dict.

Maybe we determine which method should be used in the case of universal middlewares, and then store only that method? At run time, the code can check whether the method is async or not, and upgrade/downgrade as needed.

I don't think the middleware manager classes are exposed through settings, so I think we have quite some freedom here. So, for example, instead of checking every time at run time with inspect or similar, we could do this when filling methods, and make methods a List[Tuple[Callable, bool]] instead of a List[Callable].
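
Roughly like this (a sketch, not the actual MiddlewareManager code; names are illustrative):

    import inspect
    from collections import defaultdict
    from typing import Callable, Tuple

    # Sketch only: record once, while filling the dict, whether each
    # middleware method is an async generator function, instead of
    # re-checking with inspect on every call.
    methods = defaultdict(list)


    def add_method(name: str, method: Callable) -> None:
        entry: Tuple[Callable, bool] = (method, inspect.isasyncgenfunction(method))
        methods[name].append(entry)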

wRAR commented 2 years ago

Maybe we determine which method should be used in the case of universal middlewares

I don't think this is possible. If you mean examining the type of the previous method in the stack, then for the very first method you need to examine the iterable you get from the callback.

wRAR commented 2 years ago

Whatever logic we decide on, I think it may be important to ensure consistent method choice in universal middlewares. Given a universal downloader middleware, the same method type should be used in both directions: if the async process_request is used, the async process_response should be used.

Actually, downloader middlewares are out of scope (they could always return Deferreds, so they can already be async def), and spider middlewares don't have paired methods: _process_spider_input doesn't take an iterable.

wRAR commented 2 years ago

I have a problem with implementing downgrading. Currently all that code is synchronous, and without downgrading it can be kept synchronous (we don't unroll iterables there, we just wrap them in other iterables; the actual unrolling, including waiting on async ones, happens later). Downgrading is essentially waiting until an async generator has produced all of its values, which needs to happen somewhere here. As the entry point for this machinery (scrape_response()) returns a Deferred, this should technically be possible, but it will require changing all of those methods in that file to be async (most likely with @inlineCallbacks), as they all call each other. All of this would also need to be rolled back when removing downgrading support.
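
For reference, the core of such a downgrade is just exhausting the async iterable and firing a Deferred with the collected values, roughly like this (assuming the collect_asyncgen and deferred_from_coro helpers from scrapy.utils):

    from scrapy.utils.asyncgen import collect_asyncgen
    from scrapy.utils.defer import deferred_from_coro


    def downgrade_to_deferred(async_iterable):
        # Exhaust the async iterable and fire a Deferred with the resulting
        # list, so that fully synchronous code further down the chain can
        # keep operating on a plain (finite) iterable.
        return deferred_from_coro(collect_asyncgen(async_iterable))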

wRAR commented 2 years ago

As the problem with this is _process_spider_output and _process_spider_exception calling each other (because of the feature of returning iterables from process_spider_exception), plus other things calling _process_spider_exception, the cycle seems easy to break by not waiting on _process_spider_output in process_spider_exception and either taking the result if it's available immediately or raising an exception. If I understand correctly, this corresponds to something like "if some process_spider_exception method returns an async iterable, the next middleware needs to be async-aware", or "async iterables returned from callbacks and process_spider_output can be downgraded, but ones returned from process_spider_exception can't", which I think is fine when documented. Now I need to make sure this indeed works.

auxsvr commented 1 year ago

Shouldn't we also update CrawlSpider._parse_response, CrawlSpider._requests_to_follow, LxmlParserLinkExtractor._extract_links, etc. to support coroutines and async generators?

wRAR commented 1 year ago

@auxsvr On one hand, we haven't thought about this, and doing it shouldn't be necessary to merge this feature, as existing CrawlSpider-based spiders should continue to work and this support can be added separately (if you suspect this is not true, please voice your suspicions); on the other hand, it should be possible to convert any CrawlSpider-based spider to a normal one if one wants to use async callbacks.

kmike commented 1 year ago

A fix for the typing issues: https://github.com/wRAR/scrapy/pull/2 (and https://github.com/wRAR/scrapy/pull/2/commits/56e2eeac1066cd2d3c710b76a3d6210b3ac257f5 specifically)

kmike commented 1 year ago

Tremendous work, thanks @wRAR!