scrapinghub / scrapy-poet

Page Object pattern for Scrapy
BSD 3-Clause "New" or "Revised" License

`scrapy parse` is broken #39

Closed alexander-matsievsky closed 10 months ago

alexander-matsievsky commented 4 years ago

Spider code is taken from https://scrapy-poet.readthedocs.io/en/latest/intro/tutorial.html:

import scrapy

class BooksSpider(scrapy.Spider):
    """Crawl and extract books data"""

    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        """Discover book links and follow them"""
        links = response.css('.image_container a')
        yield from response.follow_all(links, self.parse_book)

    def parse_book(self, response):
        """Extract data from book pages"""
        yield {
            'url': response.url,
            'name': response.css("title::text").get(),
        }

from web_poet.pages import ItemWebPage
from scrapy_poet import callback_for

class BookPage(ItemWebPage):
    """Individual book page on books.toscrape.com website, e.g.
    http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
    """

    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }

class BooksScrapyPoetSpider(scrapy.Spider):
    """Crawl and extract books data"""

    name = 'books_scrapy_poet'
    start_urls = ['http://books.toscrape.com/']
    parse_book = callback_for(BookPage)  # extract items from book pages

    def parse(self, response):
        """Discover book links and follow them"""
        links = response.css('.image_container a')
        yield from response.follow_all(links, self.parse_book)

Passing

scrapy parse --spider=books -c parse_book http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': '\n    A Light in the Attic | Books to Scrape - Sandbox\n',
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}]

# Requests  -----------------------------------------------------------------
[]

Failing

scrapy parse --spider=books_scrapy_poet -c parse_book http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2020-09-15 10:09:37 [scrapy.core.scraper] ERROR: Spider error processing <GET http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: None)
Traceback (most recent call last):
  File "/home/.../.miniconda3/envs/.../lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/.../.miniconda3/envs/.../lib/python3.8/site-packages/scrapy/commands/parse.py", line 191, in callback
    items, requests = self.run_callback(response, cb, cb_kwargs)
  File "/home/.../.miniconda3/envs/.../lib/python3.8/site-packages/scrapy/commands/parse.py", line 115, in run_callback
    for x in iterate_spider_output(callback(response, **cb_kwargs)):
TypeError: parse() missing 1 required keyword-only argument: 'page'
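The error is easier to see with a rough sketch of what `callback_for` produces (an approximation for illustration, not scrapy-poet's exact source): the generated callback takes a keyword-only `page` argument that scrapy-poet's injection machinery normally fills in, but `scrapy parse` invokes the callback with only a response.

```python
# Approximate sketch of callback_for (assumption, not scrapy-poet's code):
# the returned callback requires a keyword-only `page` argument.
def callback_for(page_cls):
    def parse(*args, page, **kwargs):
        yield page.to_item()
    return parse


class FakePage:
    """Stand-in page object for demonstration."""
    def to_item(self):
        return {"name": "A Light in the Attic"}


cb = callback_for(FakePage)

# scrapy-poet's middleware calls it with page=..., which works:
items = list(cb(None, page=FakePage()))

# `scrapy parse` calls cb(response) with no `page` kwarg, raising the
# TypeError from the traceback above:
try:
    cb(None)
except TypeError as exc:
    error = str(exc)
```

Since argument binding happens when the generator function is called, the `TypeError` is raised immediately, before the response body is ever touched.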
victor-torres commented 4 years ago

The `scrapy parse` command hides the original callback from us: it wraps the original callback in the `Command.prepare_request` method. The solution would be to leave clues about the original spider and callback in the request's meta, and to query that information when getting the callback in scrapy-poet's `utils.get_callback` function.

Proposed solution

Scrapy's diff

diff --git a/scrapy/commands/parse.py b/scrapy/commands/parse.py
index 83ee074d..21dba3e8 100644
--- a/scrapy/commands/parse.py
+++ b/scrapy/commands/parse.py
@@ -215,7 +215,7 @@ class Command(BaseRunSpiderCommand):
             request.cb_kwargs.update(opts.cbkwargs)

         request.meta['_depth'] = 1
-        request.meta['_callback'] = request.callback
+        request.meta['_callback'] = request.callback or opts.callback
         request.callback = callback
         return request

scrapy-poet's diff

diff --git a/scrapy_poet/utils.py b/scrapy_poet/utils.py
index cab27c2..0ecc6e7 100644
--- a/scrapy_poet/utils.py
+++ b/scrapy_poet/utils.py
@@ -31,9 +31,12 @@ _SCRAPY_PROVIDED_CLASSES = {

 def get_callback(request, spider):
     """Get request.callback of a scrapy.Request, as a callable."""
-    if request.callback is None:
-        return getattr(spider, 'parse')
-    return request.callback
+    callback = request.meta.get('_callback', request.callback)
+    if not callable(callback):
+        callback = getattr(spider, callback or 'parse')
+
+    return callback
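A quick self-contained check of the patched lookup logic, using dummy stand-ins for the request and spider (not real Scrapy objects): a string clue left in `meta['_callback']` is resolved against the spider, a callable passes through unchanged, and the absence of both falls back to `parse`.

```python
class FakeSpider:
    """Stand-in spider with the two callbacks used below."""
    def parse(self, response): ...
    def parse_book(self, response): ...


class FakeRequest:
    """Minimal stand-in for scrapy.Request."""
    def __init__(self, callback=None, meta=None):
        self.callback = callback
        self.meta = meta or {}


def get_callback(request, spider):
    """The proposed lookup: prefer meta['_callback'], resolve strings on the spider."""
    callback = request.meta.get('_callback', request.callback)
    if not callable(callback):
        callback = getattr(spider, callback or 'parse')
    return callback


spider = FakeSpider()
# String clue left by `scrapy parse` is resolved to the spider method:
assert get_callback(FakeRequest(meta={'_callback': 'parse_book'}), spider) == spider.parse_book
# An ordinary bound callback passes through unchanged:
assert get_callback(FakeRequest(callback=spider.parse), spider) == spider.parse
# No callback at all falls back to spider.parse:
assert get_callback(FakeRequest(), spider) == spider.parse
```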

 class DummyResponse(Response):

Let me know what you guys think about it and I'll create the pull requests if there's a green light for it. @ivanprado @kmike @ejulio

kmike commented 4 years ago

It actually looks like a Scrapy issue if the new callback doesn't have the same signature; it seems this should ideally be fixed on the Scrapy side, not worked around on the scrapy-poet side.

victor-torres commented 4 years ago

Indeed, @kmike. Do you have any strategy in mind? Because I'm not sure it would be that easy to keep the original signature without investing too much time on it.

ivanprado commented 4 years ago

At this moment I don't know how it would be possible to copy the signature from one function to another. Let's say you have:

def original_fn(v1: int, v2: str):
  pass

And you have a function that does something else apart from calling the original one:

def modified_fn(*args, **kwargs):
  do_something()
  original_fn(*args, **kwargs)

And the question is: would it be possible to "copy" the signature from original_fn to modified_fn? I don't have an answer. A look at the inspect.signature function (see here: https://github.com/python/cpython/blob/master/Lib/inspect.py#L2232) reveals that the question may not be simple.
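The problem is easy to reproduce: a plain `*args, **kwargs` wrapper reports only a generic signature, so nothing downstream can recover the original parameters from it.

```python
import inspect

def original_fn(v1: int, v2: str):
    pass

def modified_fn(*args, **kwargs):
    original_fn(*args, **kwargs)

# The wrapper's introspectable signature carries no trace of (v1, v2):
sig = str(inspect.signature(modified_fn))
```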

victor-torres commented 4 years ago

+1

As far as I could see, there are a couple of options:

In both cases, I think it's an overly complicated solution that would be difficult to get through Scrapy's review. Like, how would we justify such a change?

I think leaving a clue about the original callback in Scrapy's parse command, and consulting it when getting a callback in our library, is a feasible and succinct solution if well documented.

ivanprado commented 4 years ago

I think I may have found the solution: the update_wrapper function (https://docs.python.org/2/library/functools.html#functools.update_wrapper).

See the following:

from inspect import signature
from functools import update_wrapper

def original_fn(v1: int, v2: str):
    print(f"{v1} {v2}")

def do_something():
    print("something")

def modified_fn(*args, **kwargs):
    do_something()
    original_fn(*args, **kwargs)

>>> signature(modified_fn)
<Signature (*args, **kwargs)>
>>> update_wrapper(modified_fn, original_fn)
<function __main__.original_fn(v1: int, v2: str)>
>>> signature(modified_fn)
<Signature (v1: int, v2: str)>
>>> modified_fn(1, 2)
something
1 2

So maybe it is a matter of calling update_wrapper somewhere in the Scrapy code to update the signature. @victor-torres what do you think?

victor-torres commented 4 years ago

Sounds like an option! Let me test here with my snippet.

victor-torres commented 4 years ago

Initial tests show it works, @ivanprado! 🚀

# Match the new callback signature with the original one
import functools
original_callback = getattr(spider, opts.callback)
functools.update_wrapper(callback, original_callback)

The thing is that the current code uses the response to determine the actual callback, and the response is received by the callback itself... I'm not sure it's completely safe to stick to the request, since we're using the parse command and not a "real world" spider. Would you mind double-checking it? https://github.com/scrapy/scrapy/blob/master/scrapy/commands/parse.py#L159

victor-torres commented 4 years ago

Another alternative would be to check whether opts.callback is defined before trying to update the wrapper, but I don't think that would be a nice solution, since the default callback (parse) could also run into the same problem...

ivanprado commented 4 years ago

@victor-torres I'm not entirely sure about what I'm going to say, but this is how I see it: 1) I would move part of the callback discovery to this line: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/parse.py#L218

In particular, I would check there whether it comes from the options; otherwise, I would check whether it is callable, and if it is a string, I would try to obtain it from the spider. At that point you would have a callback that you can use with update_wrapper to copy the signature. If you don't have anything, then you don't run update_wrapper.

2) I would update the callback. It would no longer handle most of the callback discovery, only the part for the rules, which seems to come from the response. Note: I have no idea what this rules thing is. Do you know what it is? Do you think it could be used in combination with poet callbacks?

That's all I can say. I don't have a deep understanding of this part of the code, so this is as far as I can go by now. How do you see it @victor-torres ?

victor-torres commented 4 years ago

I'm not sure either. Maybe @kmike has a better understanding of this piece of code. I suspect we would need the response anyway.