scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Support for dataclass and attrs items #3881

Closed: elacuesta closed this 3 years ago

elacuesta commented 4 years ago

Fixes #3761, closes #2807

It seems to me that there would be a limitation in how we could use dataclasses with ItemLoaders. Given that all fields are supposed to be provided (except for those with default values, or unless a custom constructor is supplied) and that loaders append values, I think only dataclasses defined like the following would work:

from dataclasses import dataclass, field
from scrapy.loader import ItemLoader

@dataclass
class InventoryItem:
    name: list = field(default_factory=list)
    price: list = field(default_factory=list)
    stock: list = field(default_factory=list)

lo = ItemLoader(item=InventoryItem())
lo.add_value('name', 'foo bar')
print(lo.load_item())  # prints "InventoryItem(name=['foo bar'], price=[], stock=[])"

Thanks in advance for your feedback!


Update

Checklist:

This patch works with the built-in dataclasses module in Python 3.7+, and with the dataclasses backport in Python 3.6.

Update 2: Checklist for the new implementation based on the ItemAdapter class

Update 3: Supporting python-attrs now too. Implementation is based on https://github.com/scrapy/itemadapter.

elacuesta commented 4 years ago

Minimal spider, with feed export support:

from dataclasses import dataclass

from scrapy import Spider

@dataclass
class InventoryItem:
    name: str
    price: float
    stock: int

class DataClassSpider(Spider):
    name = 'example'
    start_urls = ['https://example.org']

    def parse(self, response):
        return InventoryItem('Chair', 120.5, 10)
$ scrapy runspider example.py -o items.json
(...)
2019-07-15 13:16:06 [scrapy.core.engine] INFO: Spider opened
2019-07-15 13:16:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-15 13:16:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-15 13:16:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2019-07-15 13:16:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
InventoryItem(name='Chair', price=120.5, stock=10)
2019-07-15 13:16:06 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-15 13:16:06 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
(...)
$ cat items.json
[
{"name": "Chair", "price": 120.5, "stock": 10}
]
codecov[bot] commented 4 years ago

Codecov Report

Merging #3881 into master will decrease coverage by 0.59%. The diff coverage is 69.23%.

@@            Coverage Diff            @@
##           master    #3881     +/-   ##
=========================================
- Coverage   85.56%   84.96%   -0.6%     
=========================================
  Files         164      164             
  Lines        9565     9566      +1     
  Branches     1435     1434      -1     
=========================================
- Hits         8184     8128     -56     
- Misses       1133     1182     +49     
- Partials      248      256      +8
Impacted Files Coverage Δ
scrapy/utils/python.py 84.89% <100%> (+0.65%) :arrow_up:
scrapy/core/scraper.py 86.57% <50%> (-1.94%) :arrow_down:
scrapy/loader/__init__.py 92.9% <50%> (-2.47%) :arrow_down:
scrapy/exporters.py 99.14% <60%> (-0.86%) :arrow_down:
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) :arrow_down:
scrapy/utils/boto.py 46.66% <0%> (-26.67%) :arrow_down:
scrapy/core/downloader/tls.py 77.5% <0%> (-12.5%) :arrow_down:
scrapy/extensions/feedexport.py 78.44% <0%> (-5.05%) :arrow_down:
scrapy/core/downloader/handlers/http11.py 89.92% <0%> (-2.62%) :arrow_down:
... and 8 more
codecov[bot] commented 4 years ago

Codecov Report

Merging #3881 into master will increase coverage by 0.14%. The diff coverage is 91.25%.

@@            Coverage Diff             @@
##           master    #3881      +/-   ##
==========================================
+ Coverage   84.63%   84.77%   +0.14%     
==========================================
  Files         163      163              
  Lines        9971     9985      +14     
  Branches     1485     1487       +2     
==========================================
+ Hits         8439     8465      +26     
+ Misses       1266     1255      -11     
+ Partials      266      265       -1     
Impacted Files Coverage Δ
scrapy/spiders/feed.py 66.15% <ø> (ø)
scrapy/commands/parse.py 20.21% <33.33%> (ø)
scrapy/contracts/default.py 84.61% <80.00%> (+0.30%) :arrow_up:
scrapy/core/scraper.py 87.97% <80.00%> (ø)
scrapy/shell.py 67.96% <80.00%> (-0.25%) :arrow_down:
scrapy/exporters.py 100.00% <100.00%> (ø)
scrapy/item.py 98.73% <100.00%> (ø)
scrapy/loader/__init__.py 95.54% <100.00%> (-0.03%) :arrow_down:
scrapy/pipelines/files.py 60.64% <100.00%> (+0.71%) :arrow_up:
scrapy/pipelines/images.py 91.81% <100.00%> (+1.16%) :arrow_up:
... and 8 more
kmike commented 4 years ago

Hey! I think this requires more documentation, because it affects spider middlewares: the process_spider_output method can now receive dataclass instances. The same applies to the item_scraped, item_dropped and item_error signal handlers, and to item pipelines.

If I'm not mistaken, most item pipelines will need to be changed. For example, does FilesPipeline work with dataclasses?

It is a bit trickier than with dict vs Item, because dicts and Items both support item["field"] access, while dataclasses don't, so some code which worked with items and dicts without any explicit checks will stop working for dataclasses.

Check e.g. pipeline examples at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline: they should work for items and dicts, but they won't work for dataclasses. We need to make it consistent.
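Purely for illustration (this snippet is not part of the PR, it just restates the problem): a pipeline written against the dict/Item interface, like the ones in the docs, fails on a dataclass instance because dataclasses are not subscriptable:

from dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    price: float

class PricePipeline:
    def process_item(self, item, spider):
        # Works for dicts and scrapy.Item, but a plain dataclass does not
        # implement __getitem__, so this raises TypeError for InventoryItem.
        if item["price"]:
            item["price"] = item["price"] * 1.15
        return item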

As a minimum, I think we need to

  1. Document how to make your components ready for dataclasses.
  2. After documenting (looking at all steps required), try to simplify it, if possible.
elacuesta commented 4 years ago

Thanks for the review @kmike! I added a decorator to allow dictionary-like access to dataclass objects, a FAQ entry about it, and some notes in the signals, pipelines and spider middleware pages; let me know what you think about this approach. I had some problems with the code sample in the docs: sybil.parsers.skip.skip does not seem to be working correctly at 5a97a38 (I also tried with .. skip: start and .. skip: end, to no avail). I ended up ignoring docs/faq.rst, but I'm not a fan of this; please let me know if you know a way to solve it (cc @Gallaecio, the original author of the Sybil integration).
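As context, a minimal sketch of the idea behind such a decorator (not the exact code in this PR): expose the dataclass fields through __getitem__/__setitem__ so existing dict-style code keeps working:

from dataclasses import dataclass, fields

def subscriptable_dataclass(cls):
    # Illustrative sketch only: add dict-like access to a dataclass.
    def __getitem__(self, key):
        if key in (f.name for f in fields(self)):
            return getattr(self, key)
        raise KeyError(key)

    def __setitem__(self, key, value):
        if key in (f.name for f in fields(self)):
            setattr(self, key, value)
        else:
            raise KeyError(key)

    cls.__getitem__ = __getitem__
    cls.__setitem__ = __setitem__
    return cls

@subscriptable_dataclass
@dataclass
class InventoryItem:
    name: str
    price: float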

Gallaecio commented 4 years ago

@elacuesta This seems to do the job: https://github.com/Gallaecio/scrapy/commit/df5a6d2d96d50a997900ce5a9df7f54ab375c709 https://github.com/Gallaecio/scrapy/commit/ea397d34191e6de6212bf53089416d2dc8da0e50

kmike commented 4 years ago

Thanks @elacuesta for the update! It is much more complete now. Remaining concerns:

elacuesta commented 4 years ago

Hey @kmike, thanks for the feedback :smile:

One thing I'd like to discuss is the subscriptable_dataclass decorator: I'd prefer to take the burden of making dataclass items Scrapy-ready away from the user as much as possible, i.e. not requiring them to decorate items in order for built-in components to work properly. With that in mind, the media pipelines now handle dataclass items "natively" (tests updated). This is done by converting items to dicts if necessary in their item processing methods, in order to populate those items with the URLs to be downloaded (without doing something like this, which is kinda hackish IMHO). The implication is that the output of those methods is not of the same type as the input.

Regarding spider middlewares, I added a note to the process_spider_output method docs pointing to the FAQ entry about dataclass items, which explains the implementation and mentions the subscriptable_dataclass decorator, I believe that should be enough.

I checked built-in components such as extensions, pipelines and middlewares; I think nothing else needs updating at this point, but another pair of eyes checking this is always welcome.

Looking forward to hearing your comments, thanks again!

mredaelli commented 4 years ago

Sorry to barge in.

The implication is that the output of those methods is not of the same type as the input.

Not sure if I understand this correctly: do you mean that the dataclass item is replaced by a dict in all the subsequent pipelines, etc.?

If so, my personal preference and suggestion would be not to do that.

Part of the reason I asked for dataclass support was precisely so that I could write pipelines and middlewares that receive a dataclass, with all its methods, validation, type checking with mypy, etc. We have several of those, all currently dealing with loose dicts, and we would love to move them to dataclasses.

If the issue is some missing fields, I wonder if that could be worked around by either asking the user to inherit the dataclass from a base class (as vanilla as possible -- even more than the current Item), or by a different annotation replacing @dataclass which also injects the fields?

elacuesta commented 4 years ago

do you mean that the dataclass item is replaced by a dict in all the subsequent pipelines, etc.?

Indeed, that's what I meant. I don't love the idea either, but I thought it was a reasonable compromise in order to make dataclass items work out of the box. I guess it's not.

This PR introduces a subscriptable_dataclass decorator. I wanted to avoid requiring it, but I think we can require it in a first version and eventually relax this requirement. That said, it currently doesn't support adding new fields to the items; I'll need to change that.

mredaelli commented 4 years ago

I wonder if using a base class would be easier instead -- but I haven't even checked the current implementation.

I say that because:

What do you think?

kmike commented 4 years ago

Another magical option, for discussion: make a "proxy" class which provides dict-like access to objects, but modifies the original object. Make a decorator which wraps input items to this proxy objects, and unwraps them in the output. Users can then use this decorator in their middlewares to add support for dataclasses.
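A rough sketch of how this could look (purely illustrative, the names below are invented here): a wrapper with dict-style access that writes through to the original object, plus a helper that wraps dataclass items on the way into user code:

from dataclasses import is_dataclass

class ItemProxy:
    def __init__(self, obj):
        self.obj = obj

    def __getitem__(self, key):
        return getattr(self.obj, key)

    def __setitem__(self, key, value):
        # Writes through to the wrapped object, so the original item is modified.
        setattr(self.obj, key, value)

def wrap_dataclasses(result):
    # Wrap dataclass items before passing them to user code; the decorator
    # would later unwrap them by taking proxy.obj from the output.
    return (ItemProxy(r) if is_dataclass(r) else r for r in result)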

elacuesta commented 4 years ago

After checking the current implementation, there seems to be no need to do things differently: unless the item is a dict, both pipelines only populate the item with the result of the download if it already contains a field for it; we can do exactly the same with dataclass items.
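Sketched out (with an assumed helper name, not the pipelines' actual code), the equivalent check for dataclass items would be to store the download results only if the item declares a field for them:

from dataclasses import fields, is_dataclass

def store_download_results(item, results, result_field):
    # Mirror the existing behaviour: only populate the field if it exists.
    if is_dataclass(item):
        if result_field in (f.name for f in fields(item)):
            setattr(item, result_field, results)
    elif isinstance(item, dict) or result_field in item.fields:
        item[result_field] = results
    return item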

elacuesta commented 4 years ago

Hey @kmike, thanks for the thorough review. Indeed, seems like I'm missing a lot of things, I'll get to it ASAP. Thanks again!

Gallaecio commented 4 years ago

While working on improving the documentation, I’m starting to think that we need to rethink the API we are going to provide for handling items (or “item-like objects”).

I don’t think we should have components use type-specific functions like is_dataclass_instance or dataclass_asdict. I’m hoping we can come up with something more generic, so that if we support yet another type of item later, components only need to be upgraded to support new features (but not, for example, for reading or writing key-value pairs).

Would it make sense to create an item adapter class that components can use to interact with input items without needing to care about the underlying type?
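For reference, this is roughly the shape the proposal ended up taking (the ItemAdapter class): components work with a wrapper that hides the underlying item type:

from dataclasses import dataclass
from itemadapter import ItemAdapter, is_item

@dataclass
class InventoryItem:
    name: str
    price: float

item = InventoryItem(name="Chair", price=120.5)
adapter = ItemAdapter(item)
adapter["price"] = 99.9        # writes through to the dataclass instance
print(adapter.get("name"))     # Chair
print(is_item(item))           # True for dicts, Items, dataclass and attrs objects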

elacuesta commented 4 years ago

@Gallaecio After our recent conversation I went to update the media pipelines and realized there were a lot more checks in other components as well (the item loader and exporters, for instance), so I'm sold on the item adapter suggestion. I just pushed one commit with said change (91a9a1721c7860603c37797b74bb075fa03b7ca8). Tests are broken because I haven't adapted them yet, but I think the implementation looks much cleaner now. There are no dataclass-related checks other than in the scrapy.item and scrapy.utils.decorators modules. Please take a look and let me know what you think, I'll be updating the tests shortly.

kmike commented 4 years ago

@elacuesta one more idea: add ItemAdapter imports to the Scrapy project template, for middlewares / pipelines.
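For example, the docs-style price pipeline from above, rewritten against ItemAdapter, works unchanged for dicts, Items, dataclass and attrs items (illustrative, not the literal template contents):

from itemadapter import ItemAdapter

class PricePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get("price"):
            adapter["price"] = adapter["price"] * 1.15
        return item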

Gallaecio commented 4 years ago

Given the way I’m modifying the documentation, it may make sense to rename is_item_like to something like is_item or is_item_instance. The concept of “item-like” would disappear, in favor of talking about “item types”.

elacuesta commented 4 years ago

That works for me :+1: I think I like is_item better, short and simple. Should I rename it here or do you plan to do it in the docs PR?

Gallaecio commented 4 years ago

I can do it in the documentation PR.

elacuesta commented 4 years ago

@Gallaecio @kmike @wRAR With #4534 merged and itemadapter available on PyPI, this PR should be ready to be reviewed and merged. Let me know if you have any comments.

kmike commented 4 years ago

@elacuesta itemadapter code could be worth reviewing as well (e.g. there are some things in README which we may need to change). How should we do it, what's your opinion?

elacuesta commented 4 years ago

@kmike Of course, please do review itemadapter as well. If necessary, I think it'd be OK to create issues there and link them here. This PR adds itemadapter>=0.0.6 to install_requires; if you find any issues with the README alone, I think it's fine to just update it, and if code changes are necessary too, I can release a new version and update install_requires accordingly.

kmike commented 3 years ago

@elacuesta itemadapter issues created; I've checked the package itself, checked README briefly, and a few other files. Haven't finished reviewing tests.

kmike commented 3 years ago

That's monumental work @elacuesta.

Last comments:

So, +1 to merge after we get a plan for adapter => dict conversion with nested dataclass/attr.s items. We don't necessarily need to solve it now, just ensure we won't need to change APIs later once we get to this problem.

elacuesta commented 3 years ago

@kmike Bumped the itemadapter dependency and updated all dict(ItemAdapter(item)) occurrences to use ItemAdapter.asdict.
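The difference this addresses (illustrative example): dict(ItemAdapter(item)) is a shallow conversion, while ItemAdapter.asdict() also converts nested item objects:

from dataclasses import dataclass
from itemadapter import ItemAdapter

@dataclass
class Price:
    amount: float
    currency: str

@dataclass
class Product:
    name: str
    price: Price

product = Product(name="Chair", price=Price(amount=120.5, currency="USD"))
print(dict(ItemAdapter(product)))
# {'name': 'Chair', 'price': Price(amount=120.5, currency='USD')}
print(ItemAdapter(product).asdict())
# {'name': 'Chair', 'price': {'amount': 120.5, 'currency': 'USD'}}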

Gallaecio commented 3 years ago

There’s also https://github.com/scrapy/scrapy/pull/3881#discussion_r428598241

I see there have been changes in those lines, but I see no mention of “slotted” in the changes.

elacuesta commented 3 years ago

Merging then, thanks to everyone involved :heart:

To be clear, we discussed the slots thing recently, and agreed that there is no need to mention it. Field names are enforced, in the sense that (un-slotted) attrs classes allow defining arbitrary attributes (as any Python class does), but those are not considered fields:

In [1]: import attr

In [2]: @attr.s 
   ...: class AttrsItem: 
   ...:     foo = attr.ib() 
   ...:

In [3]: i = AttrsItem(foo="bar")

In [4]: i.some_attribute = 1

In [5]: attr.asdict(i)
Out[5]: {'foo': 'bar'}
kmike commented 3 years ago

To be clear, we discussed the slots thing recently, and agreed that there is no need to mention it.

@elacuesta I still think slots need to be mentioned, because that's almost the point of having scrapy.Item: to prevent mistakes. With a non-slotted attrs class, a typo in a field name could be even worse than a typo in a dictionary key, since the data will be silently dropped. scrapy.Item and attr.s with slots prevent this kind of error; we should promote it as a best practice.
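For comparison, the earlier snippet with slots enabled (illustrative continuation, not code from the PR) fails loudly instead of silently dropping data:

import attr

@attr.s(slots=True)
class AttrsItem:
    foo = attr.ib()

i = AttrsItem(foo="bar")
i.some_attribute = 1  # AttributeError: 'AttrsItem' object has no attribute 'some_attribute'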

But this can be done in another PR; that's not a blocker here.