elacuesta closed this pull request 3 years ago
Minimal spider, with feed export support:
from dataclasses import dataclass

from scrapy import Spider


@dataclass
class InventoryItem:
    name: str
    price: float
    stock: int


class DataClassSpider(Spider):
    name = 'example'
    start_urls = ['https://example.org']

    def parse(self, response):
        return InventoryItem('Chair', 120.5, 10)
$ scrapy runspider example.py -o items.json
(...)
2019-07-15 13:16:06 [scrapy.core.engine] INFO: Spider opened
2019-07-15 13:16:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-15 13:16:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-15 13:16:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.org> (referer: None)
2019-07-15 13:16:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.org>
InventoryItem(name='Chair', price=120.5, stock=10)
2019-07-15 13:16:06 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-15 13:16:06 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: items.json
(...)
$ cat items.json
[
{"name": "Chair", "price": 120.5, "stock": 10}
]
Merging #3881 into master will decrease coverage by 0.59%. The diff coverage is 69.23%.
@@ Coverage Diff @@
## master #3881 +/- ##
=========================================
- Coverage 85.56% 84.96% -0.6%
=========================================
Files 164 164
Lines 9565 9566 +1
Branches 1435 1434 -1
=========================================
- Hits 8184 8128 -56
- Misses 1133 1182 +49
- Partials 248 256 +8
Impacted Files | Coverage Δ |
---|---|
scrapy/utils/python.py | 84.89% <100%> (+0.65%) :arrow_up: |
scrapy/core/scraper.py | 86.57% <50%> (-1.94%) :arrow_down: |
scrapy/loader/__init__.py | 92.9% <50%> (-2.47%) :arrow_down: |
scrapy/exporters.py | 99.14% <60%> (-0.86%) :arrow_down: |
scrapy/core/downloader/handlers/s3.py | 62.9% <0%> (-32.26%) :arrow_down: |
scrapy/utils/boto.py | 46.66% <0%> (-26.67%) :arrow_down: |
scrapy/core/downloader/tls.py | 77.5% <0%> (-12.5%) :arrow_down: |
scrapy/extensions/feedexport.py | 78.44% <0%> (-5.05%) :arrow_down: |
scrapy/core/downloader/handlers/http11.py | 89.92% <0%> (-2.62%) :arrow_down: |
... and 8 more |
Merging #3881 into master will increase coverage by 0.14%. The diff coverage is 91.25%.
@@ Coverage Diff @@
## master #3881 +/- ##
==========================================
+ Coverage 84.63% 84.77% +0.14%
==========================================
Files 163 163
Lines 9971 9985 +14
Branches 1485 1487 +2
==========================================
+ Hits 8439 8465 +26
+ Misses 1266 1255 -11
+ Partials 266 265 -1
Impacted Files | Coverage Δ |
---|---|
scrapy/spiders/feed.py | 66.15% <ø> (ø) |
scrapy/commands/parse.py | 20.21% <33.33%> (ø) |
scrapy/contracts/default.py | 84.61% <80.00%> (+0.30%) :arrow_up: |
scrapy/core/scraper.py | 87.97% <80.00%> (ø) |
scrapy/shell.py | 67.96% <80.00%> (-0.25%) :arrow_down: |
scrapy/exporters.py | 100.00% <100.00%> (ø) |
scrapy/item.py | 98.73% <100.00%> (ø) |
scrapy/loader/__init__.py | 95.54% <100.00%> (-0.03%) :arrow_down: |
scrapy/pipelines/files.py | 60.64% <100.00%> (+0.71%) :arrow_up: |
scrapy/pipelines/images.py | 91.81% <100.00%> (+1.16%) :arrow_up: |
... and 8 more |
Hey! I think it requires more documentation, because it affects spider middlewares: process_spider_output method can now get dataclass instances. The same applies to item_scraped, item_dropped, item_error signal handlers, and to item pipelines.
If I'm not mistaken, most item pipelines will need to be changed. For example, does FilesPipeline work with dataclasses?
It is a bit trickier than with dict vs Item, because dicts and Items both support item["field"] access, while dataclasses don't, so some code which worked with items and dicts without any explicit checks will stop working for dataclasses.
Check e.g. pipeline examples at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline: they should work for items and dicts, but they won't work for dataclasses. We need to make it consistent.
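To illustrate the incompatibility, a minimal sketch using the example item class from the spider above:

```python
from dataclasses import dataclass

@dataclass
class InventoryItem:
    name: str
    price: float

item = InventoryItem(name="Chair", price=120.5)

# Attribute access works fine...
assert item.name == "Chair"

# ...but subscript access, which both dicts and scrapy.Item support,
# raises TypeError on a plain dataclass instance
try:
    item["name"]
except TypeError:
    print("dataclass instances are not subscriptable")
```

Any pipeline written in terms of item["field"] therefore fails as soon as a dataclass instance reaches it.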
As a minimum, I think we need to
Thanks for the review @kmike! I added a decorator to allow dictionary-like access to dataclass objects, a FAQ entry about it and some notes in the signals, pipelines and spider middleware pages; let me know what you think about this approach.
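The decorator itself isn't shown in this thread; a minimal sketch of the idea (hypothetical implementation, the actual one in the PR may differ) could look like this:

```python
from dataclasses import dataclass, fields

def subscriptable_dataclass(cls):
    # Hypothetical sketch: add dict-like read/write access to a
    # dataclass, restricted to its declared fields (adding new
    # fields is deliberately not supported here)
    def __getitem__(self, key):
        if key in {f.name for f in fields(self)}:
            return getattr(self, key)
        raise KeyError(key)

    def __setitem__(self, key, value):
        if key in {f.name for f in fields(self)}:
            setattr(self, key, value)
        else:
            raise KeyError(key)

    cls.__getitem__ = __getitem__
    cls.__setitem__ = __setitem__
    return cls

@subscriptable_dataclass
@dataclass
class InventoryItem:
    name: str
    price: float

item = InventoryItem("Chair", 120.5)
item["price"] = 130.0  # dict-like write on a declared field
```

With such a decorator, existing pipelines that use subscript access keep working unchanged.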
I had some problems with the code sample in the docs: sybil.parsers.skip.skip does not seem to be working correctly at 5a97a38 (I also tried with .. skip: start and .. skip: end, to no avail). I ended up ignoring docs/faq.rst, but I'm not a fan of this; please let me know if you know a way to solve it (cc @Gallaecio, the original author of the Sybil integration).
Thanks @elacuesta for the update! It is much more complete now. Remaining concerns include the @subscriptable_dataclass decorator: I think it's a good idea to have something like that, to make items more compatible with existing pipelines, but it may not be enough.

Hey @kmike, thanks for the feedback :smile:
One thing I'd like to discuss about the subscriptable_dataclass decorator: I'd prefer to take the burden of making dataclass items Scrapy-ready away from the user as much as possible, i.e. not requiring them to decorate items in order for built-in components to work properly. With that in mind, the media pipelines "natively" handle dataclass items now (tests updated). This is done by converting items to dicts if necessary in their item processing methods, in order to populate such items with the URLs to be downloaded (without doing something like this, which is kinda hackish IMHO). The implication is that the output of those methods is not of the same type as the input.
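A rough sketch of that conversion, using the stdlib dataclasses helpers (hypothetical function name, not the actual pipeline code):

```python
import dataclasses

@dataclasses.dataclass
class FileItem:
    file_urls: list
    files: list

def get_media_requests_sketch(item):
    # Normalize dataclass items to dicts so the rest of the method can
    # read fields uniformly; note the output type differs from the input
    if dataclasses.is_dataclass(item) and not isinstance(item, type):
        item = dataclasses.asdict(item)
    return item.get("file_urls", [])

urls = get_media_requests_sketch(
    FileItem(file_urls=["https://example.org/a.pdf"], files=[])
)
```

The same function accepts plain dict items unchanged, which is exactly the type-change trade-off discussed below.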
Regarding spider middlewares, I added a note to the process_spider_output method docs pointing to the FAQ entry about dataclass items, which explains the implementation and mentions the subscriptable_dataclass decorator; I believe that should be enough.
I checked built-in components such as extensions, pipelines and middlewares; I think nothing else needs updating at this point, but another pair of eyes checking this is always welcome.
Looking forward to hearing your comments, thanks again!
Sorry to barge in.
The implication is that the output of those methods is not of the same type as the input.
Not sure if I understand this correctly: do you mean that the dataclass item is replaced by a dict in all the subsequent pipelines etc.?
If so, my personal preference and suggestion would be not to do that.
Part of the reason I asked for dataclass support was exactly so that I could write pipelines and middlewares that could receive a dataclass with all its methods, validation, type checking with mypy, etc. We have several of those, all currently dealing with loose dicts, and we would love to move them to dataclasses.
If the issue is some missing fields, I wonder if that could be worked around by either asking the user to inherit the dataclass from a base class (as vanilla as possible -- even more than the current Item), or by a different annotation replacing @dataclass which also injects the fields?
do you mean that the dataclass item is replaced by a dict in all the subsequent pipelines etc?
Indeed, that's what I meant. I don't love the idea either, but I thought it was a reasonable compromise in order to make dataclass items work out of the box. I guess it's not.
This PR introduces a subscriptable_dataclass decorator. I wanted to avoid requiring it, but I think we can probably require it in a first version and eventually relax this requirement. That said, it currently doesn't support adding new fields to the items; I'll need to change that.
I wonder if using a base class would be easier instead -- but I haven't even checked the current implementation.
I say that because: scrapy would probably make that much harder.
What do you think?
Another magical option, for discussion: make a "proxy" class which provides dict-like access to objects, but modifies the original object. Make a decorator which wraps input items to this proxy objects, and unwraps them in the output. Users can then use this decorator in their middlewares to add support for dataclasses.
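That proxy idea could be sketched roughly as follows (hypothetical names, assuming dataclass items; the wrapped object itself is mutated, and can be unwrapped on output):

```python
from dataclasses import dataclass

class DictAccessProxy:
    # Hypothetical sketch: dict-like access that writes through
    # to the wrapped (original) object instead of copying it
    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, key):
        return getattr(self._obj, key)

    def __setitem__(self, key, value):
        setattr(self._obj, key, value)

    def unwrap(self):
        return self._obj

@dataclass
class InventoryItem:
    name: str
    price: float

original = InventoryItem("Chair", 120.5)
proxy = DictAccessProxy(original)
proxy["price"] = 99.9  # the change reaches the original object
```

A middleware decorator could then wrap incoming items in such proxies and unwrap them on the way out, as the comment suggests.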
After checking the current implementation, there seems to be no need to do things differently: unless the item is a dict, both pipelines only populate the item with the result of the download if it already contains a field for it; we can just do exactly that with dataclass items.
Hey @kmike, thanks for the thorough review. Indeed, seems like I'm missing a lot of things, I'll get to it ASAP. Thanks again!
While working on improving the documentation, I’m starting to think that we need to rethink the API we are going to provide for handling items (or “item-like objects”).
I don’t think we should have components use type-specific functions like is_dataclass_instance or dataclass_asdict. I’m hoping we can come up with something more generic, so that if we support yet another item type later, components only need to be upgraded to support new features (but not, for example, for reading or writing key-value pairs).
Would it make sense to create an item adapter class that components can use to interact with input items without needing to care about the underlying type?
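A simplified sketch of that adapter idea (the real itemadapter library supports more item types, including scrapy.Item and attrs classes, and has a richer API):

```python
from dataclasses import asdict, dataclass

class SimpleItemAdapter:
    # Sketch only: one dict-like interface over dicts and dataclass
    # instances, so components never have to check the item type
    def __init__(self, item):
        self.item = item

    def __getitem__(self, key):
        if isinstance(self.item, dict):
            return self.item[key]
        return getattr(self.item, key)

    def __setitem__(self, key, value):
        if isinstance(self.item, dict):
            self.item[key] = value
        else:
            setattr(self.item, key, value)

    def asdict(self):
        if isinstance(self.item, dict):
            return dict(self.item)
        return asdict(self.item)

@dataclass
class Product:
    name: str

# The same component code handles both item types
for item in ({"name": "Chair"}, Product(name="Chair")):
    adapter = SimpleItemAdapter(item)
    assert adapter["name"] == "Chair"
    assert adapter.asdict() == {"name": "Chair"}
```

Supporting a new item type then means extending the adapter once, not touching every component.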
@Gallaecio After our recent conversation I went to update the media pipelines and realized there were a lot more checks in other components as well (item loader and exporters, for instance) so I'm sold on the item adapter suggestion.
I just pushed one commit with said change (91a9a1721c7860603c37797b74bb075fa03b7ca8). Tests are broken because I haven't adapted them yet, but I think the implementation looks much cleaner now. There are no dataclass-related checks other than in the scrapy.item and scrapy.utils.decorators modules.
Please take a look and let me know what you think, I'll be updating the tests shortly.
@elacuesta one more idea: add ItemAdapter imports to the Scrapy project template, for middlewares / pipelines.
Given the way I’m modifying the documentation, it may make sense to rename is_item_like to something like is_item or is_item_instance. The concept of “item-like” would disappear, in favor of talking about “item types”.
That works for me :+1: I think I like is_item better, short and simple. Should I rename it here or do you plan to do it in the docs PR?
I can do it in the documentation PR.
@Gallaecio @kmike @wRAR With #4534 merged and itemadapter available on PyPI, this PR should be ready to be reviewed and merged. Let me know if you have any comments.
@elacuesta itemadapter code could be worth reviewing as well (e.g. there are some things in README which we may need to change). How should we do it, what's your opinion?
@kmike Of course, please do review itemadapter as well. If necessary, I think it'd be ok to create issues there and link them here. This PR adds itemadapter>=0.0.6 to install_requires; if you find any issues with the README alone I think it's fine to just update it, and if code changes are necessary too I can release a new version and update install_requires accordingly.
@elacuesta itemadapter issues created; I've checked the package itself, checked README briefly, and a few other files. Haven't finished reviewing tests.
That's monumental work @elacuesta.
Last comments:
So, +1 to merge after we get a plan for adapter => dict conversion with nested dataclass/attr.s items. Not necessarily solving it, just ensuring we won't need to change APIs later, once we get to this problem.
@kmike Bumped the itemadapter dependency and updated all dict(ItemAdapter(item)) occurrences to use ItemAdapter.asdict.
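The difference matters for nested items, illustrated here with the stdlib dataclasses module (presumably ItemAdapter.asdict performs an analogous recursive conversion for the item types it supports):

```python
from dataclasses import dataclass, asdict

@dataclass
class Author:
    name: str

@dataclass
class Book:
    title: str
    author: Author  # nested item

book = Book("Dune", Author("Frank Herbert"))

# A shallow copy of the fields leaves the nested Author instance as-is...
shallow = dict(vars(book))

# ...while asdict recurses into nested dataclasses, yielding plain dicts
deep = asdict(book)
```

Only the recursive form produces output that serializers like the JSON feed exporter can handle without further type checks.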
There’s also https://github.com/scrapy/scrapy/pull/3881#discussion_r428598241
I see there have been changes in those lines, but I see no mention of “slotted” in the changes.
Merging then, thanks to everyone involved :heart:
To be clear, we discussed the slots thing recently, and agreed that there is no need to mention it. Field names are enforced, in the sense that (un-slotted) attrs classes allow defining arbitrary attributes (as any Python class does), but those are not considered fields:
In [1]: import attr
In [2]: @attr.s
...: class AttrsItem:
...: foo = attr.ib()
...:
In [3]: i = AttrsItem(foo="bar")
In [4]: i.some_attribute = 1
In [5]: attr.asdict(i)
Out[5]: {'foo': 'bar'}
To be clear, we discussed the slots thing recently, and agreed that there is no need to mention it.
@elacuesta I still think slots need to be mentioned, because that's almost the point of having scrapy.Item - to prevent mistakes. With a non-slotted attrs class, a typo in a field name could be even worse than a typo in a dictionary - the data will be silently dropped. scrapy.Item and attr.s with slots prevent this kind of error; we should be promoting it as a best practice.
But this can be done in another PR, that's not a blocker here.
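The protection described above can be seen with plain __slots__, which is what attr.s(slots=True) generates under the hood: assigning a misspelled attribute raises immediately instead of silently losing the data.

```python
class SlottedItem:
    # Equivalent in spirit to a slotted attrs class or scrapy.Item:
    # only the declared field names can be assigned
    __slots__ = ("foo",)

    def __init__(self, foo):
        self.foo = foo

item = SlottedItem(foo="bar")
try:
    item.fooo = "typo"  # misspelled field name
    typo_caught = False
except AttributeError:
    typo_caught = True
```

Contrast this with the non-slotted attrs session above, where the stray attribute was accepted and then simply absent from attr.asdict output.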
Fixes #3761, closes #2807
It seems to me like there would be a limitation in the way we could use dataclasses with ItemLoaders. Given the fact that all fields are supposed to be provided (except for those with default values, or unless a custom constructor is provided) and that loaders try to append values, I think only dataclasses defined as the following would work:
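The definition that followed isn't shown in this extract; presumably it is along these lines, with every field given a default so a loader can instantiate an empty item and populate fields one at a time (hypothetical sketch):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: every field has a default, so the class can be
# instantiated with no arguments, as an item loader would need to do
@dataclass
class InventoryItem:
    name: Optional[str] = None
    price: Optional[float] = None
    stock: Optional[int] = None

item = InventoryItem()  # no arguments required
item.name = "Chair"     # fields filled in incrementally
```

A dataclass with required (default-less) fields could not be constructed empty, which is the limitation being raised.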
Thanks in advance for your feedback!
Update
Checklist:
This patch works with the built-in dataclasses module in Python 3.7+, and with the dataclasses backport in Python 3.6.

Update 2: Checklist for the new implementation based on the ItemAdapter class.

Update 3: Supporting python-attrs now too. Implementation is based on https://github.com/scrapy/itemadapter.