scrapy / itemadapter

Common interface for data container classes
BSD 3-Clause "New" or "Revised" License

Extend supported classes with 3rd party classes? #42

Closed LaundroMat closed 3 years ago

LaundroMat commented 3 years ago

Hi,

I was wondering if it would be feasible to provide an interface for extending the supported classes to third-party types, e.g. pydantic models or mongoengine Documents.

The reason I ask is that I'm currently using pydantic for schema validation, and it would be nice to be able to return a pydantic BaseModel directly from the spider. (I can imagine a similar use case for mongoengine Documents and many other classes too.)
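For illustration, a minimal sketch of what I'd like to be able to do (the Page model and the spider are hypothetical):

    import pydantic
    import scrapy

    class Page(pydantic.BaseModel):
        url: str
        title: str = ""

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com"]

        def parse(self, response):
            # Yield the pydantic model directly instead of a dict or scrapy.Item
            yield Page(url=response.url, title=response.css("title::text").get(""))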

Of course, supporting these classes directly would introduce unnecessary dependencies, so I guess a better way would be to build in extensibility.

I could extend the supported classes myself by subclassing ItemAdapter and overriding its methods, but there are a lot of methods to keep track of, and it doesn't feel very stable.

Have there been any thoughts on providing an interface to extend the supported classes?

Gallaecio commented 3 years ago

I think it makes sense to support these as part of itemadapter, just as we added support for dataclasses or attrs.

The dependencies should not be an issue. For example, itemadapter supports Scrapy’s Item but does not require scrapy as a dependency. We can enable support based on available packages.
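For example, a common pattern for enabling support only when a package is installed looks roughly like this (a sketch, not necessarily itemadapter's exact code):

    try:
        from scrapy.item import Item as ScrapyItem
    except ImportError:
        ScrapyItem = None

    def is_scrapy_item(obj) -> bool:
        # Only claim the object if scrapy is actually installed
        return ScrapyItem is not None and isinstance(obj, ScrapyItem)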

My only suggestion would be to open a second ticket, so that we keep a separate ticket for each suggested new item type instead of discussing two different types here, as they can (and probably should) be implemented in separate pull requests.

elacuesta commented 3 years ago

@LaundroMat Thanks for the idea. Support for arbitrary types was just merged (#45) and released (v0.2.0). Please try it and give us your feedback.

LaundroMat commented 3 years ago

Thanks for this addition!

I'm not using it in production yet (I'm still adding tests for the changes this entails), but so far things seem to work fine. In the meantime I've switched from pydantic to marshmallow, so the adapter interface is very timely for me 👍

To be honest, I had to wrap my head around the finer details of the implementation to get things rolling, but that's probably just me. FWIW, the main sticking point was figuring out whether validation is the adapter's responsibility.

To give an example: suppose I have a document where url is a required field. In my custom MongoItemAdapter, I have this is_item classmethod:

    from typing import Any

    import mongoengine
    from itemadapter import AdapterInterface, ItemAdapter, is_item

    class Webpage(mongoengine.DynamicDocument):
        url = mongoengine.URLField(unique=True, required=True)
        title = mongoengine.StringField()

    class MongoItemAdapter(AdapterInterface):
        ...
        @classmethod
        def is_item(cls, item: Any) -> bool:
            try:
                # validate() raises ValidationError if required fields are missing;
                # non-mapping items raise TypeError when unpacked
                Webpage(**item).validate()
            except (TypeError, mongoengine.errors.ValidationError):
                return False
            return True
        ...

    ItemAdapter.ADAPTER_CLASSES.appendleft(MongoItemAdapter)
    item = {'title': 'Something something'}    # url is missing, but is required by Webpage
    assert is_item(item) is False    # fails, because the built-in dict adapter returns True for any dict

My custom adapter will reject the item, but the built-in dict adapter will not (it accepts any dict). So, is validation a responsibility of the adapter? And should I then catch validation errors later (e.g. when saving the document)?

I could also do this:

    ItemAdapter.ADAPTER_CLASSES = [MongoItemAdapter]    # replace the defaults entirely

but using itemadapter would then be unnecessary, of course.

I guess what I'm asking here is: is there a way to know which adapters have rejected an item (if validation should be itemadapter's concern at all, of course)?
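One way I can probe this myself is to check the registered adapters directly (a sketch, reusing MongoItemAdapter and item from the example above):

    matching = [
        adapter_class
        for adapter_class in ItemAdapter.ADAPTER_CLASSES
        if adapter_class.is_item(item)
    ]
    # MongoItemAdapter is absent, but the built-in dict adapter still matches
    print(matching)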

Gallaecio commented 3 years ago

The ItemAdapter interface is not meant to handle validation, no. is_item will usually be as easy to implement as return isinstance(item, <new class to support>).
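Applied to the example above, the check would shrink to something like this (a sketch reusing the Webpage class):

    @classmethod
    def is_item(cls, item: Any) -> bool:
        # A plain membership check; field validation happens elsewhere
        return isinstance(item, Webpage)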

In Scrapy, validation would be handled in a spider, a spider middleware or an item pipeline.
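For instance, a minimal item pipeline along these lines could own the validation step (a sketch; the pipeline name and the required-field check are illustrative):

    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem

    class RequiredFieldsPipeline:
        def process_item(self, item, spider):
            adapter = ItemAdapter(item)
            # Drop items that are missing the required url field
            if not adapter.get("url"):
                raise DropItem("Missing required field: url")
            return item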

LaundroMat commented 3 years ago

Ah, thanks to your mention of return isinstance(item, <new class to support>), I now understand the reasoning behind is_item. Keep up the great work!