scrapy / itemloaders

Library to populate items using XPath and CSS with a convenient API
BSD 3-Clause "New" or "Revised" License

Fix empty __init__ limitation for dataclasses #14

Open ejulio opened 4 years ago

ejulio commented 4 years ago

Hm, that's an interesting workaround. It means that the user can't create Product(name='Plasma Tv', price=999.98) manually. I'm not sure I understand all the consequences. It seems we don't test init=False dataclasses in itemadapter (//cc @elacuesta), so I'm not sure what the behavior is, e.g. with serialization, if fields are missing.
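The init=False limitation mentioned above can be illustrated with a small sketch. Product here is a hypothetical example item, not code from the repository:

```python
from dataclasses import dataclass

# With init=False, the generated __init__ takes no field arguments,
# so a loader can instantiate the class empty, but the normal
# keyword constructor is lost.
@dataclass(init=False)
class Product:
    name: str
    price: float

p = Product()          # works, but no fields are set yet
p.name = 'Plasma Tv'   # fields must be assigned afterwards

try:
    Product(name='Plasma Tv', price=999.98)
except TypeError:
    pass  # raises: the class no longer accepts constructor arguments
```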

Overall, this looks like an issue in itemloaders that can be fixed, not a problem with dataclasses or itemadapter. Fixing it is out of scope for this PR, though.

What do you think about asking users to provide default field values instead, and saying "currently" when explaining this limitation? We may want to lift it in the future.

Originally posted by @kmike in https://github.com/scrapy/itemloaders/pull/13
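The default-values alternative suggested in the quote above can be sketched as follows (Product is again a hypothetical item):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    # Defaults make the empty call Product() valid for the loader,
    # while Product(name=..., price=...) still works for users.
    name: Optional[str] = None
    price: Optional[float] = None

empty = Product()
full = Product(name='Plasma Tv', price=999.98)
```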

ejulio commented 4 years ago

This issue may provide some other ideas as well https://github.com/scrapy/itemadapter/issues/30

ejulio commented 4 years ago

Refer to https://github.com/scrapy/itemloaders/pull/13#discussion_r444905860 for a few ideas

ejulio commented 4 years ago

Quoting @kmike here (from https://github.com/scrapy/itemloaders/pull/13#discussion_r445184176)

About the change to use a dict internally: we expose it to the outside as the item property on ItemLoader, so we need to be careful, since it will change the interface a bit.

Not necessarily: the .item property can be computed on the fly. What I'm proposing is that .load_item() returns an item of the type you asked for. If a dataclass or attr.s class is the default_item_class, its field metadata is inspected, but the item is actually instantiated only when .load_item() is called, not immediately. From the user's point of view the API should stay the same (I hope; I haven't looked in detail), but the item of default_item_class is only created towards the end of the extraction, not at the beginning.

This allows the item to exist in a "partial", unfinished state until extraction is done and .load_item() is called. In this case it can be a good thing that creation of the item fails on missing fields, e.g. because some required field was not added via .add_value() or similar methods.

This change requires careful inspection of the Item Loaders API, tests, documentation, etc. It is a huge change, which we would have to try hard to make backwards compatible. I think it is out of scope for this PR.
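A rough, self-contained sketch of this deferred-instantiation idea. This is not the actual ItemLoader code; LazyLoader and its methods are invented for illustration:

```python
from dataclasses import dataclass, fields

@dataclass
class Product:
    name: str
    price: float

class LazyLoader:
    """Collects values in a plain dict and only instantiates
    default_item_class when load_item() is called."""

    def __init__(self, default_item_class):
        self.default_item_class = default_item_class
        self._values = {}  # the "partial", unfinished item state

    def add_value(self, field_name, value):
        # Field metadata can be inspected without creating an instance.
        names = {f.name for f in fields(self.default_item_class)}
        if field_name not in names:
            raise KeyError(field_name)
        self._values[field_name] = value

    @property
    def item(self):
        # Computed on the fly from the collected values.
        return dict(self._values)

    def load_item(self):
        # Instantiation happens here; a missing required field
        # fails loudly at the end of extraction.
        return self.default_item_class(**self._values)
```

Usage: add_value() calls accumulate into the internal dict, and only load_item() constructs the dataclass, so init=False or missing-default classes are never instantiated half-filled.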

ejulio commented 4 years ago

Just adding a new bit of info. An item is created in __init__ and returned by load_item, but the user can also access it through the item property on ItemLoader. Also, the item is set in the context right after it is created (https://github.com/scrapy/itemloaders/blob/master/itemloaders/__init__.py#L96)... So, keep this in mind while working on this issue.
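A simplified sketch of the current eager flow described above, for contrast with the deferred proposal (EagerLoader is an invented name, not the real implementation):

```python
class EagerLoader:
    def __init__(self, item=None, default_item_class=dict):
        # The item is created up front, in __init__ ...
        self._item = item if item is not None else default_item_class()
        # ... and placed in the context right away, so anything that
        # reads the context sees the (possibly empty) item immediately.
        self.context = {'item': self._item}

    @property
    def item(self):
        return self._item

    def load_item(self):
        # load_item() just returns the already-created item.
        return self._item
```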

Gallaecio commented 4 years ago

Do we consider this a bug report or a feature request? :slightly_smiling_face:

ejulio commented 4 years ago

I guess it comes down to the change. If we decide to change the API to accommodate this limitation, then it's a feature request. If we consider the API fine and expect it to work despite this dataclass limitation, then it's a bug report.