Closed manycoding closed 5 years ago
I think we can easily accomplish this by adding a pipeline to Spidermon. Such a pipeline could be enabled automatically when someone enables the ItemValidationPipeline.
An example would be something like:
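class NoneForMissingFieldsPipeline:
    def process_item(self, item, spider):
        # some work here (set missing fields to None)
        return item


class ItemValidationPipeline(NoneForMissingFieldsPipeline):
    def process_item(self, item, spider):
        # run the parent step first, then validate its result
        item = super().process_item(item, spider)
        # some work here
        return item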
We can have this as the default; I don't think this change will break backward compatibility. But it would be nice to have an option to control it, for example SPIDERMON_NONE_FOR_MISSING_FIELDS.
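A minimal sketch of what such a pipeline could look like, assuming items are plain dicts and the list of expected field names is known up front. It does not use Scrapy or Spidermon APIs: `from_config`, `expected_fields` and the plain-dict `settings` argument are hypothetical names for illustration, and SPIDERMON_NONE_FOR_MISSING_FIELDS is only the option name floated above, not an existing setting.

```python
class NoneForMissingFieldsPipeline:
    """Illustrative only: add every expected field that is absent, set to None."""

    def __init__(self, expected_fields, enabled=True):
        self.expected_fields = tuple(expected_fields)
        self.enabled = enabled

    @classmethod
    def from_config(cls, settings, expected_fields):
        # Hypothetical helper: reads the proposed toggle from a plain dict.
        # Defaults to True, matching the "have this as the default" suggestion.
        enabled = settings.get("SPIDERMON_NONE_FOR_MISSING_FIELDS", True)
        return cls(expected_fields, enabled=enabled)

    def process_item(self, item, spider=None):
        if not self.enabled:
            return item
        filled = dict(item)  # copy, so the spider's original item is untouched
        for field in self.expected_fields:
            filled.setdefault(field, None)
        return filled


items = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
fields = ["availability", "_key"]

on = NoneForMissingFieldsPipeline.from_config({}, fields)
off = NoneForMissingFieldsPipeline.from_config(
    {"SPIDERMON_NONE_FOR_MISSING_FIELDS": False}, fields
)

filled_items = [on.process_item(item) for item in items]
untouched_items = [off.process_item(item) for item in items]
```

With the toggle off the items pass through unchanged, which is the behaviour a spider developer would need if dropping the field is intentional.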
I tend to be against Spidermon changing the contents of a returned item. If the spider returned the item without a field, I don't think it is Spidermon's job to add it back as None. Omitting it may be what the spider developer wanted.
Promote a uniform ideology in the company - (missing field = None or np.nan)
Isn't it too hardcore? Also, those projects (scrapy, spidermon, arche) are meant to be used by other people and organizations with different requirements and use cases.
Then for json schema, it always will be null type - e.g. "type": ["string", "null"]
This could hide some edge cases when the spider is not returning the field.
Here in spidermon in particular, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
I believe this is a code smell. The way arche handles data internally is leaking across multiple repositories. It should be transparent.
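To make the concern concrete, here is a small sketch of the explicit conversion being discussed, using the example items from the thread. `fill_missing` is a hypothetical helper, not part of spidermon or arche. It also shows the side effect raised above: once converted, an item that dropped the field can no longer be told apart from one that set it to None.

```python
def fill_missing(items, fields):
    """Return copies of items in which every name in fields is present (None if absent)."""
    return [{**dict.fromkeys(fields), **item} for item in items]


fields = ["availability", "_key"]

# The two conventions for a missing value, as in the thread's example.
dropped = [{"availability": 1, "_key": "0"}, {"_key": "1"}]
explicit = [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]

converted = fill_missing(dropped, fields)

same_before = dropped == explicit
same_after = converted == fill_missing(explicit, fields)
```

Here `same_before` is False and `same_after` is True, so any check that runs after the conversion cannot detect that the spider never returned the field.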
Isn't it too hardcore?
Promote means encourage. There's always a choice. But I believe we already make this NAN choice in most cases, so we can nudge things toward being more consistent. In cases where we cannot (the requirements are very specific), that's ok too.
The way arche handles data internally is leaking across multiple repositories.
I don't want bad practices either, and my idea about converting under the hood is starting to look like one :) But there are no tricks in Arche at the moment; it's transparent (at least after https://github.com/scrapinghub/arche/pull/87 is closed).
I am closing this since it's more about internal coding practice.
Coming from here https://github.com/scrapinghub/arche/issues/83
I would like to treat missing values consistently, but I would also love to keep json schemas working and keep spidermon and arche compatible. By inconsistency I mean that if some field's value is missing, one might discard the field:
[{"availability": 1, "_key": "0"}, {"_key": "1"}]
Or make it None:
[{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
Empty strings "" are consistent, so no issues here. Either of these approaches to storing missing values requires a different json schema (and maybe schematics too). If using pandas, it will just put NAN in both cases.
So, about consistency - it can look like:
1. Promote a uniform ideology in the company - (missing field = None or np.nan)
2. Then for json schema, it always will be null type - e.g. "type": ["string", "null"]
3. (bad idea) Here in spidermon in particular, it will require converting data explicitly to account for this - e.g.
[{"availability": 1, "_key": "0"}, {"_key": "1"}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}]
More information here https://github.com/scrapinghub/arche/issues/83
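As a rough illustration of why the two ways of storing a missing value need different schemas, the sketch below writes one schema fragment per convention and checks the example items against both. The schemas are assumptions for the sake of the example (in particular the "required" lists and the integer type for availability), and `conforms` is a tiny hand-rolled stand-in that only understands "required" and "type"; a real project would use an actual JSON Schema validator.

```python
# Convention A: a missing value means the field is dropped from the item.
SCHEMA_FIELD_DROPPED = {
    "type": "object",
    "properties": {"availability": {"type": "integer"}},
    "required": ["_key"],  # availability may be absent
}

# Convention B: a missing value is stored explicitly as null / None.
SCHEMA_FIELD_NULL = {
    "type": "object",
    "properties": {"availability": {"type": ["integer", "null"]}},
    "required": ["_key", "availability"],  # must be present, may be null
}

_PY_TYPES = {"integer": int, "string": str, "null": type(None)}


def conforms(item, schema):
    """Check only 'required' and per-property 'type'; not a full validator."""
    if any(name not in item for name in schema.get("required", [])):
        return False
    for name, rules in schema.get("properties", {}).items():
        if name not in item:
            continue
        allowed = rules["type"]
        allowed = [allowed] if isinstance(allowed, str) else allowed
        if not any(isinstance(item[name], _PY_TYPES[t]) for t in allowed):
            return False
    return True


dropped_item = {"_key": "1"}
null_item = {"availability": None, "_key": "1"}

results = {
    "dropped_vs_A": conforms(dropped_item, SCHEMA_FIELD_DROPPED),
    "null_vs_A": conforms(null_item, SCHEMA_FIELD_DROPPED),
    "dropped_vs_B": conforms(dropped_item, SCHEMA_FIELD_NULL),
    "null_vs_B": conforms(null_item, SCHEMA_FIELD_NULL),
}
```

Each item passes only the schema written for its own convention, which is the incompatibility the issue describes.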