NAN consistency in JSON schema, pandas and numpy

manycoding commented 5 years ago

Coming from here https://github.com/scrapinghub/arche/issues/83

I would like to treat missing values consistent, but I would also love to keep json schemas work and keep spidermon and arche compatible. By inconsistency I mean that if some field's value is missing, one might discard the field: [{"availability": 1, "_key": "0"}, {"_key": "1"}}] Or make it None [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}}] Empty strings "" are consistent, so no issues here. Either of this approaches of storing missing values requires different json schema (and maybe schematics too). If using pandas, it will just put NAN in both cases.

So, about consistency - it can look like:

Promote a uniform ideology in the company - (missing field = None or np.nan)
Then for json schema, it always will be null type - e.g. "type": ["string", "null"]

bad idea ~3. Here in spidermon in particularly, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}}]~

More information here https://github.com/scrapinghub/arche/issues/83

raphapassini commented 5 years ago

I think we can easily accomplish this by adding a Pipeline to Spidermon, such Pipeline can be automatically enabled when someone enable the ItemValidationPipeline

An example would be something like:


class NoneForMissingFieldsPipeline:
        process_item(self, item):
             # some work here

class ItemValidationPipeline(NoneForMissingFieldsPipeline):
        process_item(self, item):
             super().process_item(item)
             # some work here

We can have this as a default, I don't think this change will break anything backwards. But would be nice to have an option to control this SPIDERMON_NONE_FOR_MISSING_FIELDS for example

rennerocha commented 5 years ago

I have the tendency to be contrary to Spidermon changing the contents of a returned item. If the spider returned the item without content, I don't think it is Spidermon's job to include it back with None. It may be something desired by the spider developer.

victor-torres commented 5 years ago

Promote a uniform ideology in the company - (missing field = None or np.nan)

Isn't it too hardcore? Also, those projects (scrapy, spidermon, arche) are meant to be used by other people and organizations with different requirements and use cases.

Then for json schema, it always will be null type - e.g. "type": ["string", "null"]

This could hide some edge cases when the spider is not returning the field.

Here in spidermon in particularly, it will require converting data explicitly to account for this - e.g. [{"availability": 1, "_key": "0"}, {"_key": "1"}}] > [{"availability": 1, "_key": "0"}, {"availability": None, "_key": "1"}}]

I believe this is a code smell. The way arche handles data internally is leaking across multiple repositories. It should be transparent.

manycoding commented 5 years ago

Isn't it too hardcore?

Promote means encourage. There's always a choice. But I believe we make this NAN choice in most cases, so we can affect it to be more consistent. In cases when we cannot (the requirements are very specific) - that's ok too.

The way arche handles data internally is leaking across multiple repositories.

I don't want bad practices either, and my idea about converting under the hood starts looking like one :) But there're no tricks in Arche at the moment, it's transparent (at least after the https://github.com/scrapinghub/arche/pull/87 is closed)

manycoding commented 5 years ago

I am closing this since it's more about internal coding practice.

scrapinghub / spidermon

NAN consistency in JSON schema, pandas and numpy #161