inveniosoftware / dojson

Simple pythonic JSON to JSON converter.
https://dojson.readthedocs.io

filter away empty fields/subfields after input #165

Open kaplun opened 8 years ago

kaplun commented 8 years ago

Problem

Currently, utils.filter_values() only filters out dictionary keys (and their values) where the value is None.

Concretely, e.g. in the context of MARC21 conversion to JSON, this means that subfields containing empty strings are preserved, and so are datafields with no subfields.
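To make the current behaviour concrete, here is a minimal sketch. `drop_none` is a hypothetical stand-in that mimics the None-only filtering described above; it is not dojson's actual code, and the sample dictionaries are invented for illustration.

```python
def drop_none(d):
    """Hypothetical stand-in: keep every key whose value is not None."""
    return {k: v for k, v in d.items() if v is not None}


# Subfield "b" holds an empty string, subfield "c" holds None.
subfields = {"a": "Foo", "b": "", "c": None}
filtered_subfields = drop_none(subfields)
print(filtered_subfields)  # {'a': 'Foo', 'b': ''}

# A datafield that ended up with no subfields at all survives too.
filtered_record = drop_none({"123": {}})
print(filtered_record)  # {'123': {}}
```

Only the None entry disappears; the empty string and the empty dictionary both pass through unchanged.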

Proposal

If we assume that an empty string in the bibliographic metadata context doesn't carry any valuable information, it is proposed that filter_values actually filters away any key whose value is:

According to TIND, @Kennethhole reports:

I can confirm that TIND does not intend to use empty fields. However, it is highly likely that there are empty subfields in our databases and we prefer that dojson don't break due to that! From our point of view, these subfields can be removed during the conversion.

Regarding INSPIRE, I can confirm that we have no use for empty values. Internally we went further and implemented a function that recursively visits the whole record and also strips away the empty lists and empty dicts that result from filtering values. https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/utils/__init__.py#L206
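A rough sketch of what such a recursive clean-up could look like follows. This is not the INSPIRE function linked above; the names `strip_empty` and `_is_empty` are hypothetical, and the definition of "empty" (None, empty string, empty list, empty dict) is an assumption.

```python
def _is_empty(value):
    # Assumption: only None, "" and empty containers count as empty.
    # Falsy values such as 0 and False are kept.
    return value is None or value == "" or value == [] or value == {}


def strip_empty(obj):
    """Recursively drop empty values, including containers that become empty."""
    if isinstance(obj, dict):
        cleaned = {k: strip_empty(v) for k, v in obj.items()}
        return {k: v for k, v in cleaned.items() if not _is_empty(v)}
    if isinstance(obj, (list, tuple)):
        # Note: tuples come back as lists in this sketch.
        cleaned = [strip_empty(v) for v in obj]
        return [v for v in cleaned if not _is_empty(v)]
    return obj


record = {
    "title": "Foo",
    "note": "",
    "authors": [{"name": ""}, {"name": "Bar"}],
    "ids": [None],
    "count": 0,
}
stripped = strip_empty(record)
print(stripped)  # {'title': 'Foo', 'authors': [{'name': 'Bar'}], 'count': 0}
```

Children are cleaned before their parent is tested, so a list or dict that only held empty values is removed as well.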

See also:

tiborsimko commented 8 years ago

... in other words, do we want to support MARC21 records containing "empty fields" such as:

<datafield tag="123" ind1="4" ind2="5">
</datafield>

and "empty subfields" such as:

<datafield tag="123" ind1="4" ind2="5">
  <subfield code="a">Foo</subfield>
  <subfield code="b"></subfield>
</datafield>

or do we want to always remove these empty fields/subfields?
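For illustration, the "always remove" option could look roughly like this on the XML side. It is a sketch using only the standard library; `prune_empty` is a hypothetical helper, the wrapping `<record>` element is added for the example, whitespace-only subfields are assumed to count as empty, and the MARCXML namespace used by real records is ignored.

```python
import xml.etree.ElementTree as ET


def prune_empty(datafields):
    """Drop empty subfields, then drop datafields left without subfields."""
    kept = []
    for field in datafields:
        for sub in list(field.findall("subfield")):
            # Assumption: whitespace-only text counts as empty.
            if not (sub.text or "").strip():
                field.remove(sub)
        if field.find("subfield") is not None:
            kept.append(field)
    return kept


xml = """<record>
  <datafield tag="123" ind1="4" ind2="5">
  </datafield>
  <datafield tag="123" ind1="4" ind2="5">
    <subfield code="a">Foo</subfield>
    <subfield code="b"></subfield>
  </datafield>
</record>"""

record = ET.fromstring(xml)
kept = prune_empty(record.findall("datafield"))
codes = [sub.get("code") for sub in kept[0].findall("subfield")]
print(len(kept), codes)  # 1 ['a']
```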

CC @aw-bib @martinkoehler @fjorba @jma @basaglia

CC @inveniosoftware/triagers

aw-bib commented 8 years ago

Just cross-checked with our librarians to be sure not to miss esoteric cases:

As for TIND's comment: our librarians confirmed that e.g. Aleph allows loading empty fields/subfields on ingestion of external data (i.e. bibupload on the shell). However, Aleph's bibedit silently and automatically removes any of these fields once a cataloguer opens and stores such a record. That is, even if you deliberately add an empty field/subfield in Aleph's bibedit, you cannot save it. Thus, you cannot rely on an empty field being preserved in this commercial system: as soon as a cataloguer touches such a record, these fields get stripped. (IMHO Aleph is at least inconsistent here, with a tendency to strip.)

kaplun commented 8 years ago

OK. Given the above and:

jirikuncar commented 8 years ago

Then we should have a specific filter_values decorator just for MARC21. Or simply add a new filter for the command line that removes empty values.
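A minimal sketch of what such a decorator might look like. The name `filter_empty_values`, the rule function `imprint`, and the field names are all hypothetical; nothing here is part of dojson's API, and the notion of "empty" is an assumption.

```python
import functools


def is_empty(value):
    # Assumption: None, "", [] and {} are empty; 0 and False are kept.
    return value is None or value == "" or value == [] or value == {}


def filter_empty_values(f):
    """Hypothetical decorator: drop keys with empty values from the result."""
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        result = f(*args, **kwargs)
        return {k: v for k, v in result.items() if not is_empty(v)}
    return wrapper


@filter_empty_values
def imprint(key, value):
    # Hypothetical conversion rule mapping subfield codes to JSON keys.
    return {
        "place": value.get("a"),
        "publisher": value.get("b"),
        "date": value.get("c"),
    }


converted = imprint("260__", {"a": "Geneva", "b": ""})
print(converted)  # {'place': 'Geneva'}
```

Keeping this separate from the existing None-only filter would let only the MARC21 rules opt in to the stricter behaviour.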

kaplun commented 8 years ago

Such as the general one we are using in INSPIRE? https://github.com/inspirehep/inspire-next/blob/master/inspirehep/dojson/utils/__init__.py#L245

tiborsimko commented 8 years ago

Yes, I think we can close this RFC to say that empty values in fields/subfields should be "tolerated" on the input upload side, but that we can delete them internally as soon as we spot them.