Post-do functions

tomaszgy commented 8 years ago

Proposal

It is proposed to add a new feature to doJSON: post-do functions, which would be called after converting a document from XML to JSON (that is, after all related elements in the XML have been processed).

Why?

Two use cases from Inspire:

I After every conversion from MARC to JSON we use our util function called clean_record to strip empty values from the record and also remove duplicate elements from the lists. This function appears many times in tests. If we could configure it somewhere globally, that would make things easier for us (and would save many lines of code, too).

II Data from legacy are sometimes quite messy. Imagine for example the following bit inside MARC:

<datafield tag="411" ind1=" " ind2=" ">
  <subfield code="n">7</subfield>
</datafield>
<datafield tag="411" ind1=" " ind2=" ">
  <subfield code="a">FPCP</subfield>
</datafield>

Subfield n is a conference series number, whereas a stands for the conference series name. In the example above, they both refer to the same conference series, so we'd like to convert this piece of MARC into a single key-value pair in our JSON:

{
   "series": {
       "name": "FPCP",
       "number": 7
   }
}

As far as I understand current doJSON, we cannot retrieve all 411 occurrences inside one rule function call, which makes the above example a bit complicated. The current workaround looks like this: when we encounter a single n or a single a, we store it in a temporary field and make no other changes to the JSON inside the rule function. When we encounter a single n or a again, we check whether the temporary field contains a value of the opposite kind and, if so, couple the two and return the pair from the rule function.

Now, at the end of this process we'd like to get rid of all those temporary fields. The post-do functions would make it possible and easy to configure.
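The workaround described above can be sketched roughly as follows. This is a standalone illustration and not Inspire's actual rule: handle_411 and LOOSE_KEY are hypothetical names, the datafield is assumed to arrive as a plain dict of subfield codes, and the temporary field name is borrowed from the test shown later in this thread.

```python
LOOSE_KEY = "_loose_411_elements"  # temporary field; name taken from the test later in the thread


def handle_411(record, subfields):
    """Process one 411 datafield (hypothetical helper).

    Returns a series dict once a name/number pair is complete, otherwise None.
    `record` is the JSON being built; `subfields` maps subfield codes to values.
    """
    name = subfields.get("a")
    number = subfields.get("n")

    if name is not None and number is not None:
        # Both halves arrived in the same datafield: nothing to pair up.
        return {"name": name, "number": int(number)}

    # Only one half is present: remember it in the temporary field.
    loose = record.setdefault(LOOSE_KEY, {})
    if name is not None:
        loose["name"] = name
    if number is not None:
        loose["number"] = int(number)

    if "name" in loose and "number" in loose:
        # The opposite kind was stored earlier: couple them and reset the store.
        record[LOOSE_KEY] = {}
        return {"name": loose["name"], "number": loose["number"]}
    return None


record = {}
for subfields in ({"n": "7"}, {"a": "FPCP"}):
    series = handle_411(record, subfields)
    if series is not None:
        record["series"] = series
```

After the loop, record holds the coupled series, but the temporary _loose_411_elements key is still present. That leftover key is what a post-do function would remove.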

jirikuncar commented 8 years ago

@tomaszgy can you provide an example of a definition and usage of such functions?

Are you aware that you can define your own CLI processor (see a simple example) and use command chaining?

tomaszgy commented 8 years ago

@jirikuncar, sure.

Please see the definition and a few examples of usage (1, 2, 3) of the clean_record function.
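The linked definition is not reproduced in this thread. As a rough idea only, a function with the behaviour described earlier (strip empty values, remove duplicate list elements) might look like the sketch below. It is an assumption, not Inspire's implementation, and it only handles the top level of the record.

```python
EMPTY_VALUES = (None, "", [], {})  # assumed definition of "empty"


def clean_record(record):
    """Return a copy of `record` without empty values and duplicate list items."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, list):
            unique = []
            for item in value:
                # Keep first occurrences only; `in` also works for unhashable dicts.
                if item not in EMPTY_VALUES and item not in unique:
                    unique.append(item)
            value = unique
        if value not in EMPTY_VALUES:
            cleaned[key] = value
    return cleaned
```

Note that values such as 0 or False are kept, since they do not compare equal to any of the assumed empty values.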

Also, regarding example II that I posted above, such a function would look like this:

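def clean_fields(record, fields=None):
    # Drop the given temporary keys from the converted record.
    for field in fields or ():
        record.pop(field, None)
    # Return the record so the call can be chained, as in the test below.
    return record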

and an example of usage would look like this:

def test_series_and_series_number_from_411_n_and_411_a():
    snippet = (
        '<record>'
        '  <datafield tag="411" ind1=" " ind2=" ">'
        '    <subfield code="n">3</subfield>'
        '  </datafield>'
        '  <datafield tag="411" ind1=" " ind2=" ">'
        '    <subfield code="a">Gordon</subfield>'
        '  </datafield>'
        '</record>'
    )  # record/972145

    result = clean_fields(clean_record(conferences.do(create_record(snippet))),
                                     fields=["_loose_411_elements"])

    ...

Thanks for the example of a CLI processor and command chaining. Unfortunately, we're not using the CLI in Inspire, so I am afraid it does not solve our problem.

jirikuncar commented 8 years ago

I'm sorry I just don't get the example. It looks like standard function chaining that doesn't require any change in DoJSON API.

jacquerie commented 8 years ago

Sure, chaining functions works, but it adds boilerplate in tests and wherever one uses a DoJSON conversion.

What @tomaszgy is asking for is a single point where we define a series of functions that are automatically run in succession when a certain set of DoJSON rules has finished executing. It doesn't require a change in the API; it is an extra feature that we would have liked to have while writing several thousand lines of DoJSON rules and tests.

jirikuncar commented 8 years ago

"single point where we define a series of functions that are automatically run in succession when a certain set of DoJSON rules has finished executing."

@tomaszgy @jacquerie I would recommend that you subclass Overdo and override its do method.
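The subclassing idea can be sketched as below. To keep the example runnable without DoJSON installed, Model is a stand-in for a DoJSON model class and is not the real Overdo; PostDoModel, post_do and drop_temporary_fields are hypothetical names. With the real library, the subclass would presumably extend Overdo and forward do's arguments in the same way.

```python
class Model:
    """Stand-in for a DoJSON model; only the `do` entry point matters here."""

    def do(self, blob):
        # A real model would apply its conversion rules to `blob` here.
        return dict(blob)


class PostDoModel(Model):
    """Model that runs a configured list of functions after every conversion."""

    def __init__(self, post_do=None):
        super().__init__()
        self.post_do = list(post_do or [])

    def do(self, blob, **kwargs):
        result = super().do(blob, **kwargs)
        for function in self.post_do:
            # Each function receives the record and returns the record to keep.
            result = function(result)
        return result


def drop_temporary_fields(record):
    """Remove helper keys left behind by the rules (assumed `_loose_` prefix)."""
    return {k: v for k, v in record.items() if not k.startswith("_loose_")}


conferences = PostDoModel(post_do=[drop_temporary_fields])
result = conferences.do(
    {"series": {"name": "FPCP", "number": 7}, "_loose_411_elements": {}}
)
```

The post-do functions are configured once, where the model is defined, so tests and other callers only need to call conferences.do(...).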

jacquerie commented 8 years ago

If anybody else needs this feature, here's a way to build it: https://github.com/inspirehep/inspire-next/blob/d1c1f1d0edb4a404eb9ebf358c2ef57b6932260a/inspirehep/dojson/model.py and here's a way to use it: https://github.com/inspirehep/inspire-next/blob/d1c1f1d0edb4a404eb9ebf358c2ef57b6932260a/inspirehep/dojson/conferences/model.py.

inveniosoftware / dojson

Post-do functions #173
