Closed tomaszgy closed 8 years ago
@tomaszgy can you provide an example of a definition and usage of such functions?
Are you aware that you can define your own CLI processor (see a simple example) and use command chaining?
@jirikuncar, sure.
Please see the definition and a few usage examples (1, 2, 3) of the clean_record function.
Also, regarding use case II from my proposal, an example of such a function would look like this:

    def clean_fields(self, fields=None):
        # Remove the given keys (e.g. temporary fields) from the record.
        if fields:
            for field in fields:
                self.pop(field, None)
        return self  # return the record so the call can be chained

and an example of usage would look like this:

    def test_series_and_series_number_from_411_n_and_411_a():
        snippet = (
            '<record>'
            ' <datafield tag="411" ind1=" " ind2=" ">'
            ' <subfield code="n">3</subfield>'
            ' </datafield>'
            ' <datafield tag="411" ind1=" " ind2=" ">'
            ' <subfield code="a">Gordon</subfield>'
            ' </datafield>'
            '</record>'
        )  # record/972145

        result = clean_fields(clean_record(conferences.do(create_record(snippet))),
                              fields=["_loose_411_elements"])
        ...

Thanks for the example of a CLI processor and command chaining. Unfortunately, we're not using the CLI in Inspire, so I'm afraid it does not solve our problem.
I'm sorry, I just don't get the example. It looks like standard function chaining that doesn't require any change in the DoJSON API.
Sure, chaining functions works, but it adds boilerplate in tests and wherever one uses a DoJSON conversion.
What @tomaszgy is asking for is a single point where we define a series of functions that are automatically run in succession when a certain set of DoJSON rules has finished executing. It doesn't require a change in the API; it is an extra feature that we would have liked while writing several thousands of lines of DoJSON rules and tests.
> single point where we define a series of functions that are automatically run in succession when a certain set of DoJSON rules has finished executing.
@tomaszgy @jacquerie I would recommend that you subclass Overdo and override its do method.
If anybody else needs this feature, here's a way to build it: https://github.com/inspirehep/inspire-next/blob/d1c1f1d0edb4a404eb9ebf358c2ef57b6932260a/inspirehep/dojson/model.py and here's a way to use it: https://github.com/inspirehep/inspire-next/blob/d1c1f1d0edb4a404eb9ebf358c2ef57b6932260a/inspirehep/dojson/conferences/model.py.
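
For readers who only want the shape of the idea, here is a minimal sketch of the subclassing approach recommended above. It is not the code behind the links. `BaseConverter` is a hypothetical stand-in for a DoJSON model such as `Overdo`, used so the example runs without DoJSON installed. `PostDoConverter`, `strip_empty` and `drop_fields` are illustrative names too, and the input record is made-up data. A real subclass would also need to forward the constructor and `do` arguments of the class it extends.

```python
class BaseConverter:
    """Hypothetical stand-in for a DoJSON model (e.g. Overdo)."""

    def do(self, blob):
        # A real model would apply its conversion rules here;
        # this stub simply copies the input mapping.
        return dict(blob)


class PostDoConverter(BaseConverter):
    """Runs a fixed series of functions on the result of every conversion."""

    def __init__(self, post_do=None):
        self.post_do = list(post_do or [])

    def do(self, blob):
        result = super().do(blob)
        # Apply each post-do function in the order it was registered.
        for func in self.post_do:
            result = func(result)
        return result


def strip_empty(record):
    """Drop top-level keys whose value is None, '', [] or {}.

    A simplified stand-in for a clean_record-style helper.
    """
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}


def drop_fields(*fields):
    """Build a post-do function that removes the given keys."""
    def _drop(record):
        return {k: v for k, v in record.items() if k not in fields}
    return _drop


# The series of functions is defined once, at the point where the model is created.
conferences = PostDoConverter(
    post_do=[strip_empty, drop_fields("_loose_411_elements")]
)

# Illustrative input only; a real blob would come from create_record(...).
result = conferences.do(
    {"series": "Gordon", "keywords": [], "_loose_411_elements": ["3"]}
)
```

With this arrangement, tests and other callers only call `conferences.do(...)`, and the cleanup steps no longer have to be chained by hand at every call site.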
Proposal
It is proposed to add a new feature to doJSON: post-do functions, which would be called after a document has been converted from XML to JSON (that is, after all related elements in the XML have been processed).
Why?
Two use cases from Inspire:
I. After every conversion from MARC to JSON we use our utility function clean_record to strip empty values from the record and to remove duplicate elements from lists. This function appears many times in tests. If we could configure it globally in one place, that would make things easier for us (and would save many lines of code, too).

II. Data from legacy is sometimes quite messy. Imagine, for example, a bit of MARC in which two separate 411 datafields each carry a single subfield, one n and the other a.

Subfield n is a conference series number, whereas a stands for the conference series name. In the example above, they both refer to the same conference series, so we'd like to convert this piece of MARC into a single key-value pair in our JSON.

As far as I understand current doJSON, we cannot retrieve all 411 occurrences inside one rule function call, which makes the above example a bit complicated. The current workaround looks like this: when we encounter a single n or a single a, we store it in a temporary field and make no other changes to the JSON inside the rule function. Once we encounter a single n or a again, we check whether the temporary field contains a value of the opposite kind and, if so, couple them and return the pair from the rule function.

Now, at the end of this process we'd like to get rid of all those temporary fields. Post-do functions would make that possible and easy to configure.
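
The workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Inspire's rule code. Real DoJSON rules are registered on a model and have a different signature. The function names and the output keys `series`, `name` and `number` are hypothetical. Only the temporary key `_loose_411_elements` comes from the test example in the thread.

```python
TEMP_FIELD = "_loose_411_elements"  # temporary key, as used in the thread's test


def conference_series_rule(record, subfields):
    """Process one 411 datafield; `subfields` maps subfield codes to values.

    Hypothetical plain function standing in for a DoJSON rule.
    """
    if "n" in subfields and "a" in subfields:
        # Both parts arrived together: nothing to pair up.
        record.setdefault("series", []).append(
            {"name": subfields["a"], "number": subfields["n"]}
        )
        return
    if "n" not in subfields and "a" not in subfields:
        return  # nothing relevant in this datafield

    code = "n" if "n" in subfields else "a"
    other = "a" if code == "n" else "n"
    loose = record.setdefault(TEMP_FIELD, [])

    # Look for an earlier loose element of the opposite kind.
    for i, (seen_code, seen_value) in enumerate(loose):
        if seen_code == other:
            del loose[i]
            pair = {code: subfields[code], other: seen_value}
            record.setdefault("series", []).append(
                {"name": pair["a"], "number": pair["n"]}
            )
            return

    # No partner yet: remember this element for a later datafield.
    loose.append((code, subfields[code]))


def drop_temp_fields(record):
    """Post-do step: remove the temporary bookkeeping field."""
    record.pop(TEMP_FIELD, None)
    return record


# Two separate 411 datafields, mirroring the snippet from the test.
record = {}
for datafield in [{"n": "3"}, {"a": "Gordon"}]:
    conference_series_rule(record, datafield)
record = drop_temp_fields(record)
```

The last call is the step that a post-do function would take over: without it, the empty temporary field stays in the converted record.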