inveniosoftware / dojson

Simple pythonic JSON to JSON converter.
https://dojson.readthedocs.io

utils.marc21.create_record does not support partial records #198

Closed david-caro closed 6 years ago

david-caro commented 6 years ago

When creating a record from just a snippet of XML, that is, XML without the enclosing <record>..</record> tags, you get a partial result that is missing some of the nodes. For example:

In [3]: print marcxml
        <record>
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">Published</subfield>
        </datafield>
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">citeable</subfield>
        </datafield>
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">HEP</subfield>
            <subfield code="a">NONCORE</subfield>
        </datafield>
        </record>

In [4]: marcxml2 = '\n'.join(marcxml.splitlines()[1:-1])

In [5]: print marcxml2
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">Published</subfield>
        </datafield>
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">citeable</subfield>
        </datafield>
        <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">HEP</subfield>
            <subfield code="a">NONCORE</subfield>
        </datafield>

In [6]: utils.create_record(marcxml, keep_singletons=False)
Out[6]: 
GroupableOrderedDict([('__order__', ('980__', '980__', '980__')),
                      ('980__',
                       (GroupableOrderedDict([('__order__', ('a',)),
                                              ('a', 'Published')]),
                        GroupableOrderedDict([('__order__', ('a',)),
                                              ('a', 'citeable')]),
                        GroupableOrderedDict([('__order__', ('a', 'a')),
                                              ('a', ('HEP', 'NONCORE'))])))])

In [7]: utils.create_record(marcxml2, keep_singletons=False)
Out[7]: 
GroupableOrderedDict([('__order__', ('980__',)),
                      ('980__',
                       GroupableOrderedDict([('__order__', ('a',)),
                                             ('a', 'Published')]))])

I'm not sure this is actually an issue, since I don't know whether the 'record' tags are required. But we were using it to test snippets of code and ran into this :/

david-caro commented 6 years ago

So @michamos just explained this to me: well-formed XML must have exactly one root element, so the XML without the 'record' element and with several top-level 'datafield' ones is not really valid XML.
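
For anyone testing with snippets, a minimal sketch of the workaround: wrap the fragment in a single <record> root before parsing. This uses only the standard library's strict parser to show the difference. `wrap_in_record` is a hypothetical helper, not part of dojson, and the sketch assumes the fragment has no XML declaration of its own.

```python
import xml.etree.ElementTree as ET


def wrap_in_record(fragment):
    """Wrap a MARCXML fragment in a single <record> root (hypothetical helper)."""
    return "<record>\n{}\n</record>".format(fragment)


fragment = """
<datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">Published</subfield>
</datafield>
<datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">citeable</subfield>
</datafield>
"""

# A strict parser rejects the bare fragment because it has
# more than one top-level element.
try:
    ET.fromstring(fragment)
    fragment_is_well_formed = True
except ET.ParseError:
    fragment_is_well_formed = False

# Once wrapped, the document has exactly one root and every datafield is seen.
root = ET.fromstring(wrap_in_record(fragment))
datafields = root.findall("datafield")
values = [sf.text for df in datafields for sf in df.findall("subfield")]
```

The wrapped string is what you would then pass to create_record, as in In [6] above.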