inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

BibFormat: problems with format_record() and live incomplete MARCXML #251

Closed tiborsimko closed 10 years ago

tiborsimko commented 10 years ago

Originally on 2010-08-18

There are some problems when using format_record() on MARCXML snippets that lack some of the expected fields and/or a real record ID (tag 001).

1) A small problem is that recID cannot really be None if MARCXML is passed, since it leads to tracebacks in statements like:

register_exception(prefix="An error occured while formatting record %i in %s" % \
                   (recID, of),
                   alert_admin=True)

We can live with this by passing a fake recID, but we should probably document that in the docstring.
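The traceback in point 1 comes from the `%i` conversion, which rejects None. Below is a minimal sketch of the failure and of one possible guard. The helper name `describe_record` is hypothetical and not part of BibFormat; it only illustrates building the message without assuming an integer recID.

```python
def describe_record(recID, of):
    """Build the error prefix without assuming recID is an integer.

    Hypothetical helper for illustration; not part of BibFormat.
    """
    # %s accepts any value, whereas %i raises TypeError for None.
    rec_label = recID if recID is not None else "<no recID>"
    return "An error occurred while formatting record %s in %s" % (rec_label, of)


# The original pattern fails as soon as recID is None:
try:
    "formatting record %i in %s" % (None, "xe")
    raised = False
except TypeError:
    raised = True
```

With this kind of guard the error report itself no longer fails when only a MARCXML snippet is available.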

2) The real problem is that some output formats such as EndNote and RefWorks seem to assume the presence of many fields. That assumption does not hold for e.g. external items in baskets, which have only a handful of fields defined and do not even have 001.

A simple test case that fails:

z = """<?xml version="1.0" encoding="UTF-8"?>
  <record>
    <controlfield tag="001">1234</controlfield>
    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">Doe, J</subfield>
    </datafield>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">On the foo and bar</subfield>
    </datafield>
  </record>"""
from invenio.bibformat import format_record
format_record(1234, 'xe', xml_record=z, on_the_fly=True)

A test value that works:

z = format_record(1,'xm')

but eliminate 001 from the snippet and it will stop working:

z = z.replace('<controlfield tag="001">1</controlfield>\n  ','')

The typical error is:

In [19]: format_record(1234, 'xe', xml_record=z, on_the_fly=True)
Entity: line 2: parser error : Start tag expected, '<' not found

^
Out[19]: '<abbr class="unapi-id" title="1234"></abbr>\n'
tiborsimko commented 10 years ago

Originally on 2010-08-18

More info for clarification. While the MARC standard seems to suggest that 001 is mandatory, the MARCXML schema seems to allow omitting it, which suits records that live not as stored records but as MARCXML snippets. So we either have to be prepared for those, or we should require passing a fake record ID such as 0.
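To make the "be prepared for those" option concrete, here is a rough sketch that reads tag 001 from a snippet and falls back to a placeholder ID when it is missing. It uses only the standard library; the function name `record_id_from_marcxml` and the fallback value 0 are assumptions for illustration, not existing BibFormat API.

```python
import xml.etree.ElementTree as ET

FALLBACK_RECID = 0  # assumed placeholder ID for snippet-only records


def record_id_from_marcxml(marcxml, fallback=FALLBACK_RECID):
    """Return the integer held in controlfield 001, or `fallback` if absent."""
    root = ET.fromstring(marcxml)
    for element in root.iter():
        # Strip a namespace prefix such as "{http://www.loc.gov/MARC21/slim}".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "controlfield" and element.get("tag") == "001":
            text = (element.text or "").strip()
            if text.isdigit():
                return int(text)
    return fallback
```

A caller could use the returned value as the recID argument, so that snippet-only records such as external basket items still get a well-defined (if fake) identifier.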

In any case, EndNote and some other output formats currently ignore the MARCXML snippet passed via the xml_record argument to format_record() and seem to rely solely on fetching the information via recID, which cannot work for snippet-only records such as external basket items. Example:

In [20]: z = "foo bar blah"

In [21]: format_record(1, 'xe', xml_record=z, on_the_fly=True)
Out[22]: '<abbr class="unapi-id" title="1"></abbr>\n<strong>ALEPH experiment: Candidate of Higgs boson production</strong>  / <a href="http://pcuds33.cern.ch/search?ln=en&amp;p=Photolab&amp;f=author">Photolab</a>;  14 06 2000.<br /><small>Candidate for the associated production of the Higgs boson and Z boson. [...]</small><br /><small class="note"><a class="note" href="http://pcuds33.cern.ch/record/1/files/0106015_01.jpg">http://pcuds33.cern.ch/record/1/files/0106015_01.jpg</a></small><br /><small class="note"><a class="note" href="http://pcuds33.cern.ch/record/1/files/0106015_01.gif?subformat=icon">http://pcuds33.cern.ch/record/1/files/0106015_01.gif?subformat=icon</a></small>'

In [23]: format_record(2, 'xe', xml_record=z, on_the_fly=True)
Out[23]: '<abbr class="unapi-id" title="2"></abbr>\n<strong>The first CERN-built module of the barrel section of ATLAS\'s electromagnetic calorimeter</strong>  / <a href="http://pcuds33.cern.ch/search?ln=en&amp;p=Patrice+Lo%C3%AFez&amp;f=author">Patrice Lo\xc3\xafez</a>;  10 Apr 2001.<br /><small>Behind the module, left to right Ralf Huber, Andreas Bies and Jorgen Beck Hansen. [...]</small><br /><small class="note"><a class="note" href="http://pcuds33.cern.ch/record/2/files/0104007_02.jpeg">http://pcuds33.cern.ch/record/2/files/0104007_02.jpeg</a></small><br /><small class="note"><a class="note" href="http://pcuds33.cern.ch/record/2/files/0104007_02.gif?subformat=icon">http://pcuds33.cern.ch/record/2/files/0104007_02.gif?subformat=icon</a></small>'
jeromecaffaro commented 10 years ago

Originally on 2010-08-23

The limitation is mostly due to the support of extension functions in BibFormat XSL: http://invenio-demo.cern.ch/help/admin/bibformat-admin-guide#xslFormatTemplate

- fn:modification_date(recID)
- fn:creation_date(recID)
- fn:eval_bibformat(recID, bibformat_template_code)

The first two functions need the recid to retrieve this information in an XSL context. They could be extended to a) not fail if no recID is given and/or b) retrieve this information from baskets if possible (did they use to have something like a negative "recid"?). In some cases, not having this information would not make sense anyway and would produce invalid output (e.g. RSS output and its <pubDate> tag).

The last function, which lets any BibFormat template/element run inside XSL templates, is a bit trickier to fix. Though the recid is only used to instantiate a BibFormatObject (bfo), which could very well be instantiated from an XML snippet too, it might be impossible to access the currently processed XML from the eval_bibformat(..) function. If that is not possible, one could add a new "marcxml" parameter to the function, which would be provided from the template itself: fn:eval_bibformat(recID, bibformat_template_code, marcxml)

<xsl:value-of select="fn:eval_bibformat(marc:controlfield[@tag='001'],'&lt;BFE_SERVER_INFO var=&quot;recurl&quot;>',marc:.)" />

This might have some impact on speed though, and might not be possible in all cases.

Other alternatives, which can be combined:

  1. whenever an XSL template is processed without a recid, do not process the above extension functions.
  2. Introduce a new type of extension function that relies on neither recid nor marcxml, and that could be used to process data more easily than in XSL (e.g. date formatting)
  3. Update the current XSL templates to not use extension functions.
  4. Move back to bfx templates...
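
Alternative 1 could look roughly like the sketch below: a wrapper that makes a recID-based extension function return an empty string when no recID is supplied. All names (`skip_without_recid`, `FAKE_CREATION_DATES`) are hypothetical, and the lookup is a stand-in for the real metadata query, so this shows the guard pattern only and not the actual BibFormat XSLT engine code.

```python
import functools

# Made-up data standing in for the real record metadata lookup.
FAKE_CREATION_DATES = {1: "2000-06-14"}


def skip_without_recid(func):
    """Make a recID-based extension function a no-op when recID is missing."""
    @functools.wraps(func)
    def wrapper(recID, *args):
        # An XSLT engine may hand over the value as a string, so treat
        # None and blank strings alike as "no recID".
        if recID is None or str(recID).strip() == "":
            return ""
        return func(recID, *args)
    return wrapper


@skip_without_recid
def creation_date(recID):
    """Stand-in for the recID-based creation date lookup."""
    return FAKE_CREATION_DATES.get(int(recID), "")
```

The template then renders an empty value for snippet-only records instead of failing, which matches the "do not process non-applicable elements" idea but may still yield invalid output where the value is mandatory.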
tiborsimko commented 10 years ago

Originally on 2010-08-23

Replying to [comment:2 jcaffaro]:

  1. whenever an XSL template is processed without a recid, do not process the above extension functions.

I think a simple and speedy solution of this kind may be sufficient for most use cases. ("do not process non-applicable elements")

Otherwise a more generic solution would be to extend eval_bibformat to accept a MARCXML snippet argument, as you proposed, but that would not fully work for extracting non-MARC information anyway. (We would have to introduce more FFT-like elements.) I think we don't have to go this way unless somebody has concrete use cases at hand.

BTW, note that my "foo bar blah" example in one of the comments above shows that the recID argument takes precedence over the xml_record argument, which runs counter to the expected behaviour of format_record() as well as to its docstring. This should be investigated and fixed at the same time.
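
The expected precedence can be stated as a small sketch: an explicitly passed snippet wins, and recID is consulted only when no snippet is given. The function `pick_record_source` is hypothetical and only illustrates the intended rule; it is not how format_record() is implemented.

```python
def pick_record_source(recID=None, xml_record=None):
    """Decide where the record data should come from.

    Returns ("xml", snippet) when a MARCXML snippet is passed, otherwise
    ("db", recID). Hypothetical helper, for illustration only.
    """
    if xml_record is not None:
        # The snippet takes precedence even when a recID is also given.
        return ("xml", xml_record)
    if recID is not None:
        return ("db", recID)
    raise ValueError("either recID or xml_record is required")
```

Under this rule, passing "foo bar blah" as xml_record would surface a parse problem for that snippet rather than silently formatting the stored record 1 or 2.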

invenio-developers commented 10 years ago

Originally by Jerome Caffaro jerome.caffaro@cern.ch on 2011-03-25

In [aa54dcd4aee1d0f04934cfbe8f9be06f0247ac8e]:

#CommitTicketReference repository="" revision="aa54dcd4aee1d0f04934cfbe8f9be06f0247ac8e"
BibFormat: fix XSLT formatting of MARCXML snippets

- Fix formatting of MARCXML given as parameter ("xml_record"
  parameter, instead of records specified by ID with "recID"
  parameter) when using XSL templates. (fixes #251)

- Improve docstrings.
invenio-developers commented 10 years ago

Originally by Jerome Caffaro jerome.caffaro@cern.ch on 2012-02-15

In [66b3ff115e2f098b5c2c86d89439d3d0476c6d18]:

#CommitTicketReference repository="" revision="66b3ff115e2f098b5c2c86d89439d3d0476c6d18"
BibFormat: fix XSLT formatting of MARCXML snippets

- Fix formatting of MARCXML given as parameter ("xml_record"
  parameter, instead of records specified by ID with "recID"
  parameter) when using XSL templates. (fixes #251)

- Improve docstrings.
invenio-developers commented 10 years ago

Originally by Jerome Caffaro jerome.caffaro@cern.ch on 2012-08-09

In 66b3ff115e2f098b5c2c86d89439d3d0476c6d18:

#CommitTicketReference repository="" revision="66b3ff115e2f098b5c2c86d89439d3d0476c6d18"
BibFormat: fix XSLT formatting of MARCXML snippets

- Fix formatting of MARCXML given as parameter ("xml_record"
  parameter, instead of records specified by ID with "recID"
  parameter) when using XSL templates. (fixes #251)

- Improve docstrings.