Investigate a deterministic approach for extracting article meta data

GiancarloFusiello commented 5 years ago

Currently, getting metadata such as the article ID from a JATS article can be difficult because the same value can be captured in numerous ways.

Article metadata should be retrievable in a deterministic way for Libero Publisher services.

Task

Investigate the best method for extracting JATS article metadata, and agree on a consistent naming convention for the extracted fields.

fred-atherden commented 5 years ago

Suggest that we derive article metadata via XSLT 1.0, and capture this in RDF/XML format, which I understand is a standard used by some publishers (for this purpose and others).

A simple XSLT like the one attached would apply fixed rules to derive each metadata field. For example, the id for an article could be determined in XSL with the following logic:

if article-id[@pub-id-type='publisher-id'] then that
elif elocation-id then that
else doi

This id could then be found with the XPath //dc:identifier in the output RDF file.
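
For illustration only, the same fallback order can be sketched in Python rather than XSLT. This is a minimal sketch under assumptions that are not part of the proposal: the function name `extract_article_id` is hypothetical, the elements are assumed to sit under `article-meta` without a namespace, and the DOI is assumed to be stored as `article-id` with `pub-id-type="doi"`.

```python
import xml.etree.ElementTree as ET

# Candidate locations, checked in the priority order described above.
# The article-meta location and the DOI path are assumptions about the input.
ID_CANDIDATES = (
    ".//article-meta/article-id[@pub-id-type='publisher-id']",
    ".//article-meta/elocation-id",
    ".//article-meta/article-id[@pub-id-type='doi']",
)


def extract_article_id(jats_xml):
    """Return the first non-empty candidate id, or None if none is present."""
    root = ET.fromstring(jats_xml)
    for path in ID_CANDIDATES:
        node = root.find(path)
        if node is not None and node.text and node.text.strip():
            return node.text.strip()
    return None
```

Because the candidates are tried in a fixed order, the same input always yields the same id, which is the deterministic behaviour the issue asks for.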

An alternative is to architect a (fit-for-purpose) metadata content model in XML or similar.

fred-atherden commented 5 years ago

Here's an example of a really simple XML model for metadata that I just made up (as an alternative to RDF). Running this XSLT on the XML for this article would output the following:

<contentItem xmlns:libero="https://libero.pub/namespace">
   <article-id>00086</article-id>
   <title>Taxation of married couples in Germany and the UK</title>
   <subtitle>One-Earner Couples Make the Difference</subtitle>
   <doi>10.34196/ijm.00086</doi>
   <subject type="display-channel">Research article</subject>
   <subject type="heading">Taxes and benefits</subject>
   <volume>6</volume>
   <issue>3</issue>
   <fpage>2</fpage>
   <lpage>20</lpage>
   <journal-title>International Journal of Microsimulation</journal-title>
   <issn>1747-5864</issn>
   <publisher>International Microsimulation Association</publisher>
</contentItem>

@GiancarloFusiello, what are your thoughts on this approach (using XSL to output an XML file [RDF or similar] which can be treated as the source of truth for processing)? It would also be useful to know what you mean by article metadata - for example, does this include resolving URIs to assets/figures? There are limitations on what XSLT (1.0) will be able to do in that regard.

What kind of metadata do you need picked out here?

GiancarloFusiello commented 5 years ago

@FAtherden-eLife typically, we would discuss the approaches you've put forward in a meeting with the rest of the team and decide on an implementation approach based on the discussion from that meeting.

Personally, I prefer the second approach for its simplicity, which makes it easier to serialise into other formats such as JSON or a database table.
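
As a rough sketch of that serialisation step, a flat `contentItem` document could be turned into JSON with the Python standard library. The function name `content_item_to_dict` and the way repeated or attributed elements are represented (a list, and a dict with a `value` key) are assumptions for illustration, not an agreed format.

```python
import json
import xml.etree.ElementTree as ET


def content_item_to_dict(xml_text):
    """Map each child of the root element to a key in a plain dict.

    Elements with attributes become {"<attr>": ..., "value": text};
    tags that appear more than once are collected into a list.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for child in root:
        text = (child.text or "").strip()
        value = {**child.attrib, "value": text} if child.attrib else text
        if child.tag in result:
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# A shortened copy of the contentItem example from this thread.
SAMPLE = """<contentItem xmlns:libero="https://libero.pub/namespace">
   <article-id>00086</article-id>
   <subject type="display-channel">Research article</subject>
   <subject type="heading">Taxes and benefits</subject>
   <volume>6</volume>
</contentItem>"""

if __name__ == "__main__":
    print(json.dumps(content_item_to_dict(SAMPLE), indent=2))
```

The same dict could equally be mapped to the columns of a database table once the field names are agreed.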

Regarding metadata, it should be any information found in the XML that isn't the content itself. We can agree on the names of fields as a team, but what you have done so far looks really good to me.

libero / publisher

Investigate a deterministic approach for extracting article meta data #226