stencila / encoda

↔️ A format converter for Stencila documents
https://stencila.github.io/encoda/
Apache License 2.0

JATS: Plain text citation #873

Open rgieseke opened 3 years ago

rgieseke commented 3 years ago

Mixed citations (e.g. originally from bibitems in LaTeX) are not parsed when reading from JATS-XML.

I think the mixed-content should actually be mixed-citation: https://github.com/stencila/encoda/blob/master/src/codecs/jats/index.ts#L1165

Even nicer would probably be to have all elements of the mixed-citation as inline elements to keep e.g. parts in italics.

<mixed-citation>Norman de Plume <italic>The book of XML problems</italic>. XtraPress 2021.</mixed-citation>

Maybe this could be filed as description or comment?

https://schema.stenci.la/creativework
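
To illustrate the idea of keeping the parts of a mixed-citation as inline elements, here is a minimal sketch. It is not how Encoda's JATS codec works: the `Inline` and `Emphasis` types are simplified stand-ins for the Stencila schema nodes, `mixedCitationToInlines` is a hypothetical helper name, and the regex only handles flat, non-nested `<italic>` elements rather than doing real XML parsing.

```typescript
// Simplified stand-ins for Stencila schema inline nodes (assumed shapes).
type Inline = string | Emphasis
interface Emphasis {
  type: 'Emphasis'
  content: Inline[]
}

/**
 * Hypothetical helper: split a <mixed-citation> into inline nodes,
 * turning <italic> runs into Emphasis nodes and keeping other text as strings.
 */
function mixedCitationToInlines(xml: string): Inline[] {
  // Drop the wrapping <mixed-citation ...> tags if present.
  const inner = xml
    .replace(/^\s*<mixed-citation[^>]*>/, '')
    .replace(/<\/mixed-citation>\s*$/, '')

  const inlines: Inline[] = []
  const italic = /<italic>([\s\S]*?)<\/italic>/g
  let last = 0
  let match: RegExpExecArray | null
  while ((match = italic.exec(inner)) !== null) {
    // Plain text before this italic run.
    if (match.index > last) inlines.push(inner.slice(last, match.index))
    inlines.push({ type: 'Emphasis', content: [match[1]] })
    last = match.index + match[0].length
  }
  // Plain text after the last italic run.
  if (last < inner.length) inlines.push(inner.slice(last))
  return inlines
}
```

Applied to the example above, this would yield a string, an `Emphasis` node for the book title, and a trailing string, so the italics survive the conversion.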

nokome commented 3 years ago

I think the ideal way to handle a <mixed-citation> would be to try to "decode" it (i.e. parse it) into a CreativeWork. If we do that, then in-text Cite nodes will work as expected (i.e. show authors and years if needed).

With the fix that you made, the entirety of the <mixed-citation> ends up in the title:

type: Article
id: pone-0091296-Choat1
authors: []
title: >-
      Choat JH (2012) Spawning aggregations in reef fishes; ecological and
      evolutionary processes. In: Sadovy de Mitcheson Y, Colin PL, editors. Reef
      Fish Spawning Aggregations: Biology, Research and Management. Heidelberg:
      Springer. pp. 85–116.

whereas what we want is the bibliographic info to be parsed out of the <mixed-citation> into

type: CreativeWork
authors:
  - type: Person
    familyNames:
      - Choat
    givenNames:
      - John Howard
datePublished:
  type: Date
  value: '2011-09-20'
identifiers:
  - type: PropertyValue
    name: doi
    propertyID: https://registry.identifiers.org/registry/doi
    value: 10.1007/978-94-007-1980-4_4
isPartOf:
  type: Periodical
  name: 'Reef Fish Spawning Aggregations: Biology, Research and Management'
publisher:
  type: Organization
  name: Springer Netherlands
title: Spawning Aggregations in Reef Fishes; Ecological and Evolutionary Processes
url: http://dx.doi.org/10.1007/978-94-007-1980-4_4

In Encoda, rather than trying to parse references into a CreativeWork, we take the approach suggested here and query CrossRef for bibliographic info. I didn't write the above YAML out by hand but rather used the crossref codec:

./encoda convert "Choat JH (2012) Spawning aggregations in reef fishes; evolutionary processes." --from crossref - --to yaml

I suggest that we use this approach for JATS <mixed-citation> (as we do in the reshape function). However, I think it would be wise to perhaps put it in name or alternateNames or similar (I think description should be avoided because that is where the abstract goes and in some cases we actually have that; and comment has a different semantic structure) and then do the CrossRef querying as a separate enrichment step that won't cause a failure, if for instance there is no network connection.
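
The first half of this suggestion, keeping the raw citation text at decode time without touching the network, could look roughly like the sketch below. The `CitationStub` type and the `citationToStub` helper are hypothetical names for illustration, and the choice of `name` (rather than `alternateNames`) is only one of the options mentioned above. Tags are stripped with a simple regex and XML entities are not decoded.

```typescript
// Hypothetical minimal shape: a CreativeWork that only carries the raw citation text.
interface CitationStub {
  type: 'CreativeWork'
  id?: string
  name: string
}

/**
 * Hypothetical helper: reduce a <mixed-citation> to plain text and store it
 * in `name`, leaving `title`, `authors` etc. unset for a later enrichment step.
 */
function citationToStub(xml: string, id?: string): CitationStub {
  const name = xml
    .replace(/<[^>]+>/g, '') // strip all tags, keeping their text content
    .replace(/\s+/g, ' ') // collapse whitespace left over from the markup
    .trim()
  return id === undefined
    ? { type: 'CreativeWork', name }
    : { type: 'CreativeWork', id, name }
}
```

Because nothing here depends on CrossRef, decoding would behave the same with or without a network connection.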

stencila-ci commented 3 years ago

:tada: This issue has been resolved in version 0.111.0 :tada:

Your semantic-release bot :package::rocket:

rgieseke commented 3 years ago

I suggest that we use this approach for JATS <mixed-citation> (as we do in the reshape function). However, I think it would be wise to perhaps put it in name or alternateNames or similar (I think description should be avoided because that is where the abstract goes and in some cases we actually have that; and comment has a different semantic structure) and then do the CrossRef querying as a separate enrichment step that won't cause a failure, if for instance there is no network connection.

Yes, I was mistakenly thinking that description belonged to the citation and not to the entire CreativeWork. The CrossRef querying approach sounds great. How could that work? Should it be an extra conversion, i.e. JATS to CrossRef-enhanced JATS? Or should it be attempted in the JATS codec?

nokome commented 3 years ago

Should it be an extra conversion? JATS to CrossRef enhanced JATS?

Yes, that is what I am advocating for above. It shouldn't be part of the decode method of the JatsCodec but rather part of a generic function that can be applied to the references of any Article, no matter which format it originated from. That is exactly what currently happens here in the reshape function, although there it is "converting" paragraphs into CreativeWorks using CrossRef:

https://github.com/stencila/encoda/blob/52b872ab4f600a7227b420424a8c59c2cd0305ef/src/util/reshape.ts#L341-L370

I think this code should be factored out into a separate enrich function and applied to Paragraphs in the references section, but also to string items in the references property of any CreativeWork (I had forgotten that string is a valid item in references).
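
A rough sketch of what such an enrich function might look like follows. It is not the existing reshape code: `enrichReferences`, `referenceText` and the `Lookup` type are hypothetical names, the node types are simplified stand-ins for the schema, and the CrossRef query is injected as a function so that the sketch makes no assumption about the crossref codec's API. Any lookup failure (for example, no network connection) leaves the original item untouched.

```typescript
// Simplified stand-ins for schema node types (assumed shapes).
interface Paragraph {
  type: 'Paragraph'
  content: unknown[]
}
interface CreativeWork {
  type: string
  [key: string]: unknown
}
type Reference = string | Paragraph | CreativeWork

// Hypothetical lookup, e.g. a wrapper around a CrossRef query.
type Lookup = (text: string) => Promise<CreativeWork | undefined>

function isParagraph(node: Reference): node is Paragraph {
  return typeof node !== 'string' && node.type === 'Paragraph'
}

// Plain text of a string or Paragraph reference; undefined for structured works.
function referenceText(ref: Reference): string | undefined {
  if (typeof ref === 'string') return ref
  if (isParagraph(ref)) {
    // Simplification: only top-level string content is used.
    return ref.content
      .filter((item): item is string => typeof item === 'string')
      .join('')
  }
  return undefined
}

/**
 * Try to replace string and Paragraph references with looked-up CreativeWorks.
 * Items that are already structured, not found, or whose lookup throws
 * are returned unchanged, so enrichment never causes a failure.
 */
async function enrichReferences(
  refs: Reference[],
  lookup: Lookup
): Promise<Reference[]> {
  return Promise.all(
    refs.map(async (ref) => {
      const text = referenceText(ref)
      if (text === undefined || text.trim() === '') return ref
      try {
        return (await lookup(text)) ?? ref
      } catch {
        return ref // e.g. offline: keep the plain-text reference
      }
    })
  )
}
```

Running the lookups in parallel keeps the sketch short. A real implementation might prefer to query sequentially or with rate limiting to be polite to the CrossRef API.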

In summary, what needs to happen if we take this direction is: