RFC: JATS support

thewilkybarkid commented 5 years ago

Problem

Libero's data model is planned to support schemas like JATS (see #11), but developing solely based on Libero's native schema is slow (as it doesn't exist yet). We have to model all the possibilities, which is not amenable to rushing (especially as we don't want to version it). For example, https://github.com/libero/libero/issues/5 would require a lot of schema work.

Users like eLife will need non-JATS content, but IJM only have scholarly content and have already investigated converting their archive to JATS.

Suggestion

Commit to supporting JATS now and prioritise it.
Continue to build up Libero's schema, based on the JATS support that is implemented (i.e. as non-blocking, possibly follow-up, work).

Concerns

What JATS to support. JATS4R have been making progress on recommendations, but these aren't comprehensive and might still be too open. DAR seems too strict.
How to support multiple versions of JATS (and flavours?), including the rumoured 2.0.
How to handle assets. eLife XML just references a TIF (without an actual URI), whereas we'd want a IIIF endpoint.
Complexity of supporting multiple schemas across all services. (This is an existing concern, but doing it now does bring it to the forefront.)

thewilkybarkid commented 5 years ago

Related Browser work at https://github.com/libero/browser/pull/30 and schema changes at https://github.com/libero/schemas/pull/14.

giorgiosironi commented 5 years ago

What JATS to support

Another way to see this issue is: when there is a discrepancy between two services (client/server, upstream/downstream like content-store and search), what is the source of truth to decide? JATS validity according to its DTD/RelaxNG/XSD is by definition too wide, so it seems we need to "build or buy" a schema anyway to validate inputs.

How to handle assets. eLife XML just references a TIF (without an actual URI), whereas we'd want a IIIF endpoint.

I'd assume the JATS used as the original input is not identical to the JATS served by the API, so there can be processing steps that substitute in URLs. The IIIF format is a strong dependency though, as it evolves more frequently than JATS (unstable should depend on stable rather than the other way around) and is not ubiquitous.
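As a rough illustration of such a processing step, here is a minimal Python sketch that rewrites bare asset references on JATS `<graphic>` elements into IIIF-style URLs. The base URL, the `article_id` parameter and the function name are all assumptions made for the example; none of them come from an existing Libero service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

XLINK_NS = "http://www.w3.org/1999/xlink"
XLINK_HREF = f"{{{XLINK_NS}}}href"

# Hypothetical IIIF base URL, used only for illustration.
IIIF_BASE = "https://iiif.example.org/assets"


def substitute_asset_urls(jats_xml: str, article_id: str) -> str:
    """Replace bare file references on <graphic> elements with full URLs."""
    # Keep the conventional "xlink" prefix when serialising.
    ET.register_namespace("xlink", XLINK_NS)
    root = ET.fromstring(jats_xml)
    for graphic in root.iter("graphic"):
        href = graphic.get(XLINK_HREF)
        # Leave anything that already looks like a URI untouched.
        if href and "://" not in href:
            # IIIF identifiers are URL-encoded, including any slashes.
            identifier = quote(f"{article_id}/{href}", safe="")
            graphic.set(XLINK_HREF, f"{IIIF_BASE}/{identifier}")
    return ET.tostring(root, encoding="unicode")


source = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<body><fig><graphic xlink:href="fig1.tif"/></fig></body>'
    "</article>"
)
result = substitute_asset_urls(source, "00001")
```

Because the substitution happens between the input and the API, the original JATS never needs to know about IIIF, which keeps the dependency on the less stable format out of the source documents.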

Complexity of supporting multiple schemas across all services

If we have a top-level Libero element acting as a wrapper, this should be dealt with by the same tooling that would validate Libero documents in general. This relies on integrating, for example, the RelaxNG JATS definitions into the schemas so that they fit with the Libero RelaxNG ones that are there now.

thewilkybarkid commented 5 years ago

I'd assume the JATS used as the original input is not identical to the JATS served by the API, so there can be processing steps that substitute in URLs.

Agreed. (Not restricted to JATS either, as all assets will probably get moved around etc.)

If we have a top level Libero element acting as a wrapper, this should be dealt with the same tooling that would validate Libero documents in general. Relies on integrating, for example, the RelaxNG JATS definitions into the schemas so that it fits with the Libero RelaxNG ones that are there now.

Embedding is quite simple. But "Complexity of supporting multiple schemas across all services" was meant to refer to actually using the data, e.g. Browser being able to convert different types of XML into consistent HTML, Search being able to index different types of XML.

giorgiosironi commented 5 years ago

eg Browser being able to convert different types of XML into consistent HTML, Search being able to index different types of XML.

Examples:

browser would have to read information (e.g. the title) from multiple possible places
search would have to index different kinds of services
issues (if it exists) could use mostly metadata rather than using the body of the article.
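To make the first example concrete, here is a minimal Python sketch of a title lookup that tries several locations in order. The JATS path follows the usual `front/article-meta/title-group` layout; the `libero` namespace URI and the `libero:title` element are placeholders for illustration, not the real Libero schema.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Placeholder namespace, assumed for this example only.
NAMESPACES = {"libero": "http://example.org/libero"}

# Candidate locations for the title, tried in order.
TITLE_PATHS = [
    "front/article-meta/title-group/article-title",  # JATS
    "libero:title",  # hypothetical native schema
]


def find_title(document_xml: str) -> Optional[str]:
    """Return the first title found in any known location, or None."""
    root = ET.fromstring(document_xml)
    for path in TITLE_PATHS:
        element = root.find(path, NAMESPACES)
        if element is not None:
            # itertext() flattens inline markup such as <italic>.
            return "".join(element.itertext()).strip()
    return None
```

Every additional schema adds another entry to `TITLE_PATHS`, and every service that reads the content needs a similar list for each field it uses, which is where the complexity comes from.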

I guess some very common information that can be used in other services, like id and title or authorLine, would be extracted into the Libero wrapper elements; services that only need to list or link to an article would be better off, while services that make use of the content necessarily have to support multiple formats. In practice in Continuum we had:

lax: the article store
elife-metrics: collecting views for a certain id
journal-cms: indexing articles to attach cover images
search: indexing all the content
observer: reporting; indexes all sorts of data, e.g. produces an RSS feed

Since the listing-based services would return ids only, they wouldn't necessarily need to know about JATS.

GiancarloFusiello commented 5 years ago

Sorry if this is very basic, but I'm trying to understand: what is the definition of Libero's data model?

thewilkybarkid commented 5 years ago

@GiancarloFusiello, essentially what's in https://github.com/libero/schemas. Rather than being one big schema, it's broken down into the core (i.e. the required part, which is as small as possible), then a whole load of extensions that you can enable (so the opposite of JATS, which is one massive schema that you have to cut down to the parts that you want). Currently there's only 1 extension (italic text) alongside the required parts (e.g. content item).

The walking skeleton has this in more detail: a bunch of schemas for different publishers comprised of a set of extensions, but with some customisations. So your schema for your content, sharing where possible but not blocked from doing anything.

giorgiosironi commented 5 years ago

For example, libero/libero#5 would require a lot of schema work.

This is what makes me inclined to :+1: this RFC: we can build the current features now with a borrowed schema (some version of JATS) and introduce a different (Libero) schema when we know more about the complexity of putting all the services together.

de-code commented 5 years ago

What JATS to support. JATS4R have been making progress on recommendations but isn't comprehensive and might still be too open. DAR seems too strict.

Could the default be DAR? It seems DAR will need to be extended to cater for missing use cases. Or do you mean it's too strict in what the structure should look like? If it's the latter, and the JATS served by the API, could we do an up-front transformation step like Giorgio suggested, to make it DAR?

There may already be existing efforts to normalise JATS. Patrice from GROBID has, for example, created Pub2TEI (the output here is obviously TEI, which we could also convert back to JATS if we wanted to, but that might make the pipeline more complicated).

Using TEI altogether could be another option. It may not cater for IJM, although with a translation tool like Pub2TEI it might?

*None of the above is meant to favour one standard over the other.

Having the option to use an existing "standard" seems to make sense to me.

Melissa37 commented 5 years ago

Could the default be DAR? It seems DAR will need to be extended to cater for missing use cases. Or do you mean it's too strict in how the structure should look like? If it's the latter, and the JATS served by the API, could we do an up-front transformation step like Giorgio suggested, to make it DAR?

DAR is very strict and is being developed for an editing tool, so decisions are made based on getting one product ready for use. Because of the decisions being made for it, it could likely alienate 50% of publishers because their XML decisions would not work in it. One example is how authors and affiliations are linked to each other.

JATS seems like a good standard to work with, as most publishers who create full text are familiar with it, and the learning curve of a new standard may be off-putting.

IMO :-)

libero / community

RFC: JATS support #21
