kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

NLM/JATS conversion? #98

Open axfelix opened 8 years ago

axfelix commented 8 years ago

Hi folks,

I'm impressed by how much Grobid's performance has improved lately, but it's difficult for us to use because we mostly target JATS output, and the flavour of TEI that Grobid outputs doesn't seem to work well with any of the existing TEI -> NLM XSLTs that I can find.

It doesn't seem like it'd be terribly complicated to write a new XSLT and I might be willing to take this on, but could you clarify which version you're targeting?

kermitt2 commented 8 years ago

Hello!

TEI does not provide one unambiguous encoding for a document, but a set of possible encodings. So we are using a TEI customization as defined by the schemas under grobid-home/schemas (dtd, rng and xsd are provided). The customization indicates which encoding is used for which document structure.

For us, TEI is a good ingestion format, because it is very comprehensive and can be extended easily. We have created a set of style sheets to convert various heterogeneous publishers' XML formats (including NLM) into the same TEI as generated by GROBID; see the complementary project https://github.com/kermitt2/Pub2TEI

Of course, having the possibility to convert this TEI into NLM/JATS would be a very valuable addition, because JATS is an excellent exchange format! So if you write such an XSLT, this would be a great addition to GROBID or the above Pub2TEI project.

kermitt2 commented 8 years ago

Actually, I should have pointed you to the documentation, which is quite detailed about the GROBID TEI specification: http://grobid.readthedocs.org/en/latest/TEI-encoding-of-results/

Not only does nobody read the docs, I even forgot myself that I had written them :dancers:

lfoppiano commented 8 years ago

@axfelix have you got any news on this subject? Would be really nice to have JATS conversion on GROBID and we are happy to assist ;)

axfelix commented 8 years ago

I have a 100-line XSLT just targeting the front matter for now... not the most significant contribution, but it does work :)

I need to get back to this sometime soon.

lfoppiano commented 8 years ago

@axfelix it's a good start. ;-) Feel free to submit a PR for it and we could integrate it (to be checked with @kermitt2).

axfelix commented 8 years ago

https://www.dropbox.com/s/e55mnc7i3stgch0/grobid-jats.xsl?dl=0

Not ready for a PR yet unless you're really keen to integrate it, but I'll let you know when I have more.

lfoppiano commented 8 years ago

Hi @axfelix, I've implemented a Transformer that uses your XSLT to transform TEI to JATS. You can see some preliminary results (the entry point is not committed, because it was just a test):

[screenshot, 2016-08-11 23:00:10]

I've also provided a test with some sample documents, which can be helpful for testing purposes.

At the moment, the change is pretty transparent because the transformer is not yet used. Looking at the results, I've seen several small mistakes (e.g. article-title) that would need to be corrected. Would you mind taking care of them?

Once the XSLT is stable, it will be very easy to integrate it into the REST interface.
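
For readers following along, here is a minimal sketch of the kind of front-matter mapping under discussion. It is not the XSLT from this thread and not GROBID's Transformer: it uses Python's standard `xml.etree.ElementTree` in place of XSLT, and the function name `tei_front_to_jats` and the sample document are made up for illustration. It assumes the title sits in the TEI header's `titleStmt` and the authors under `sourceDesc`, and it only covers the title and author names.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_front_to_jats(tei_xml: str) -> str:
    """Map the TEI header title and author names to a minimal JATS <front>.

    Hypothetical helper for illustration only; a real conversion covers far more.
    """
    tei = ET.fromstring(tei_xml)
    article = ET.Element("article")
    meta = ET.SubElement(ET.SubElement(article, "front"), "article-meta")

    # TEI titleStmt/title -> JATS title-group/article-title
    title = tei.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
    if title is not None:
        group = ET.SubElement(meta, "title-group")
        ET.SubElement(group, "article-title").text = "".join(title.itertext()).strip()

    # TEI author/persName -> JATS contrib-group/contrib/name
    authors = tei.findall(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS)
    if authors:
        contribs = ET.SubElement(meta, "contrib-group")
        for pers in authors:
            contrib = ET.SubElement(contribs, "contrib", {"contrib-type": "author"})
            name = ET.SubElement(contrib, "name")
            surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
            given = " ".join(f.text or "" for f in pers.findall("tei:forename", TEI_NS))
            ET.SubElement(name, "surname").text = surname
            ET.SubElement(name, "given-names").text = given

    return ET.tostring(article, encoding="unicode")


# Made-up sample input, shaped like a TEI header with one author.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title level="a" type="main">A sample title</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""

print(tei_front_to_jats(SAMPLE_TEI))
```

Collecting the title with `itertext()` keeps any text nested in inline markup inside the TEI title, which is one plausible source of `article-title` glitches in a naive mapping.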

lfoppiano commented 6 years ago

Here is a (partial) version that can be used: https://github.com/elifesciences/sciencebeam/blob/develop/xslt/grobid-jats.xsl

de-code commented 5 years ago

Just thought it's worth mentioning that I have updated the ScienceBeam version to also include the full-text. It extracts table and figure labels / descriptions (not the actual table / figure content).

lfoppiano commented 5 years ago

@de-code perfect!! 👍 I will work on it cause here we need a consistent jats format

de-code commented 5 years ago

> @de-code perfect!! I will work on it cause here we need a consistent jats format

Sounds good. When are you likely going to work on it?

It might be worth noting that there isn't really one version of "JATS" (which is something JATS4R is trying to improve). Currently we are interested in the DAR JATS that works in Texture. Whether or not that will be your target, you'll probably need to decide which version of JATS to output.

lfoppiano commented 5 years ago

@de-code I'm flexible to follow the DAR JATS.

kermitt2 commented 5 years ago

Indeed, the problem with JATS is that the specifications are too loose and there are often several ways to encode something. Contrary to TEI with ODD, there's no way to further constrain the encoding to a single predictable form. Working with PMC full texts at scale is really painful because of that, I have to say!

The JATS stylesheets in Pub2TEI are an attempt to normalize all these JATS variants into one single "deterministic" TEI encoding, the same one as GROBID's. They still can't cover every JATS variant out there, but they're pretty good, I think. The interesting point is that they give "semantic" correspondences between the TEI and JATS encodings.

For testing, it could be nice to check how much of the source TEI produced by GROBID can be reproduced by applying Pub2TEI to the JATS conversion. The conversion might be destructive (information encoded in the TEI gets lost), and this would be a way to identify such losses.

lfoppiano commented 5 years ago

> For testing, it could be nice to check how much of the source TEI produced by GROBID can be reproduced by applying Pub2TEI to the JATS conversion. The conversion might be destructive (information encoded in the TEI gets lost), and this would be a way to identify such losses.

mmm... I did not understand this part, you mean in Pub2tei there is also a Tei2Jats transformation sheet?

kermitt2 commented 5 years ago

> mmm... I did not understand this part, you mean in Pub2tei there is also a Tei2Jats transformation sheet?

No, I mean that with a new TEI -> JATS conversion, we could use the existing Pub2TEI for JATS -> TEI and check what we lose from the first TEI to the second one (because they both follow the same TEI customization).
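
A rough sketch of how such a round-trip check could look, assuming both TEI documents are already available as strings (the original GROBID output, and the result of running it through TEI -> JATS and then Pub2TEI). The helper names `element_paths` and `lost_in_round_trip` are hypothetical and not part of GROBID or Pub2TEI. It only compares element structure, not text content or attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text: str) -> Counter:
    """Count every element path (e.g. /TEI/text/body/div) in a document."""
    counts = Counter()

    def walk(elem, prefix):
        # Drop the "{namespace}" prefix that ElementTree puts on tags.
        path = f"{prefix}/{elem.tag.rsplit('}', 1)[-1]}"
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def lost_in_round_trip(original_tei: str, round_tripped_tei: str) -> Counter:
    """Paths that occur more often in the original TEI than after the round trip."""
    # Counter subtraction keeps only the positive differences.
    return element_paths(original_tei) - element_paths(round_tripped_tei)


# Made-up inputs: the round-tripped document has lost the section heading.
original = "<TEI><text><body><div><head>Intro</head><p>One</p><p>Two</p></div></body></text></TEI>"
round_tripped = "<TEI><text><body><div><p>One</p><p>Two</p></div></body></text></TEI>"

print(lost_in_round_trip(original, round_tripped))
```

Because both documents are supposed to follow the same TEI customization, any path reported here points at information that one of the two conversions dropped.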