kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Suggestion: Use an XML writer for TEIFormatter #550

Open de-code opened 4 years ago

de-code commented 4 years ago

Currently the TEIFormatter builds up the raw XML by hand. There are a lot of places where indentation is handled, values are encoded, etc. A side effect may also be that xmlns="http://www.tei-c.org/ns/1.0" is declared multiple times.

I would recommend an XML writer of some sort. There is already the XMLStreamWriter (I have no experience with it). But even a very simple wrapper around the StringBuilder would make the code much easier to read. Then you could even stream it if you like.

Some benefits I would see with that:

EDIT: Only noticed now that you do have the XmlBuilderUtils. So maybe it's just a case of using that more. I am not sure whether there were any concerns with that.
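To make the XMLStreamWriter option mentioned above concrete, here is a minimal sketch using only the JDK's `javax.xml.stream` API. It is not taken from GROBID: the class name `TeiStreamWriterSketch`, the helper `writeTitle` and the tiny element structure are assumptions made for illustration. It shows the writer escaping text content and the TEI namespace being declared once on the root element.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch, not GROBID code.
final class TeiStreamWriterSketch {
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Hypothetical helper: serializes a tiny TEI-like fragment holding one title.
    static String writeTitle(String title) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("TEI");
            // Declare the default namespace once, on the root element only.
            w.writeDefaultNamespace(TEI_NS);
            w.writeStartElement("teiHeader");
            w.writeStartElement("title");
            // writeCharacters escapes markup characters such as & and <.
            w.writeCharacters(title);
            w.writeEndElement(); // title
            w.writeEndElement(); // teiHeader
            w.writeEndElement(); // TEI
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch need no checked exception handling.
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeTitle("Cells & Genes <2020>"));
    }
}
```

Note that XMLStreamWriter does not indent by itself, so pretty-printing would still need separate handling.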

kermitt2 commented 4 years ago

Thanks Daniel!

This is something we tried to do quite early. The current version still uses xom to build some TEI segments (the XmlBuilderUtils class, indeed). In other projects, I am using the javax.xml.transform.stream.* package to serialize some TEI, and LSSerializer to serialize from a DOM representation. I've never used XMLStreamWriter.

So far my experience has not been good: namespaces badly propagated and appearing repeatedly (xom is actually the cause of the namespaces automatically repeated at various places in the current TEI), formatting that is hard to control (in particular when mixing text and inline tags), inconsistent handling of whitespace (the xml:space="preserve"), and problems serializing and combining XML segments.

Possibly it's just that I don't know how to use xom properly, but in the end I found the stupid string concatenation to be reliable, controllable and effective, despite the known disadvantages that you very correctly raised.

Another point to consider is that converting all the current TEI string concatenation into a kind of DOM/xom tree to be serialized, or into calls to a dedicated XML stream writer, would eat a lot of development time without added value for the end user, quite apart from the risk of introducing bugs.

de-code commented 4 years ago

Using DOM / XOM could be a bit of a change. You are probably getting some of the repeated namespaces because you are serializing parts of the final output separately (but some duplication is handcrafted). It's probably best to use DOM for the whole document. Maybe something that would work well in the interim is to write a simple wrapper around the string concatenation, while still allowing raw strings. That way you would still have full control and could use it where appropriate. And it should be low risk.

It is true that the benefits need to be weighed up. So maybe there is no strong case for changing all of the existing code. Maybe try it for some new extractions (e.g. I could maybe have used it for the raw_affiliations) or for things you were going to change anyway. Or @Vitaliy-1 could use it for his JATS support implementation.
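As a rough illustration of the simple wrapper around the string concatenation suggested above, the sketch below keeps a StringBuilder underneath, handles indentation and escaping in one place, and still accepts raw strings. Everything in it is hypothetical (the class `SimpleXmlBuilder`, its method names, the tab-based indentation and the sample element names); it is not an existing GROBID class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a thin wrapper over StringBuilder, not GROBID code.
final class SimpleXmlBuilder {
    private final StringBuilder sb = new StringBuilder();
    private final Deque<String> open = new ArrayDeque<>();

    // Opens an element; attrs are name/value pairs, values are escaped.
    public SimpleXmlBuilder start(String name, String... attrs) {
        indent();
        sb.append('<').append(name);
        for (int i = 0; i + 1 < attrs.length; i += 2) {
            sb.append(' ').append(attrs[i]).append("=\"")
              .append(escape(attrs[i + 1])).append('"');
        }
        sb.append(">\n");
        open.push(name);
        return this;
    }

    // Closes the most recently opened element.
    public SimpleXmlBuilder end() {
        String name = open.pop();
        indent();
        sb.append("</").append(name).append(">\n");
        return this;
    }

    // Writes <name>escaped text</name> on one line at the current depth.
    public SimpleXmlBuilder textElement(String name, String text) {
        indent();
        sb.append('<').append(name).append('>').append(escape(text))
          .append("</").append(name).append(">\n");
        return this;
    }

    // Escape hatch: appends an already formatted XML string unchanged.
    public SimpleXmlBuilder raw(String xml) {
        sb.append(xml);
        return this;
    }

    // One tab per currently open element.
    private void indent() {
        for (int i = 0; i < open.size(); i++) {
            sb.append('\t');
        }
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public String toString() {
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = new SimpleXmlBuilder()
                .start("affiliation", "key", "aff0")
                .textElement("orgName", "Physics & Chemistry")
                .raw("\t<note type=\"raw_affiliation\">hand-built segment</note>\n")
                .end()
                .toString();
        System.out.print(xml);
    }
}
```

Because `raw` passes strings through untouched, existing hand-built segments could be kept as they are and only new code would go through the escaping and indenting methods.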