Open de-code opened 7 years ago
It's a different problem indeed: #160 is about detecting that a token is a subscript or superscript.
Here, the extra space is a consequence of pdf2xml. In the XML format generated by pdf2xml, each token is normally separated by a space character, except when the font or size of two adjacent tokens differs. In that case we get two separate tokens and we don't know whether there is a space between them (the general rule) or not.
I think we could either try to guess in GROBID whether there is a space between the two tokens, for instance by looking at the coordinates and the average character spacing, or tackle it in pdf2xml, maybe by introducing an XML attribute that would clarify the spacing.
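To make the first option concrete, here is a rough sketch of what a coordinate-based guess could look like. It is not GROBID or pdf2xml code: the `Token` class, the `has_space_between` function and the `ratio` threshold of 0.3 are all made up for illustration, and the threshold would need tuning on real documents.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token: text plus horizontal position and width."""
    text: str
    x: float      # left edge of the token
    width: float  # total width of the token


def avg_char_width(token: Token) -> float:
    """Average width of one character in the token."""
    return token.width / max(len(token.text), 1)


def has_space_between(left: Token, right: Token, ratio: float = 0.3) -> bool:
    """Guess whether a space separates two adjacent tokens on the same line.

    The horizontal gap is compared with a fraction of the larger average
    character width, so that a small subscript font does not shrink the
    reference. The ratio is an assumed value, not a measured one.
    """
    gap = right.x - (left.x + left.width)
    reference = max(avg_char_width(left), avg_char_width(right))
    return gap > ratio * reference


# Made-up coordinates for "integrin", "α" and a subscript "2"
integrin = Token("integrin", x=50.0, width=40.0)
alpha = Token("α", x=93.0, width=6.0)
two = Token("2", x=99.2, width=3.5)

print(has_space_between(integrin, alpha))  # gap of 3.0, treated as a space
print(has_space_between(alpha, two))       # gap of 0.2, treated as no space
```

With these example coordinates the subscript stays attached to α even though its font size differs.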
It's a bit of a design issue in the XML format generated by pdf2xml. I don't know how the ALTO format, for instance, deals with this kind of case (todo: have a look!).
Just had a look at the ALTO format. It looks like they have a separate SP tag for that.
(Reference example here: https://github.com/altoxml/reference_samples)
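As a small illustration of why the explicit SP element helps, the sketch below rebuilds the text of one line from a hand-written ALTO-like fragment. The fragment and its coordinates are invented for this example (real ALTO files declare a namespace and carry more attributes), and `line_text` is a hypothetical helper, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of ALTO; coordinates are made up.
SAMPLE = """<TextLine>
  <String CONTENT="integrin" HPOS="50" WIDTH="40"/>
  <SP HPOS="90" WIDTH="3"/>
  <String CONTENT="α" HPOS="93" WIDTH="6"/>
  <String CONTENT="2" HPOS="99.2" WIDTH="3.5"/>
  <SP HPOS="102.7" WIDTH="3"/>
  <String CONTENT="by" HPOS="105.7" WIDTH="10"/>
  <SP HPOS="115.7" WIDTH="3"/>
  <String CONTENT="E7820" HPOS="118.7" WIDTH="25"/>
</TextLine>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}String' if present."""
    return tag.rsplit("}", 1)[-1]


def line_text(text_line: ET.Element) -> str:
    """Concatenate String contents, adding a space only where SP occurs."""
    parts = []
    for child in text_line:
        name = local_name(child.tag)
        if name == "String":
            parts.append(child.get("CONTENT", ""))
        elif name == "SP":
            parts.append(" ")
    return "".join(parts)


print(line_text(ET.fromstring(SAMPLE)))  # integrin α2 by E7820
```

Because no SP sits between the α and the 2, the two strings are joined without a space, regardless of any font or size change between them.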
Cermine is using the Trueviz format for training data annotations. It separates 'Zones' into 'Lines' into 'Words' and 'Characters'. ALTO seems similar in a way but is more flexible, so I might use that for my annotated training data. The alternative I was considering was an extended SVG. Peter's pdf2svg converts to SVG with character mapping (but leaves the text block detection for the next step). mupdf/tools also has an option to render as SVG (either as paths or text). But neither would add a space element like in ALTO.
Do you know of any tool that can convert PDF to ALTO already?
Thanks! ALTO does indeed look like a good choice. It is used by many national libraries because ABBYY FineReader (which is used by most mass digitization projects) can produce it. I have already received requests to make GROBID support ALTO as an input format.
The alternative, I think, would be hOCR, produced by some open source OCR engines like Tesseract. The specification introduces areas similar to ALTO's, including a space element, ocr_separator. The problem is that Tesseract, for instance, as far as I know only produces very poor hOCR output, without a separate space tag, without blocks, etc., so unfortunately the existing tools do not appear to exploit or support the whole specification.
hOCR might have the advantage of also covering XHTML, so it could also support HTML input (interesting for GROBID to support scientific articles in HTML) and some data present in PDF (annotations and metadata) that are not output by OCR and so are not in ALTO.
I didn't find any tools using the Trueviz format except Trueviz and CERMINE, and no OCR engine, which significantly limits its interest.
I saw some commercial tools that can convert PDF to ALTO, but nothing open source. There is a project, pdf2alto, but it only outputs word elements.
I think outputting the ALTO format (and/or hOCR) with pdf2xml, as an alternative to the current XML format, would not be difficult at all, just a bit time-consuming. It would be a nice addition for the community, I think. We need to keep in mind that representing PDF annotations and metadata (as supported by the current pdf2xml output and GROBID) would require extending the ALTO spec a bit (so via namespaces).
This is also from the first PubMed manuscript (Introduction): "...suppression of integrin α2 by E7820..."
The 2 after α is a subscript.
Currently an extra space is added: "...suppression of integrin α 2 by E7820..."
This may be related to https://github.com/kermitt2/grobid/issues/160 but seems to be a different problem. Just because the font or style differs does not mean that there should be a space.