kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Processing XML file from pdfalto #553

Open EysiW opened 4 years ago

EysiW commented 4 years ago

I am currently doing a project using pdfalto and I am curious to see how you use the XML files produced by pdfalto in the GROBID project.

kermitt2 commented 4 years ago

Hello @EysiW! You can have a look at the ALTO XML parser: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java Tokens are represented by LayoutToken objects (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/layout/LayoutToken.java), which gather the text and layout information for each token and are used in the machine learning sequence labelling.
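As a rough illustration of what such a handler does, here is a minimal SAX sketch that collects each ALTO String element together with its coordinates, the kind of information a LayoutToken-style object carries. This is a simplified sketch, not GROBID's actual PDFALTOSaxHandler; the Token class and the attribute handling are reduced to the essentials.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch: collect ALTO <String> elements (text + position) into
// simple token objects. Not GROBID's actual handler.
public class AltoTokenHandler extends DefaultHandler {

    public static class Token {
        public final String text;
        public final double x, y, width, height;
        Token(String text, double x, double y, double width, double height) {
            this.text = text; this.x = x; this.y = y;
            this.width = width; this.height = height;
        }
    }

    private final List<Token> tokens = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("String".equals(qName)) {
            tokens.add(new Token(
                atts.getValue("CONTENT"),
                num(atts.getValue("HPOS")),
                num(atts.getValue("VPOS")),
                num(atts.getValue("WIDTH")),
                num(atts.getValue("HEIGHT"))));
        }
    }

    private static double num(String v) {
        return v == null ? 0.0 : Double.parseDouble(v);
    }

    public static List<Token> parse(String altoXml) throws Exception {
        AltoTokenHandler handler = new AltoTokenHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(altoXml)), handler);
        return handler.tokens;
    }
}
```

The real handler additionally tracks pages, blocks, lines, and styles, but the core idea is the same: each ALTO String becomes one token carrying both its text and its layout geometry.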

EysiW commented 4 years ago

Do I understand it correctly if each token corresponds to a "STRING" from pdfalto? I have been using the attributes of "TextLine" in my classification problem, but I am considering going a level deeper and considering each word with its corresponding attributes.

kermitt2 commented 4 years ago

Do I understand it correctly if each token corresponds to a "STRING" from pdfalto?

yes

I have been using the attributes of "TextLine" in my classification problem, but I am considering going a level deeper and considering each word with its corresponding attributes

If token-level attribute information is useful for your classification task, this makes sense. It also makes sense because, in the ALTO files that pdfalto generates, the STYLEREFS information is attached to String elements, not at the TextLine or TextBlock level (a line or block could contain several different STYLEREFS).
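For reference, this is roughly what that looks like in an ALTO file: the STYLEREFS attribute sits on each String, so two words on the same line can point to different text styles. The ids and coordinate values below are made up for illustration:

```xml
<TextLine HPOS="56.7" VPOS="102.3" WIDTH="480.0" HEIGHT="11.0">
  <!-- two tokens on the same line referencing different styles -->
  <String CONTENT="Abstract" STYLEREFS="font1" HPOS="56.7" VPOS="102.3" WIDTH="48.2" HEIGHT="11.0"/>
  <SP WIDTH="3.1" HPOS="104.9" VPOS="102.3"/>
  <String CONTENT="This" STYLEREFS="font0" HPOS="108.0" VPOS="102.3" WIDTH="22.5" HEIGHT="11.0"/>
</TextLine>
```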

What is not supported for the moment in pdfalto's ALTO output is character/glyph-level representation (requested a lot, see https://github.com/kermitt2/pdfalto/issues/89 !).

EysiW commented 4 years ago

Do you have an example of the structure in which you save your tokens before using them in sequence labelling?

My application is parsing PDF procurement documents in unpredictable formats (extracting titles/subtitles with their corresponding body text).

I really appreciate your time and comments.

kermitt2 commented 4 years ago

LayoutToken as indicated above https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/layout/LayoutToken.java

There are several related classes, in particular https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/utilities/LayoutTokensUtil.java

and in the related packages for processing LayoutToken:

https://github.com/kermitt2/grobid/tree/master/grobid-core/src/main/java/org/grobid/core/tokenization https://github.com/kermitt2/grobid/tree/master/grobid-core/src/main/java/org/grobid/core/layout

EysiW commented 4 years ago

Do the .raw files in https://github.com/kermitt2/grobid/tree/master/grobid-trainer/resources/dataset/fulltext/corpus correspond to a preprocessed document, and the .tei files to the output after the CRF? Where can I find more details on where the CRF does the sequence labelling?

kermitt2 commented 4 years ago

The .raw files provide all the features to the sequence labelling method (CRF or Deep Learning). The features are usually created from the text (the text of the LayoutToken object) and, when relevant, from the layout information found in the LayoutToken object.

The .tei files are not the output of GROBID. These TEI files are used to add the expected labels to the raw files to create the full training data. They follow exactly the text sequence output by pdfalto: they are a mirror of the PDF to be labelled.

The TEI files output by GROBID are representations of the logical structure of the document, not of a particular presentation as found in a PDF. The resulting TEI is therefore normalized and ignores presentation-specific information (like page numbers). It is not possible to represent both the logical structure of a document and a particular presentation in the same TEI document.
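The merge described above, raw feature lines plus labels recovered from the annotated TEI, can be sketched as follows. This is a simplified illustration, not GROBID's actual code: the real pipeline also has to align the TEI text with the pdfalto token sequence before labels can be attached.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of building labelled training data: each .raw line
// already carries a token and its features; the expected label (taken from
// the annotated TEI) is appended as the final column.
public class TrainingDataBuilder {
    public static List<String> merge(List<String> rawFeatureLines,
                                     List<String> labels) {
        if (rawFeatureLines.size() != labels.size())
            throw new IllegalArgumentException("raw lines and labels must align");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rawFeatureLines.size(); i++)
            out.add(rawFeatureLines.get(i) + " " + labels.get(i));
        return out;
    }
}
```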

For the sequence labelling task, this is abstracted in the package org.grobid.core.engines.tagging, so that different sequence labelling libraries can be used transparently (see org.grobid.core.jni for the actual integration).

EysiW commented 4 years ago

How do you conduct the classification? Do you classify token by token?

Did you try/consider classifying row by row or block by block? in either case, why/ why not?

kermitt2 commented 4 years ago

How do you conduct the classification?

This is a sequence labelling task. The reference section of the GROBID documentation points to several papers on the particular tasks that Grobid tries to realize.

Do you classify token by token? Did you try/consider classifying row by row or block by block? in either case, why/ why not?

The granularity of each sequence labelling model is based on the size of the input sequence, the complexity of the task, feature selection, and a lot of experiments. We have models using only text tokens, models using full layout information at token level, and a model using line-level information (the monograph model will very likely work at block level).

EysiW commented 4 years ago

Hello again. Many of the features used in the high-level segmentation apply to a token; however, the features are used to describe a TextLine from ALTO. How does this work in practice? The high-level segmentation model considers a sequence of TextLines, if I understand it correctly?

kermitt2 commented 4 years ago

Yes, high-level segmentation works on a sequence of lines. The features for the segmentation model are defined based either on the complete line, on the block where the line appears, or on the first two tokens of the line (assuming that these two tokens carry interesting information related to zoning).
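To make the three feature sources concrete, here is a hypothetical sketch of composing one feature row per line: features from the whole line, from the enclosing block, and from the first two tokens. The feature names and choices here are illustrative only, not GROBID's actual segmentation feature set.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one feature row per TextLine for a line-level
// segmentation model, combining token-, line-, and block-derived features.
public class LineFeatureSketch {

    public static String featureRow(List<String> lineTokens,
                                    boolean firstLineOfBlock,
                                    boolean lastLineOfBlock) {
        String tok1 = lineTokens.size() > 0 ? lineTokens.get(0) : "NONE";
        String tok2 = lineTokens.size() > 1 ? lineTokens.get(1) : "NONE";
        List<String> feats = new ArrayList<>();
        feats.add(tok1);                                    // first token
        feats.add(tok2);                                    // second token
        feats.add(caseProfile(tok1));                       // capitalisation
        feats.add(allDigits(tok1) ? "DIGIT" : "NODIGIT");   // digit profile
        feats.add("LEN" + Math.min(lineTokens.size(), 10)); // line-length bucket
        feats.add(firstLineOfBlock ? "BLOCKSTART" : "BLOCKIN"); // block position
        feats.add(lastLineOfBlock ? "BLOCKEND" : "BLOCKIN");
        return String.join(" ", feats);
    }

    static String caseProfile(String t) {
        if (t.isEmpty()) return "NOCAPS";
        if (t.equals(t.toUpperCase())) return "ALLCAPS";
        return Character.isUpperCase(t.charAt(0)) ? "INITCAP" : "NOCAPS";
    }

    static boolean allDigits(String t) {
        return !t.isEmpty() && t.chars().allMatch(Character::isDigit);
    }
}
```

One such row per line, with the expected zone label appended during training, is what a line-level sequence labelling model consumes.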

When more training data for the segmentation model becomes available, we will certainly review and refine these features (it's complicated to work on features with very little training data).

Working line by line is also a compromise for fast processing: compared to the line level, the token level appeared to slow down the whole process quite significantly (initially the segmentation model also worked on a sequence of tokens).

EysiW commented 4 years ago

How do you determine if the first two tokens carry interesting information related to zoning?

kermitt2 commented 4 years ago

Based on experimental evaluation of the model: evaluation of the f-score on a 10% random holdout for different sets of features.

I tried with the first 0, 1, 2, and 3 tokens of the line, also with and without the last one and the last two... The best result at the time was with the first 2 tokens of the line, in combination with line features and block features.

But as I said, it's to be revisited at some point with more training data and better layout features.
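The evaluation loop described above (pick a feature set, train, compute the f-score on a 10% random holdout) comes down to comparing predicted and expected labels. A minimal per-label f-score computation, for illustration only:

```java
import java.util.List;

// Minimal precision/recall/F1 computation for a single label, as used when
// comparing feature sets on a held-out split. Illustrative only.
public class F1 {
    public static double score(List<String> expected, List<String> predicted,
                               String label) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < expected.size(); i++) {
            boolean e = expected.get(i).equals(label);
            boolean p = predicted.get(i).equals(label);
            if (e && p) tp++;
            else if (!e && p) fp++;
            else if (e && !p) fn++;
        }
        double precision = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        return precision + recall == 0 ? 0.0
             : 2 * precision * recall / (precision + recall);
    }
}
```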

EysiW commented 4 years ago

That is interesting! Do you have any documentation on the impact of your different features? For example, the weights of different state transitions, or features that have a higher impact on specific classes.

kermitt2 commented 4 years ago

No, I didn't keep the results of these feature-selection experiments. I only archive the end-to-end benchmarking made for each new release. I consider GROBID in general a work in progress, and as the training data grows the experiments quickly become obsolete, so I never feel the need to look at them again; I usually prefer to re-run various benchmarks over time, when I feel it is worth spending time again on feature engineering.