kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Question: Does GROBID support multiple languages? #645

Closed andrei-volkau closed 4 years ago

andrei-volkau commented 4 years ago

Question: Does GROBID support multiple languages?

My thoughts: Sorry, I was not able to find the answer in the docs. I struggle to understand which parts are language-dependent and which are language-independent.

Example: consider sentence segmentation, which is part of GROBID-dev. It is implemented using pragmatic_segmenter.

pragmatic_segmenter supports the following languages.

Does this mean that the functionality works only for those languages? I tested it on a paper in Russian, and the resulting sentences seem reasonable.

kermitt2 commented 4 years ago

Hello!

The GROBID models themselves are language independent, in the sense that we use examples in different languages to train them (and it works better than training one model per language, even for English). However, beyond English, German, French, and a little Spanish, other languages are not in the training data, so performance will be lower for them because it then relies only on the language-independent layout and lexical features.

Another limit is the available text tokenizers: Grobid covers Indo-European languages, CJK, and Arabic, but that's it... We focus on scientific content, so I think that's really OK.

PDF extraction itself (done by pdfalto) is complicated for some languages like Arabic, because of the weird differences between PDF stream order and reading order - it can have a very big impact.

Then there's a language recognizer for adding the xml:lang attributes and for selecting the text tokenizer. It's not Grobid itself: it's pluggable, and its coverage depends on the actual implementation. Currently it's the Cybozu Labs one, which covers 53 languages (99% accuracy claimed on these 53 languages).
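To illustrate the idea of a pluggable recognizer that picks the tokenizer, here is a minimal Python sketch. It is not GROBID's code and not the Cybozu Labs detector: `toy_detect_language`, the two tokenizers, and the `TOKENIZERS` table are all hypothetical names made up for this example, and the "detector" only looks at Unicode script ranges.

```python
import re
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]

# Kana and common Han ranges, used by the toy detector and the CJK tokenizer.
_CJK = "\u3040-\u30ff\u4e00-\u9fff"


def default_tokenizer(text: str) -> List[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def cjk_tokenizer(text: str) -> List[str]:
    """Emit each CJK character as its own token; other words stay whole."""
    pattern = rf"[{_CJK}]|[^\W{_CJK}]+|[^\w\s]"
    return re.findall(pattern, text)


def toy_detect_language(text: str) -> str:
    """Hypothetical stand-in for a real language recognizer (script-based only)."""
    if re.search(r"[\u3040-\u30ff]", text):  # any kana: assume Japanese
        return "ja"
    if re.search(r"[\u4e00-\u9fff]", text):  # Han without kana: assume Chinese
        return "zh"
    return "en"


# Assumed mapping from language code to tokenizer; anything else uses the default.
TOKENIZERS: Dict[str, Tokenizer] = {"ja": cjk_tokenizer, "zh": cjk_tokenizer}


def tokenize(
    text: str, detect: Callable[[str], str] = toy_detect_language
) -> Tuple[str, List[str]]:
    """Detect the language, then dispatch to the matching tokenizer."""
    lang = detect(text)
    tokenizer = TOKENIZERS.get(lang, default_tokenizer)
    return lang, tokenizer(text)
```

Because `detect` is just a parameter, a different recognizer can be swapped in without touching the tokenizers, which is the point of the recognizer being pluggable.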

Finally, also "not Grobid", and a bit particular: the optional sentence segmentation was added recently. It takes advantage of Grobid's structure information to improve the segmentation (for example, it avoids splitting a sentence in the middle of a bibliographical reference, as mainstream sentence segmenters do), and its language coverage depends on the actual implementation.

Sentence segmentation turned out to be very useful for many downstream text mining processes, so it was added as "core" functionality.
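The idea of using structure to protect references from being split can be sketched as follows. This is only an illustration in Python under stated assumptions, not GROBID's implementation and not pragmatic_segmenter: the boundary rule is a deliberately naive regex, and `protected` is a hypothetical list of character spans that an upstream step has already marked as bibliographical references.

```python
import re
from typing import List, Tuple

# Naive candidate boundary: ., ! or ? followed by whitespace and an
# uppercase letter or an opening bracket.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z(\[])")


def segment(text: str, protected: List[Tuple[int, int]]) -> List[str]:
    """Split text into sentences, ignoring boundaries inside protected spans.

    `protected` holds (start, end) character offsets, end exclusive.
    """
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        cut = match.start()
        # Skip any candidate boundary that falls strictly inside a protected span.
        if any(s < cut < e for s, e in protected):
            continue
        sentences.append(text[start:cut])
        start = match.end()
    tail = text[start:]
    if tail.strip():
        sentences.append(tail)
    return sentences


text = "We follow prior work (Smith et al. Nature, 2019). Results improve."
ref_span = (text.index("("), text.index(")") + 1)

with_structure = segment(text, [ref_span])
without_structure = segment(text, [])
```

Here `without_structure` breaks the reference after "et al.", while `with_structure` keeps it inside a single sentence.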

andrei-volkau commented 4 years ago

@kermitt2 thank you for the detailed reply!

  • pragmatic_segmenter covers many languages very well and is fast (and is the most accurate for English on scientific texts, from what we observed so far)

That is an interesting fact. Thank you. I expected that a sentence segmentation algorithm that uses the dependency parse to determine sentence boundaries, such as the approach used in spaCy, would outperform any purely rule-based one.

Sentence segmentation turned out to be very useful for many downstream text mining processes, so it was added as "core" functionality.

Yes, that is a cool feature. Having coordinates for each sentence is amazing as well!