Closed · andrei-volkau closed this issue 4 years ago
Hello!
The GROBID models themselves are language independent, in the sense that we use examples in different languages to train them (and this works better than training one model per language, even for English). However, beyond English, German, French, and a bit of Spanish, other languages are not in the training data, so performance will be lower because it relies only on the language-independent layout and lexical features.
Another limitation is the available text tokenizers: Grobid covers Indo-European languages, CJK, and Arabic, but that's it... We focus on scientific content, so I think that's really OK.
PDF extraction itself (done by pdfalto) is complicated for some languages like Arabic, because of the weird differences between PDF stream order and reading order; it can have a very big impact.
Then there's a language recognizer for adding the xml:lang attributes and for selecting the text tokenizer. This is not part of Grobid itself; it's pluggable and depends on the actual implementation. Currently it's the Cybozu Labs one, which covers 53 languages (99% accuracy is claimed on these 53 languages).
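As an illustration of where this ends up, here is a sketch (simplified, not the exact output of any particular GROBID version) of how a detected language surfaces as a standard TEI xml:lang attribute:

```xml
<!-- illustrative TEI fragment; element nesting simplified -->
<text xml:lang="de">
  <body>
    <p>Die Ergebnisse zeigen ...</p>
  </body>
</text>
```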
Finally, also "not Grobid" and a bit particular: the optional sentence segmentation was added recently to take advantage of Grobid's structure information to improve sentence segmentation (for example, avoiding splitting a sentence in the middle of a bibliographical reference, as mainstream sentence segmenters do). It also depends on the actual implementation:
- pragmatic_segmenter covers many languages very well and is fast (and is the most accurate for English on scientific texts, from what we have observed so far).
- OpenNLP only includes English (we could add the models for 5 other languages, da, de, nl, se, pt, but adding OpenNLP was more an exercise to test the pluggable sentence-splitter mechanism).

Sentence segmentation appeared to be very useful for many further text mining processes, so it was added as "core" functionality.
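To make the point about bibliographical references concrete: a purely punctuation-based splitter (a minimal sketch below, not pragmatic_segmenter's actual rules) will happily break a sentence at the period inside "et al.", which is exactly the kind of error that structure-aware segmentation avoids:

```python
import re

def naive_split(text):
    # Split after sentence-final punctuation followed by whitespace --
    # the kind of rule a simplistic rule-based segmenter might use.
    return re.split(r'(?<=[.!?])\s+', text.strip())

sentence = "As shown by Smith et al. (2020), the effect is robust."
parts = naive_split(sentence)
# The single sentence is wrongly split at "et al." into two pieces:
# ['As shown by Smith et al.', '(2020), the effect is robust.']
print(parts)
```

A segmenter that knows the span "Smith et al. (2020)" is part of a citation can simply refuse to split inside it.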
@kermitt2 thank you for the detailed reply!
> pragmatic_segmenter covers many languages very well and is fast (and is the most accurate for English on scientific texts, from what we observed so far)
That is an interesting fact, thank you. I expected that a sentence segmentation algorithm using the dependency parse to determine sentence boundaries would outperform any purely rule-based algorithm. I am referring to the approach used in spaCy, which relies on the dependency parse.
> Sentence segmentation appeared to be very useful for many further text mining processes, so it was added as "core" functionality.
Yes, that is a cool feature. Having coordinates for each sentence is amazing as well!
Question: Does GROBID support multiple languages?
My thoughts: Sorry, I was not able to figure out the answer while searching the docs. I struggle to understand which parts are actually language-dependent and which parts are language-independent.
Example: let me take the sentence segmentation functionality that is part of GROBID-dev. It is implemented using pragmatic_segmenter.
pragmatic_segmenter supports the following languages.
Does it mean that this functionality will work only for those languages? I tested it using a paper in Russian, and the resulting sentence tokens seem reasonable.
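For reference, this is roughly how I invoked the service with sentence segmentation enabled (a sketch based on the GROBID service documentation; the host, port, and file name are placeholders to adjust for your setup):

```shell
# assumes a local GROBID service running on the default port 8070
curl -sS --form "input=@paper.pdf" \
     --form "segmentSentences=1" \
     "http://localhost:8070/api/processFulltextDocument" \
     > paper.tei.xml
```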