Question: Does GROBID support multiple languages?

andrei-volkau commented 4 years ago


My thoughts: Sorry, I could not find the answer in the docs. I struggle to understand which parts are language-dependent and which are language-independent.

Example: consider the sentence segmentation functionality in GROBID-dev, which is implemented using pragmatic_segmenter.

pragmatic_segmenter supports the following languages:

English
Amharic
Arabic
Armenian
Burmese
Chinese
Greek
Hindi
Japanese
Persian
Urdu

Does this mean that sentence segmentation will work only for those languages? I tested it on a paper in Russian, and the resulting sentences seem reasonable.

kermitt2 commented 4 years ago

Hello!

The GROBID models themselves are language-independent, in the sense that we use examples in different languages to train them (and this works better than training one model per language, even for English). However, beyond English, German, French, and a bit of Spanish, other languages are not in the training data, so the performance will be lower, because it then relies only on the language-independent layout and lexical features.

Another limit is the available text tokenizers: Grobid covers Indo-European languages, CJK, and Arabic, but that's it. Since we focus on scientific content, I think that's really OK.

PDF extraction itself (done by pdfalto) is complicated for some languages like Arabic, because of the weird differences between PDF stream order and reading order - it can have a very big impact.

Then there's a language recognizer for adding the xml:lang attributes and for selecting the text tokenizer. It's not part of Grobid itself: it's pluggable, and its coverage depends on the actual implementation. Currently it's the Cybozu Labs one, which covers 53 languages (99% accuracy claimed on these 53 languages).
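To illustrate the idea of a recognizer that selects the tokenizer, here is a minimal Python sketch. It is not Grobid code (Grobid is written in Java and uses a real language identifier): `guess_script`, `tokenize_default`, `tokenize_cjk`, and `TOKENIZERS` are hypothetical names, and the "recognizer" is only a crude majority vote over Unicode character names.

```python
import re
import unicodedata


def guess_script(text):
    """Crude stand-in for a language recognizer (assumption, not Grobid's):
    return the script family that most alphabetic characters belong to."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith(("CJK", "HIRAGANA", "KATAKANA", "HANGUL")):
            key = "cjk"
        elif name.startswith("ARABIC"):
            key = "arabic"
        else:
            key = "default"
        counts[key] = counts.get(key, 0) + 1
    return max(counts, key=counts.get) if counts else "default"


def tokenize_default(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_cjk(text):
    """Emit one token per non-whitespace character (no word spacing to rely on)."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry: the recognizer's answer picks the tokenizer.
TOKENIZERS = {
    "cjk": tokenize_cjk,
    "arabic": tokenize_default,
    "default": tokenize_default,
}


def tokenize(text):
    """Detect the script family, then dispatch to the matching tokenizer."""
    return TOKENIZERS[guess_script(text)](text)
```

Because the recognizer and the tokenizers sit behind a small registry, either side could be swapped for a stronger implementation without touching the caller, which is the point of keeping this component pluggable.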

Finally, also "not Grobid" and a bit particular: the optional sentence segmentation was added recently to take advantage of Grobid's structure information to improve the segmentation (for example, avoiding splitting a sentence in the middle of a bibliographical reference, as the mainstream sentence segmenters do). Its language coverage depends on the actual implementation:

pragmatic_segmenter covers many languages very well and is fast (and is the most accurate for English on scientific texts, from what we have observed so far)
OpenNLP only includes an English model (we could add the models for 5 other languages: da, de, nl, se, pt; but adding OpenNLP was more of an exercise to test the pluggable sentence splitter mechanism).

Sentence segmentation appeared to be very useful for many downstream text mining processes, so it was added as "core" functionality.
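The structure-aware segmentation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Grobid's implementation: `split_sentences` and `protected_spans` are hypothetical names, the regex is a deliberately naive splitter, and the protected span stands in for a reference callout that a structure model would have identified.

```python
import re


def split_sentences(text, protected_spans=()):
    """Naive rule-based splitter that refuses to cut inside protected spans.

    protected_spans: iterable of (start, end) character offsets, for example
    reference callouts located beforehand (hypothetical input).
    """
    sentences = []
    start = 0
    # Candidate boundary: whitespace after '.', '!' or '?' and before an uppercase letter.
    for match in re.finditer(r"(?<=[.!?])\s+(?=[A-Z])", text):
        cut = match.start()
        # Veto the boundary if it falls strictly inside a protected span.
        if any(span_start < cut < span_end for span_start, span_end in protected_spans):
            continue
        sentences.append(text[start:cut])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = "This was shown before (Smith et al. Nature 2019). We extend it."
callout = (text.index("("), text.index(")") + 1)

print(split_sentences(text))             # naive: cuts after "et al."
print(split_sentences(text, [callout]))  # structure-aware: callout kept whole
```

Without the protected span, the splitter cuts after "et al."; with it, the callout stays inside a single sentence.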

andrei-volkau commented 4 years ago

@kermitt2 thank you for the detailed reply!

> pragmatic_segmenter covers many languages very well and is fast (and is the most accurate for English on scientific texts, from what we observed so far)

That is interesting, thank you. I expected that a sentence segmentation algorithm that uses the dependency parse to determine sentence boundaries, such as the one in spaCy, would outperform any purely rule-based algorithm.

> Sentence segmentation appeared to be very useful for many further text mining process, so it was added as "core" functionality.

Yes, that is a cool feature. Having coordinates for each sentence is amazing as well!

kermitt2 / grobid

Question: Does GROBID support multiple languages? #645