Russian tagging and parsing models for CoreNLP

stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

http://stanfordnlp.github.io/CoreNLP/

GNU General Public License v3.0


Russian tagging and parsing models for CoreNLP #480

Closed kamivao closed 5 years ago

kamivao commented 7 years ago

[we previously mailed the same letter to: parser-support@lists.stanford.edu ]

Hi!

My colleagues and I have implemented Russian models for tagging and dependency parsing in Stanford CoreNLP and kindly ask for advice.

By now we’ve done the following:

Tagger:

1.1 The model was trained on a 200k-token sample from the Russian parts of parallel corpora for statistical machine translation (at http://statmt.org/). The morphologically annotated subcorpus totals 10 million tokens. It was annotated with the SemSin parser: each token was assigned a lemma, a POS tag and morphological features, and homonymy was resolved during parsing. SemSin is a slow but comparatively accurate dependency parser (tagging quality is above 0.96, best LAS is slightly above 0.80, both evaluated on news texts) that combines surface syntax and semantic role labels, although neither type of label follows any standard. SemSin descriptions are available in Russian only; some information in English can be found in the abstract of this paper: http://www.dialog-21.ru/media/1394/kanevsky.pdf

1.2 The SemSin tagset was mapped to Universal Dependencies (universal POS tags and universal features).

Results: a Russian tagger properties file (ud-tagger.props) and a trained Russian tagging model, which labels input sentences with POS tags only.

Questions on tagger:

What is the best way to add information about morphological features and lexemes to the model? How should this information be represented in the training set? We have a large annotated training set with POS tags, morphological features and lemmas in UD v2 format and want to add all of this information to the tagging model, because it is important for the analysis of a morphologically rich language like Russian.
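For readers unfamiliar with the format mentioned above, here is a minimal sketch of how a single CoNLL-U (UD v2) token line carries the lemma, the universal POS tag and the morphological features. The sample line and the `parse_conllu_token` helper are assumptions made up for illustration; they are not part of CoreNLP or of the Russian models project.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]


def parse_conllu_token(line: str) -> Token:
    """Parse one 10-column CoNLL-U token line (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    feats: Dict[str, str] = {}
    if cols[5] != "_":  # "_" marks an empty FEATS field
        for pair in cols[5].split("|"):
            name, value = pair.split("=", 1)
            feats[name] = value
    return Token(form=cols[1], lemma=cols[2], upos=cols[3], feats=feats)


# Illustrative line written for this sketch: accusative singular of "книга".
sample = (
    "2\tкнигу\tкнига\tNOUN\t_\t"
    "Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing\t1\tobj\t_\t_"
)
token = parse_conllu_token(sample)
print(token.lemma, token.upos, token.feats["Case"])
```

The sketch only shows where the information sits in the training data; how a tagger should consume it is the open question asked above.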

Parser:

2.1 Following the guidelines on training the neural network dependency parser, we wrote RussianMorphoFeatureSpecification, RussianTreebankLanguagePack and several HeadFinders. The latest one, RussianHeadFinder, was written to find heads in trees that follow the UD representation. Distinguishing main from subordinate clauses and noun phrases from prepositional phrases in Russian turned out to be a problem, so this version of RussianHeadFinder is just a prototype that describes the structure of only the most frequent phrases.

2.2 To train the parser we used 1) a sample of 12,000 sentences in CoNLL-U format as the training set and a sample of 6,500 as the dev set from the SynTagRus treebank, 2) a small treebank of 1,000 sentences parsed with SemSin, automatically converted to UD and manually checked.

Word embeddings were built using the corpora at http://statmt.org/.

2.3 The following parameters were used for training:

    -tlp edu.stanford.nlp.trees.international.russian.RussianTreebankLanguagePack
    -trainFile ud-SytTagRus2.conllu
    -embedFile model400.txt
    -embeddingSize 100
    -model nndep.rus.model.txt.gz
    -maxIter 20000
    -numPreComputed 10000
    -batchSize 1000
    -dropProb 0.25
    -hiddenSize 200
    -initRange 0.005
    -trainingThreads 4
    -evalPerIter 2000
    -devFile ru_syntagrus-ud-dev.conllu
    -language Russian

Results: UAS = 73.83, LAS = 67.67
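As a reminder of what these two numbers measure: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the dependency label to be correct. A minimal sketch follows, assuming every token is counted (evaluation tools differ on whether punctuation is excluded). The `attachment_scores` helper and the toy arcs are illustrative, not the evaluation code that produced the figures above.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over aligned gold/predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS: the head matches, the label is ignored.
    head_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    # LAS: both head and label match.
    full_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy example: 4 tokens, one wrong label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```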

At present we are continuing experiments with a larger embedding size, a larger training set and more iterations.

Questions: Is it possible to use morphological features in the HeadFinder and as features for parser training? (In the Penn Treebank tagset, different POS tags denote morphological characteristics of words belonging to the same part of speech, but in UD grammatical properties are moved to morphological features, which the parser does not use.) Therefore, if we're not mistaken, when the parser is trained on data in the UD representation, a lot of information that is useful for building feature templates and POS embeddings for a morphologically rich language is lost.

Can we get any guidelines from you to elaborate and improve the models and contribute them to CoreNLP?

Link to project with all mentioned classes and models: https://github.com/MANASLU8/CoreNLP https://github.com/MANASLU8/CoreNLPRusModels

About us: we are the NLP group in the Laboratory of Information Science and Semantic Technologies at the Department of Informatics and Applied Mathematics, ITMO University, Saint Petersburg, Russia: http://iam.ifmo.ru/en/. Our main research areas in NLP are linguistic resources for Russian, ontology population, grammar inference for spoken Russian, and voice interfaces for IoT.

J38 commented 6 years ago

Sorry for the delayed response, but this seems really cool!

I'll talk to Chris about this. One thing I am unclear about is how we would build a tokenizer and sentence splitter for Russian text. Also, I will probably need to ask Chris about integrating the morphological features into the part of speech tagger. But it could be interesting to provide people with a basic pipeline for Russian!

kamivao commented 6 years ago

Hi!

Currently we use the default tokenizer and sentence splitter without tuning them for Russian, but we can try to adapt them to solve the problem. Yes, please ask about the morphological features, because the current implementation of mfeatures is too slow: we introduced codes for Russian grammemes as in the Spanish package, but this resulted in 400+ different codes :)
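A rough way to see how a tagset grows to hundreds of codes is to count the distinct POS-plus-features combinations in a treebank. The sketch below does this for CoNLL-U lines. It is an assumption-laden illustration: the `composite_tags` helper and the sample sentence are made up, and the `UPOS|FEATS` string is not the encoding used in the Russian package or in CoreNLP's Spanish package.

```python
from typing import Iterable, Set


def composite_tags(conllu_lines: Iterable[str]) -> Set[str]:
    """Collect distinct 'UPOS|FEATS' strings from CoNLL-U lines (hypothetical helper)."""
    tags: Set[str] = set()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        upos, feats = cols[3], cols[5]
        tags.add(upos if feats == "_" else f"{upos}|{feats}")
    return tags


# Illustrative sentence written for this sketch.
sample = [
    "# text = Я читаю книгу и газету",
    "1\tЯ\tя\tPRON\t_\tCase=Nom|Number=Sing|Person=1\t2\tnsubj\t_\t_",
    "2\tчитаю\tчитать\tVERB\t_\tAspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act\t0\troot\t_\t_",
    "3\tкнигу\tкнига\tNOUN\t_\tAnimacy=Inan|Case=Acc|Gender=Fem|Number=Sing\t2\tobj\t_\t_",
    "4\tи\tи\tCCONJ\t_\t_\t5\tcc\t_\t_",
    "5\tгазету\tгазета\tNOUN\t_\tAnimacy=Inan|Case=Acc|Gender=Fem|Number=Sing\t3\tconj\t_\t_",
]
tags = composite_tags(sample)
print(len(tags), "distinct composite tags")
```

Even this five-token sentence yields four distinct composite tags, because every new combination of case, number, gender, tense and so on becomes its own code.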

We also think it will be an interesting and fruitful experience, if we can join our efforts on the Russian package for CoreNLP. What do you think about it?

P.S. Since the summer we have made some improvements to the models, which are described here: https://link.springer.com/chapter/10.1007/978-3-319-69548-8_8

anatoleg commented 6 years ago

Were you able to make a Russian pipeline work? I'd love to try it. You also mentioned SemSin. Is there a way to download it or try it?

kamivao commented 6 years ago

@anatoleg Hello, we added an example for running the pipeline (POS, lemmas, inflectional morphology, dependency parsing according to UD v2), see https://github.com/MANASLU8/CoreNLPRusModels/blob/master/README.md.

Regarding SemSin: it's better to contact the SemSin developers directly; their emails can be found here: http://www.dialog-21.ru/media/1394/kanevsky.pdf
Sorry for the delayed answer!

J38 commented 6 years ago

We are going to create a "model zoo" for Stanford CoreNLP, so I'd love to add the Russian models you have to that.

kamivao commented 6 years ago

@J38 Cool, please contact us if you have any questions :) We now have a first implemented version of the pipeline (see the readme and the project repository). How can we contribute to including Russian in the next CoreNLP release?

mjbriggs commented 5 years ago

Hello! @kamivao @J38, I am working out of Brigham Young University on a fork of a language learning web search application. More info on it can be found here: https://github.com/reynoldsnlp/flair. We are trying to build on an existing application that makes use of Stanford CoreNLP for English and German. We intend to extend our application to Russian, and stumbled upon this Russian extension. Unfortunately, as our project stands, we cannot use the Russian parser with the code supported by Stanford CoreNLP. I was wondering if you guys were considering a pull request to add Russian language support to the Stanford NLP library. The English and German models that we are using work on both the Russian extension and Stanford CoreNLP. Thank you both for your time, I would greatly appreciate any update!

J38 commented 5 years ago

If you add the jar from the model zoo to your CLASSPATH, you can run the Russian components. You will need to use the latest code from GitHub, not version 3.9.2.

https://stanfordnlp.github.io/CoreNLP/model-zoo.html