reynoldsnlp / udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
GNU General Public License v3.0

collect gold-standard corpora #26

Open reynoldsnlp opened 4 years ago

reynoldsnlp commented 4 years ago

We need a large collection of gold-standard disambiguated Russian texts for FST/CG testing. Whichever source we use, its tags and file format will have to be converted to udar/CG3. Some possibilities include:

reynoldsnlp commented 4 years ago

It looks like SynTagRus has now been published in a Universal Dependencies format: https://github.com/UniversalDependencies/UD_Russian-SynTagRus/tree/master
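
A rough sketch of the format half of that conversion, assuming the treebank files are standard 10-column CoNLL-U and that the target is the usual CG3 stream layout (a `"<wordform>"` line followed by tab-indented readings). The function name `conllu_to_cg3` and the sample sentence are made up for illustration, and the UD tags are passed through unchanged: the actual UD-to-udar tag mapping is not defined here and would still have to be written.

```python
# Hypothetical helper: not part of udar. It only reshapes CoNLL-U into a
# CG3-style stream; it does NOT translate UD tags into udar's tagset.

SAMPLE = (
    "# text = Она читала.\n"
    "1\tОна\tона\tPRON\t_\tCase=Nom|Gender=Fem|Number=Sing|Person=3\t2\tnsubj\t_\t_\n"
    "2\tчитала\tчитать\tVERB\t_\tAspect=Imp|Tense=Past\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def conllu_to_cg3(conllu_text):
    """Return a CG3-style stream with one (gold) reading per token."""
    out = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, form, lemma, upos, _xpos, feats = cols[:6]
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in tok_id or "." in tok_id:
            continue
        tags = [upos]
        if feats != "_":
            tags.extend(feats.split("|"))
        out.append(f'"<{form}>"')
        out.append(f'\t"{lemma}" ' + " ".join(tags))
    return "\n".join(out)


if __name__ == "__main__":
    print(conllu_to_cg3(SAMPLE))
```

Because the treebank is already disambiguated, each cohort gets exactly one reading, which is what makes it usable as a gold standard to compare FST/CG output against.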

reynoldsnlp commented 4 years ago

Other Russian UD treebanks also exist: https://universaldependencies.org/#russian-treebanks