mtm80 / russ-project


Lemmatizing Russian #23

Closed by richiebful 6 years ago

richiebful commented 6 years ago

Is there an accepted authority for lemmatizing Russian text out there? The Yandex Mystem project seems promising. There's also a Python wrapper that would make it easier to lemmatize the XML in place. I was thinking this would be helpful if we decide to do topic modelling.
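A minimal sketch of what lemmatizing the XML in place could look like. It assumes the Python wrapper is pymystem3 (the thread does not name it) and that `Mystem().lemmatize(text)` returns a list of lemmas and separators. The helper names `mystem_lemmatizer` and `lemmatize_xml` are hypothetical, and the lemmatizer is passed in as a plain callable so a different tool could be swapped in later.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

Lemmatizer = Callable[[str], List[str]]


def mystem_lemmatizer() -> Lemmatizer:
    """Return a Mystem-backed lemmatizer.

    Assumption: the third-party package pymystem3 is installed. It is
    imported lazily so the rest of this sketch works without it.
    """
    from pymystem3 import Mystem
    return Mystem().lemmatize


def _lemmatize_text(text: str, lemmatize: Lemmatizer) -> str:
    """Lemmatize one text node, keeping its leading and trailing whitespace."""
    core = text.strip()
    if not core:
        return text
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    # Join the returned pieces and drop any newline the tool appends.
    return lead + "".join(lemmatize(core)).strip() + trail


def lemmatize_xml(root: ET.Element, lemmatize: Lemmatizer) -> None:
    """Replace the text and tail of every element under root with lemmas."""
    for el in root.iter():
        if el.text:
            el.text = _lemmatize_text(el.text, lemmatize)
        if el.tail:
            el.tail = _lemmatize_text(el.tail, lemmatize)
```

With pymystem3 available, usage would be along the lines of `tree = ET.parse(path)`, then `lemmatize_xml(tree.getroot(), mystem_lemmatizer())`, then `tree.write(path, encoding="utf-8")`. Markup is left untouched; only character data changes.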

djbpitt commented 6 years ago

Yandex Mystem has received good reviews, but I've never used it. The dictionary that Elise and I have been building is accessible, and we can show you how to interact with it, but it's still missing a lot of lexemes, so if you can get decent results from Mystem, I think that's the better choice for now. And since I've never used it myself, I'd be eager to learn how it performs in your experience.

Intuitively, it seems as if lemmatization would improve topic-modeling performance, especially with relatively short texts. It might be interesting to run the same texts with and without lemmatization, though, to see whether there is a difference in practice.
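One cheap way to see why short texts might benefit, before running a full topic model: count how many distinct word forms the model would have to treat as separate terms with and without lemmatization. The sketch below is only an illustration. `TOY_LEMMAS` is a hand-written stand-in for a real lemmatizer such as Mystem, and the three sentences are invented, not project data.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical toy lemma table standing in for a real lemmatizer.
TOY_LEMMAS: Dict[str, str] = {
    "книги": "книга",
    "книгу": "книга",
    "читал": "читать",
    "читает": "читать",
}


def vocab_stats(docs: List[str], normalize: Callable[[str], str]) -> Dict[str, int]:
    """Count tokens, distinct types, and types seen only once."""
    counts = Counter(
        normalize(tok) for doc in docs for tok in doc.lower().split()
    )
    return {
        "tokens": sum(counts.values()),
        "types": len(counts),
        "hapaxes": sum(1 for c in counts.values() if c == 1),
    }


docs = ["Он читал книгу", "Она читает книги", "Книга лежит"]

raw = vocab_stats(docs, lambda tok: tok)                       # surface forms
lem = vocab_stats(docs, lambda tok: TOY_LEMMAS.get(tok, tok))  # lemmas

print("without lemmatization:", raw)
print("with lemmatization:   ", lem)
```

In this toy case the same eight tokens collapse from eight distinct forms to five lemmas, so terms recur across documents instead of appearing once each. Whether that carries over to better topics on the real corpus is exactly what the with/without comparison would have to show.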