stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/
Other
7.23k stars 887 forks

Add Abkhaz #485

Open Bachstelze opened 3 years ago

Bachstelze commented 3 years ago

How can we add the Abkhaz language? There are a few resources, such as https://gitlab.com/Bachstelze/alp and https://github.com/danielinux7/Multilingual-Parallel-Corpus. Can we port those models to Stanza, or do we have to retrain them?

ftyers commented 3 years ago

@Bachstelze it looks like the first step might be to annotate a corpus in Universal Dependencies. I'd be interested in working on that, please feel free to contact me if you are too.

Bachstelze commented 3 years ago

Are there proven, well-known ways to generate treebanks from scratch for post-editing? Is it possible to start with POS tagging and then pre-parse into UD?

ftyers commented 3 years ago

Maybe, but it would take you longer and you would end up with a worse result. It's easier to just annotate from scratch. If there is glossed or tagged text, it can be used to bootstrap a conversion. You could, for example, use UD Annotatrix (with apologies for the orthography):

[Screen recording: Peek 2020-10-13 23-49]

You can skip some of the steps if you have a decent part-of-speech tagger or a glossed corpus. I'm guessing that for Abkhaz, morphological analysis would also be needed if you want to fill out the FEATS column. Anyway, I think it would make a nice project.
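To make the bootstrapping idea concrete, here is a minimal sketch of how already-tagged text might be turned into a CoNLL-U skeleton for manual completion. It is only an illustration: the function name `tagged_to_conllu`, the `draft` sentence-ID prefix, and the sample sentence (an English stand-in, not Abkhaz data) are all assumptions, and whether a given annotation tool accepts rows with unset HEAD and DEPREL should be checked before relying on this.

```python
from typing import Iterable, List, Tuple

# The ten CoNLL-U columns, in order (per the Universal Dependencies format).
CONLLU_COLUMNS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)

TaggedSentence = List[Tuple[str, str]]  # [(word form, UPOS tag), ...]


def tagged_to_conllu(sentences: Iterable[TaggedSentence],
                     sent_prefix: str = "draft") -> str:
    """Build a CoNLL-U skeleton from POS-tagged sentences.

    Only ID, FORM and UPOS are filled in. LEMMA, FEATS, HEAD, DEPREL and
    the remaining columns are left as "_" for an annotator to complete.
    """
    lines: List[str] = []
    for n, sentence in enumerate(sentences, start=1):
        lines.append(f"# sent_id = {sent_prefix}-{n}")
        lines.append("# text = " + " ".join(form for form, _ in sentence))
        for token_id, (form, upos) in enumerate(sentence, start=1):
            row = [str(token_id), form, "_", upos, "_", "_", "_", "_", "_", "_"]
            assert len(row) == len(CONLLU_COLUMNS)
            lines.append("\t".join(row))
        lines.append("")  # a blank line terminates each sentence
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample input; replace with output from a real tagger.
    sample = [[("Dogs", "NOUN"), ("bark", "VERB"), (".", "PUNCT")]]
    print(tagged_to_conllu(sample), end="")
```

If a morphological analyser is available, its output could go into the LEMMA and FEATS fields in the same way, leaving only the heads and relations to annotate by hand.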

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.