sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

https://nlpprogress.com/

MIT License

Language recognition? #600

Open ZedZipDev opened 2 years ago

ZedZipDev commented 2 years ago

Is there anything for language recognition? I.e., input: a text; output: the language the text is written in.

IgnatiusEzeani commented 2 years ago

Do you mean the language identification task? See if any of these works can help.

Yuliya-HV commented 2 years ago

You may want to check StanzaNLP language identification: https://stanfordnlp.github.io/stanza/langid.html
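
For a feel of what the task looks like (text in, language code out), here is a toy sketch using only the Python standard library. It is not how Stanza's model works: it simply compares character-trigram counts with cosine similarity. The sample sentences, the `SAMPLES` table, and the `identify_language` helper are all made up for illustration, and the samples are far too small for real use.

```python
import math
from collections import Counter

# Hypothetical, tiny per-language samples (assumption: real systems train on far more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple sentence in the english language",
    "fr": "le renard brun saute par-dessus le chien paresseux et ceci est une phrase simple dans la langue française",
    "de": "der schnelle braune fuchs springt über den faulen hund und dies ist ein einfacher satz in der deutschen sprache",
}


def trigrams(text):
    """Count overlapping character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# One trigram profile per language, built once from the samples above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def identify_language(text):
    """Return the language code whose profile is most similar to the input text."""
    query = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


if __name__ == "__main__":
    for sentence in ["this is the english language", "dies ist ein satz in der deutschen sprache"]:
        print(sentence, "->", identify_language(sentence))
```

A sketch like this only separates languages it has samples for and will struggle with short or mixed-language input, so a trained model such as the Stanza one linked above is the practical choice.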

sebastianruder commented 2 years ago

Thanks for these pointers. The task is also abbreviated as language ID and is still far from solved (see this COLING 2020 paper for an overview of challenges). As far as I am aware, there is a lack of gold standard multilingual web-domain datasets for this task.

LifeIsStrange commented 2 years ago

https://paperswithcode.com/task/language-identification

LifeIsStrange commented 2 years ago

I wonder if this https://paperswithcode.com/paper/a-reproduction-of-apple-s-bi-directional-lstm is the current state of the art. The performance is not good at all... It seems to be an LSTM; I guess a transformer like BERT, or better, XLNet, would reach higher accuracy?