sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License
22.52k stars 3.61k forks

Add stemming and lemmatisation section #304

Open LifeIsStrange opened 5 years ago

LifeIsStrange commented 5 years ago

According to the List_of_unsolved_problems_in_computer_science:

Is there any perfect stemming algorithm in the English language?

I believe that lemmatization is not solved either.

It would be wonderful to add the state of the art for both tasks. BTW, lemmatization consists, for example, of transforming the conjugated verb jumped into its base form (lemma), jump.

Is there a tool that takes a word (e.g. fast) and a requested part of speech (e.g. adverb) as arguments and outputs the corresponding form (e.g. fastly)? In fact, stemming and lemmatization are special cases of the NLP task I need. If it exists, does anyone know what it's called? Where could I ask? Sorry for the digression.
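To make the difference between the two tasks concrete, here is a minimal toy sketch in Python. The suffix list and the lookup table are hypothetical and deliberately tiny. Real stemmers and lemmatizers are far more elaborate, so treat this as an illustration of the idea rather than a usable tool.

```python
# Toy illustration only: the rules and the table below are made up for this example.

SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table keyed by (word form, part of speech).
LEMMA_TABLE = {
    ("jumped", "VERB"): "jump",
    ("went", "VERB"): "go",
    ("mice", "NOUN"): "mouse",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form if it is in the table, else the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)


for form, pos in [("jumped", "VERB"), ("went", "VERB"), ("mice", "NOUN")]:
    print(form, "->", toy_stem(form), "(stem) /", toy_lemmatize(form, pos), "(lemma)")
```

Suffix stripping handles jumped but leaves irregular forms such as went and mice untouched, whereas the lookup maps them to go and mouse. That gap is one reason the two tasks are usually treated separately.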

LifeIsStrange commented 5 years ago

Benchmarks: http://universaldependencies.org/conll18/results-lemmas.html?source=post_page---------------------------

BTW, there is a great writeup at https://towardsdatascience.com/state-of-the-art-multilingual-lemmatization-f303e8ff1a8

LifeIsStrange commented 5 years ago

So, if en means English, the SOTA results are en_ewt: 97.23, en_gum: 96.18, en_lines: 96.56 and en_pud: 96.39, which is not all that accurate...
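To put those numbers in perspective, the short sketch below converts each score into an approximate number of wrong lemmas per 1,000 words. It assumes the scores can be read as percentages of correctly lemmatized words. The exact metric definition used by the shared task is not restated in this thread, so the output is only a rough guide.

```python
# Scores quoted above, read here as percentages of correctly lemmatized words
# (an assumption; the shared task's exact metric is not restated in this thread).
scores = {"en_ewt": 97.23, "en_gum": 96.18, "en_lines": 96.56, "en_pud": 96.39}


def errors_per_thousand(score_percent: float) -> float:
    """Convert a percentage score into approximate errors per 1,000 words."""
    return round((100.0 - score_percent) * 10, 1)


for treebank, score in scores.items():
    print(f"{treebank}: about {errors_per_thousand(score)} wrong lemmas per 1,000 words")
```

Under that reading, even the best treebank score corresponds to more than 25 wrong lemmas per 1,000 words.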

sebastianruder commented 5 years ago

Thanks for the note! Would you mind taking the lead on this, i.e. adding some state-of-the-art results for lemmatization and/or stemming? I think the task that you're looking for is morphological reinflection. Note that you need not only the part of speech but also the remaining morphosyntactic features (otherwise the problem is underspecified).
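The underspecification point can be illustrated with a small sketch. The paradigm table below is hypothetical, and its feature names only loosely follow Universal Dependencies conventions. A plain lookup stands in for whatever model a real reinflection system would use, so this shows the shape of the problem, not an actual implementation.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical paradigm table: (lemma, feature bundle) -> inflected form.
# The entries are illustrative, not taken from any real resource.
PARADIGM: Dict[Tuple[str, FrozenSet[str]], str] = {
    ("jump", frozenset({"VERB", "Tense=Past"})): "jumped",
    ("jump", frozenset({"VERB", "Tense=Pres", "Person=3", "Number=Sing"})): "jumps",
    ("jump", frozenset({"VERB", "VerbForm=Ger"})): "jumping",
    ("jump", frozenset({"NOUN", "Number=Plur"})): "jumps",
}


def candidates(lemma: str, features: Iterable[str]) -> List[str]:
    """Return every form whose feature bundle contains all requested features."""
    requested = frozenset(features)
    return sorted(
        {
            form
            for (entry_lemma, bundle), form in PARADIGM.items()
            if entry_lemma == lemma and requested <= bundle
        }
    )


# Part of speech alone matches several forms, so the request is ambiguous.
print(candidates("jump", {"VERB"}))
# Adding the tense feature narrows the request down to a single form.
print(candidates("jump", {"VERB", "Tense=Past"}))
```

With only VERB requested, the lookup returns three competing forms. Once Tense=Past is added, exactly one remains, which is why the full feature bundle is needed.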