sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License

Add UD POS tagging Results #620

Closed ExplorerFreda closed 5 months ago

ExplorerFreda commented 2 years ago

Added the UD POS tagging results from Substructure Substitution: Structured Data Augmentation for NLP (Shi et al., Findings of ACL 2021).

LifeIsStrange commented 2 years ago

@ExplorerFreda THANK YOU FOR THIS!!

This does not need to be in this PR, but note that the Penn Treebank recently got a new SOTA: https://paperswithcode.com/paper/sequence-alignment-ensemble-with-a-single (the first since 2018). That said, as a reminder, progress on POS tagging accuracy is blocked by the absurd state of the datasets, and as always when it matters in AI research, no one seems to care, not even Google. NLP datasets, including the Penn Treebank, contain a high percentage of annotation errors. The problem is that we update models but we do not update the datasets: the Penn Treebank has stayed the same for decades instead of being versioned. This inept (yet normalized) tragedy is well explained and quantified in https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf

The huge (multiple percentage points) error rate in mainstream datasets, and their lack of evolution, is also demonstrated at https://labelerrors.com/. This has huge consequences, including this unintended one for research:

https://labelerrors.com/about#:~:text=Surprisingly%2C%20we%20find%20lower%20capacity%20models,5%25%20of%20accurately%20labeled%20test%20data.

> Surprisingly, we find lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on the ImageNet validation set with corrected labels: ResNet-18 outperforms ResNet-50 if we randomly remove just 6% of accurately labeled test data. On the CIFAR-10 test set with corrected labels: VGG-11 outperforms VGG-19 if we randomly remove just 5% of accurately labeled test data.

1. Errors in datasets seem to significantly lower/hide the accuracy gains that large neural networks could deliver. (How much untapped potential does this finding imply?)
2. Many key NLP/vision tasks already sit at or above ~90% accuracy, so the error they inherit from the datasets often accounts for more than 50% of all remaining possible accuracy gains (see the back-of-the-envelope sketch below). Yet no one works on this, and neural network research is hitting a wall/AI winter; one reason is that better ideas may show worse or no improvement because of these errors.
3. There are cascading effects: since POS tagging feeds key downstream tasks such as dependency parsing, dataset errors impose a ceiling on the accuracy of dependency parsing and most other NLP tasks.
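A back-of-the-envelope sketch of the arithmetic in point 2 (the numbers are purely illustrative assumptions, not taken from either paper):

```python
# Illustrative sketch: how much of a tagger's remaining error could be
# explained by test-set label errors alone. Both numbers are assumptions.
label_error_rate = 0.015    # assumed fraction of wrongly labeled test tokens
reported_accuracy = 0.973   # assumed token-level accuracy of a strong tagger

remaining_error = 1.0 - reported_accuracy
# Label errors can explain at most min(label_error_rate, remaining_error) of it.
share_from_labels = min(label_error_rate, remaining_error) / remaining_error

print(f"Remaining error: {remaining_error:.1%}")                                     # ~2.7%
print(f"Share attributable to label errors (upper bound): {share_from_labels:.0%}")  # ~56%
```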

A common example of an abandonware dataset is WordNet, unlike its open-source successor https://github.com/globalwordnet/english-wordnet, which is much more complete and accurate, and which evolves as language evolves. All/most NLP datasets should be forked into an open-source organization and receive funding for improving accuracy. So much money is being directed at pointless or unrealistic goals. Why don't enterprises fund the ongoing improvement of dataset accuracy? Why don't they realize this is the most impactful roadblock to improving the state of the art and breaking the current NLP AI plateau/semi-winter? My personal answer is that, as often, no one really cares; people just pretend to care, and most actions in AI are virtue signaling, marketing PR, and hype-driven short-termism or unrealism. It's time for action.

As the paper shows, 84% of the errors made by POS taggers are caused by Penn Treebank annotation errors. Also, POS tagger accuracy per sentence is extremely low, which blocks any serious NLU ambitions:

> It is perhaps more realistic to look at the rate of getting whole sentences right, since a single bad mistake in a sentence can greatly throw off the usefulness of a tagger to downstream tasks such as dependency parsing. Current good taggers have sentence accuracies around 55–57%, which is a much more modest score.
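A minimal sketch of why high per-token accuracy translates into such low per-sentence accuracy, under the naive assumption that token errors are independent (both numbers below are illustrative assumptions):

```python
# Under independence, whole-sentence accuracy decays roughly as
# token_accuracy ** sentence_length.
token_accuracy = 0.972       # assumed per-token accuracy of a good tagger
avg_sentence_length = 20     # assumed average number of tokens per sentence

sentence_accuracy = token_accuracy ** avg_sentence_length
print(f"Approximate whole-sentence accuracy: {sentence_accuracy:.0%}")  # ~57%
```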

Therefore, improving POS tagging sentence accuracy via a simple, relatively low-cost correction of the Penn Treebank by paid experts would lead to an explosion of possibilities for NLU research and products, since it is the most salient bottleneck.

@ExplorerFreda So my question is: is the UD English POS tagging dataset versioned/improved over time, unlike the Penn Treebank? @sebastianruder Could you raise this problem of dataset accuracy stagnation inside Google and other influential circles? The two papers I linked show how urgent and impactful the problem is, and how actionable the solution is.

sebastianruder commented 5 months ago

Thanks for this PR! @LifeIsStrange, thanks for highlighting issues with existing datasets. That's something important to be aware of.