openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Evaluate GPT-4 on classical NLP tasks #246

Open LifeIsStrange opened 1 year ago

LifeIsStrange commented 1 year ago

Addressing the elephant in the room

When transformers were first unleashed, their revolutionary accuracy gains were mostly demonstrated on standard NLP tasks such as POS tagging, dependency parsing, coreference resolution, WSD, etc. But I've observed that since PaLM and other very large language models, published benchmark results focus on much higher-level tasks, such as common-sense reasoning tests and question answering. Both sets of benchmarks are useful and needed, but I would like to highlight that the standard NLP tasks are now largely under-benchmarked on these newer language models, and that this impairs progress towards AGI as well as industrial uses.

While it could be argued that progress in purely symbolic AI has stalled for decades, there is real, huge potential for neuro-symbolic hybrid systems that use neural networks for low-level analysis tasks (POS tagging, etc.) and feed that linguistic data to higher-level neural networks or to symbolic systems, in order to push the boundaries of what is possible, especially regarding semantic analysis, i.e. true NLU systems.
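To make the idea concrete, here is a minimal sketch of such a hybrid pipeline. Everything here is hypothetical illustration: `neural_pos_tagger` is a toy lexicon stub standing in for a real neural model, and the symbolic layer is a single hand-written chunking rule over its output.

```python
# Sketch of a neuro-symbolic pipeline: a (stubbed) neural POS tagger feeds
# a symbolic rule layer. In a real system the tagger would be a
# transformer-based model, not a lexicon lookup.

from typing import List, Tuple


def neural_pos_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Stand-in for a neural tagger; returns (token, POS) pairs."""
    lexicon = {"the": "DET", "quick": "ADJ", "brown": "ADJ", "lazy": "ADJ",
               "fox": "NOUN", "dog": "NOUN", "jumps": "VERB", "over": "ADP"}
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


def extract_noun_chunks(tagged: List[Tuple[str, str]]) -> List[str]:
    """Symbolic layer: a chunk is optional DET, any ADJs, then a NOUN."""
    chunks, current = [], []
    for tok, pos in tagged:
        if pos in ("DET", "ADJ"):
            current.append(tok)
        elif pos == "NOUN":
            current.append(tok)
            chunks.append(" ".join(current))
            current = []
        else:
            current = []
    return chunks


sentence = "The quick brown fox jumps over the lazy dog".split()
print(extract_noun_chunks(neural_pos_tagger(sentence)))
# → ['The quick brown fox', 'the lazy dog']
```

The point of the split is that the symbolic layer stays inspectable and editable, while the hard perceptual work (tagging) is delegated to the network.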

### Foundational NLP tasks of interest
- [ ] [Dependency parsing](https://paperswithcode.com/sota/dependency-parsing-on-penn-treebank)
- [ ] [word sense disambiguation](https://paperswithcode.com/sota/word-sense-disambiguation-on-supervised)
- [ ] [Coreference resolution](https://paperswithcode.com/sota/coreference-resolution-on-ontonotes)
- [ ] [POS tagging](https://paperswithcode.com/sota/part-of-speech-tagging-on-penn-treebank)
- [ ] others

Therefore this issue is a call for contributions to implement evals on those standard tasks, especially dependency parsing. I believe GPT-4 has the potential to improve the SOTA on at least some foundational NLP tasks, and an even greater potential once it is fine-tuned and combined with domain-specific optimizations (as is currently done with BERT-based SOTA systems, e.g. HPSG-enhanced parsers for dependency parsing).
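As a starting point for contributors, here is a hedged sketch of what a POS-tagging eval sample and its grading metric might look like. The `input`/`ideal` dict mirrors the JSONL sample convention used in the evals registry, but the prompt wording and the `token_accuracy` function are my own illustration, not part of the evals library.

```python
# Sketch of a POS-tagging eval sample (evals-style "input"/"ideal" JSONL
# fields) plus a standalone token-level accuracy grader. The prompt format
# and grader are assumptions, not the library's built-in match logic.

sample = {
    "input": [
        {"role": "system",
         "content": "Tag each token with its Penn Treebank POS tag. "
                    "Answer as space-separated TOKEN_TAG pairs."},
        {"role": "user", "content": "The dog barks"},
    ],
    "ideal": "The_DT dog_NN barks_VBZ",
}


def token_accuracy(completion: str, ideal: str) -> float:
    """Fraction of gold tokens whose TOKEN_TAG pair the model reproduced."""
    pred = completion.split()
    gold = ideal.split()
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)


# A hypothetical model completion with one tag wrong:
completion = "The_DT dog_NN barks_NNS"
print(round(token_accuracy(completion, sample["ideal"]), 3))
# → 0.667
```

An exact-match grader would score this completion 0, so a per-token metric like this is closer to how tagging SOTA is actually reported on the Penn Treebank leaderboards.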

andrew-openai commented 1 year ago

Great idea!

sudarshansivakumar commented 9 months ago

I know this thread is from a while back but curious if anyone has managed to do this?

Mukhsin0508 commented 9 months ago

I tried to do this and am still working on it! How about you?
