groq / mlagility

Machine Learning Agility (MLAgility) benchmark and benchmarking tools
MIT License

Create LLM corpus #314

Closed danielholanda closed 1 year ago

danielholanda commented 1 year ago

Issue description

Some LLMs are currently part of the popular_on_huggingface corpus of MLAgility. These models are very large, which makes our benchmarking infrastructure take significantly longer to process them.

Task

Create a corpus composed only of LLMs, since those models must be treated differently from most other models.

Suggested implementation

Example:

from transformers import OpenAIGPTModel, OpenAIGPTConfig

# Instantiate the architecture from its default config; no pretrained weights are downloaded
model = OpenAIGPTModel(config=OpenAIGPTConfig())

Instead of

from transformers import OpenAIGPTModel

# Downloads and loads the full pretrained checkpoint from the Hugging Face Hub
model = OpenAIGPTModel.from_pretrained("openai-gpt")
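
The same config-only pattern should extend to larger LLMs. As a rough sketch (the model choice and config values below are illustrative, and assume a transformers release that includes LLaMA support), a scaled-down LLaMA-style model can be built without downloading any checkpoint:

from transformers import LlamaConfig, LlamaModel

# Illustrative only: a reduced config keeps the architecture representative
# while avoiding the multi-GB pretrained weight download.
config = LlamaConfig(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=2,
    num_attention_heads=8,
)
model = LlamaModel(config=config)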

Suggested models

Decoder-Only

Encoder-Decoder

Encoder-Only

Suggested list of mandatory labels for LLMs

danielholanda commented 1 year ago

@jeremyfowers @ramkrishna2910

ramkrishna2910 commented 1 year ago

Pointer to LLMs: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

jeremyfowers commented 1 year ago

I will introduce two corpora:

- llm
- llm_layer
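
If llm_layer is meant to hold single transformer layers rather than full models (an assumption here, not a confirmed design), one rough sketch is to instantiate an individual decoder block from a config, relying on transformers' internal block classes:

from transformers import GPT2Config
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

# Illustrative only: one decoder block built from a default GPT-2 config,
# small enough to benchmark quickly on its own.
config = GPT2Config()
layer = GPT2Block(config)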

jeremyfowers commented 1 year ago

I will also likely break out anything to do with sanitizing popular_on_huggingface into a separate issue.

danielholanda commented 1 year ago

I really like the idea of having both llm and llm_layer!

jeremyfowers commented 1 year ago

https://pypi.org/project/detoxify/ is a PyPI package required by 3 models in our popular_on_huggingface corpus. However, that package requires transformers==4.22.1, which is pretty old and doesn't have key LLMs like LLaMA. detoxify also appears to be abandoned, with no updates since 2021.

I plan to remove those 3 models from popular_on_huggingface to unblock the LLM work. I don't think those 3 models specifically are a big deal, but this does highlight a potentially bad trend of dep conflicts between our models. Those conflicts seem inevitable as we grow our set of models. We may need to have per-corpus deps or something.
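
Purely as a sketch of the per-corpus dependency idea (the directory layout and helper name below are assumptions, not existing MLAgility behavior), each corpus could ship its own requirements.txt that the tooling installs before benchmarking that corpus:

import subprocess
import sys
from pathlib import Path

def install_corpus_requirements(corpus_dir: str) -> None:
    # Hypothetical helper: install corpus-specific pins if a requirements.txt exists.
    requirements = Path(corpus_dir) / "requirements.txt"
    if requirements.exists():
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", str(requirements)]
        )

# e.g. a popular_on_huggingface/requirements.txt could pin transformers==4.22.1 for
# detoxify without constraining the llm corpus (paths here are illustrative).
install_corpus_requirements("models/popular_on_huggingface")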