huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Request: pretrained distilgpt2-medium, distilgpt2-large models #4969

Closed · joeyism closed this issue 4 years ago

joeyism commented 4 years ago

Plans for distilgpt2-medium and distilgpt2-large

Motivation

While distilgpt2 is useful, I was wondering if there are any plans to create a distilgpt2-medium and a distilgpt2-large. I'm also wondering how distilgpt2-medium would compare to gpt2, and distilgpt2-large to gpt2-medium, in size and performance.

Maybe it's not even worth having those pretrained models if distilgpt2-medium ends up larger than gpt2 and performs worse.
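For a rough sense of the size question, the sketch below estimates GPT-2 parameter counts from the architecture alone, alongside what a hypothetical half-depth student of each size would weigh in at, assuming the same layer-halving ratio distilgpt2 uses relative to gpt2 (6 of 12 layers, same hidden size). The "distilled-*" names are hypothetical; none of these are released checkpoints.

```python
# Back-of-envelope GPT-2 parameter counts, plus a hypothetical "distilled"
# variant that keeps the hidden size and halves the layer count (the ratio
# distilgpt2 uses relative to gpt2). Approximate, not official numbers.

VOCAB, N_POS = 50257, 1024  # shared by all GPT-2 sizes

def gpt2_params(n_layer: int, n_embd: int) -> int:
    """Approximate parameter count for a GPT-2-style decoder
    (tied LM head, so embeddings are counted once)."""
    per_block = 12 * n_embd**2 + 13 * n_embd        # attention + MLP + 2 layer norms
    embeddings = (VOCAB + N_POS) * n_embd           # token + position embeddings
    return n_layer * per_block + embeddings + 2 * n_embd  # + final layer norm

sizes = {"gpt2": (12, 768), "gpt2-medium": (24, 1024),
         "gpt2-large": (36, 1280), "gpt2-xl": (48, 1600)}

for name, (n_layer, n_embd) in sizes.items():
    full = gpt2_params(n_layer, n_embd)
    distilled = gpt2_params(n_layer // 2, n_embd)   # hypothetical half-depth student
    print(f"{name:12s} {full/1e6:7.1f}M  ->  distilled-{name}: {distilled/1e6:7.1f}M")
```

Under that assumption, a half-depth gpt2-medium would still come out around 200M parameters, i.e. larger than gpt2 itself, which is exactly the trade-off raised here.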

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

abisee commented 4 years ago

I'd also be interested in this. The current distilgpt2 is great for use-cases that need cheap/fast compute, but distilled versions of the larger gpt2 models (medium, large, xl) would also be super useful. For example, I am able to fit up to gpt2-large on my GPU, but not gpt2-xl, which means I can't use it. If there were a smaller, distilled version of gpt2-xl, it might be usable for more people.

Are there any plans to distill any larger versions of gpt2?

Thanks!
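Not a substitute for distillation, but as a stopgap for the memory issue: loading the checkpoint in half precision roughly halves the weight memory. A minimal sketch, assuming a recent transformers release (where `from_pretrained` accepts `torch_dtype`) and a CUDA GPU:

```python
# Sketch: loading gpt2-large in half precision to roughly halve weight memory.
# Assumes a recent transformers + torch install and a CUDA GPU.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained(
    "gpt2-large",
    torch_dtype=torch.float16,   # fp16 weights: ~2 bytes/param instead of 4
).to("cuda")
model.eval()

inputs = tokenizer("Distilled GPT-2 models would", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```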

VictorSanh commented 4 years ago

Yes, we can probably work on that. There is a bit of work and exploration to do: it's possible we'll have to use model parallelism tricks to train it in a reasonable time (I haven't checked yet). Applying the distillation to gpt2-xl the way we did for distilgpt2 (same ratios) would still result in a model (24 layers, 1600 hidden dim) that is bigger than gpt2-medium. Would that fit your use-case?

(Sorry for the delayed answer; I don't usually check issues without being pinged/tagged.)
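To make the "same ratios" arithmetic concrete: distilgpt2 keeps gpt2's hidden size (768) and halves its depth (12 layers down to 6), so the analogous student of gpt2-xl would have 24 layers at 1600 hidden dim. A quick sketch to count the parameters of that hypothetical config (random weights only, nothing but config files is downloaded; this is not a released model):

```python
# Sketch: what "same ratios as distilgpt2" would mean for gpt2-xl.
# Builds a hypothetical 24-layer, 1600-hidden student config and counts
# its parameters. Randomly initialised; no pretrained weights involved.
from transformers import GPT2Config, GPT2LMHeadModel

teacher_cfg = GPT2Config.from_pretrained("gpt2-xl")          # 48 layers, n_embd=1600
student_cfg = GPT2Config.from_pretrained(
    "gpt2-xl", n_layer=teacher_cfg.n_layer // 2              # 24 layers, n_embd=1600
)

student = GPT2LMHeadModel(student_cfg)                       # random weights
print(f"hypothetical distilgpt2-xl: {student.num_parameters() / 1e6:.0f}M params")
```

For reference, gpt2-medium is roughly 355M parameters and gpt2-large roughly 774M, so a same-ratio student of gpt2-xl (roughly 800M under this estimate) would still sit above both.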

abisee commented 4 years ago

> Applying the distillation to gpt2-xl the way we did for distilgpt2 (same ratios) would still result in a model (24 layers, 1600 hidden dim) that is bigger than gpt2-medium. Would that fit your use-case?

Yes, if we could squish the performance of gpt2-xl into something sized between gpt2-medium and gpt2-large, that would be really useful!

joeyism commented 4 years ago

> Yes, we can probably work on that. There is a bit of work and exploration to do: it's possible we'll have to use model parallelism tricks to train it in a reasonable time (I haven't checked yet). Applying the distillation to gpt2-xl the way we did for distilgpt2 (same ratios) would still result in a model (24 layers, 1600 hidden dim) that is bigger than gpt2-medium. Would that fit your use-case?
>
> (Sorry for the delayed answer; I don't usually check issues without being pinged/tagged.)

Even a distilgpt2-large would work for my use case.

jokebroker commented 4 years ago

I am also interested in a distilled version of the larger models. For our use-case, this would go a long way to improving cost/performance/feasibility.

jokebroker commented 3 years ago

Bumping this - any word on the availability of the medium/large distilled models?

VictorSanh commented 3 years ago

> Bumping this - any word on the availability of the medium/large distilled models?

I am currently working on it! :)

bjoernhommel commented 3 years ago

Any news on this?

PrithivirajDamodaran commented 3 years ago

Any news on this? 😊

MTSowbug commented 2 years ago

I would be extremely interested in having GPT2-XL distilled to the size of GPT2-L or smaller. Consumer-grade GPUs currently top out at around 8 GB of VRAM, which is enough to run inference with GPT2-L but not with GPT2-XL. Short of finding a beefier GPU, running GPT2-XL efficiently on a desktop PC will only become possible once someone trains a distilled model.
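The 8 GB figure can be sanity-checked with simple weight-memory arithmetic. The sketch below only counts the weights (activations and the KV cache need additional room, so these are optimistic lower bounds), using the commonly quoted parameter counts:

```python
# Rough weight-memory arithmetic behind the "fits on 8 GB?" question.
# Only counts model weights; activations and the KV cache need extra room.
PARAM_COUNTS = {          # approximate published sizes
    "gpt2": 124e6,
    "gpt2-medium": 355e6,
    "gpt2-large": 774e6,
    "gpt2-xl": 1558e6,
}

for name, n_params in PARAM_COUNTS.items():
    fp32 = n_params * 4 / 2**30   # 4 bytes per parameter
    fp16 = n_params * 2 / 2**30   # 2 bytes per parameter
    print(f"{name:12s} weights: {fp32:5.1f} GiB fp32 / {fp16:5.1f} GiB fp16")
```

Under this estimate, GPT2-XL's weights alone are close to 6 GiB in fp32, which leaves little headroom on an 8 GB card, while fp16 or a distilled variant changes the picture considerably.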

Zhreyu commented 1 month ago

hey, any news on this?