huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add POINTER model #8454

Open · dreasysnail opened this issue 3 years ago

dreasysnail commented 3 years ago

🌟 New model addition

Model description

POINTER is a progressive, non-autoregressive text generation pre-training approach, published at EMNLP 2020 by Microsoft Research. POINTER generates fluent text in a progressive and parallel manner: empirically, generation time grows logarithmically with sequence length, and POINTER outperforms existing non-autoregressive text generation approaches on hard-constrained text generation.

The model basically uses the BERT-large architecture, but an additional token is added to the vocabulary. Inference is performed by passing the input to the model iteratively. Since there is no existing model architecture in Huggingface that is compatible, I am not sure how to incorporate this into the model card.
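To make the "extra token plus iterative inference" idea concrete, here is a minimal sketch built on the existing BERT masked-LM pieces in 🤗 Transformers. This is not the POINTER implementation: the token name `[NOI]`, the interleaving scheme, and the fixed number of rounds are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the official POINTER code.
# It demonstrates the two points above: one extra token in the BERT vocabulary
# and inference by repeatedly feeding the model's output back to itself.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
model = BertForMaskedLM.from_pretrained("bert-large-uncased")
model.eval()

# The extra token (its name here is an assumption): predicting it at a slot
# would mean "insert nothing here" in an insertion-style decoder.
NO_INSERT = "[NOI]"
tokenizer.add_tokens([NO_INSERT])
model.resize_token_embeddings(len(tokenizer))
noi_id = tokenizer.convert_tokens_to_ids(NO_INSERT)

def one_insertion_round(token_ids):
    """Interleave [MASK] slots between existing tokens and let the model fill them."""
    mask_id = tokenizer.mask_token_id
    interleaved = []
    for tid in token_ids:
        interleaved += [tid, mask_id]
    input_ids = torch.tensor(
        [[tokenizer.cls_token_id] + interleaved + [tokenizer.sep_token_id]]
    )
    with torch.no_grad():
        pred_ids = model(input_ids=input_ids).logits.argmax(dim=-1)[0]
    out = []
    for i, tid in enumerate(interleaved, start=1):   # offset 1 for [CLS]
        if tid != mask_id:
            out.append(tid)                          # keep existing tokens
        elif pred_ids[i].item() != noi_id:
            out.append(pred_ids[i].item())           # accept the insertion
    return out

token_ids = tokenizer.encode("sunny beach vacation", add_special_tokens=False)
for _ in range(4):                                   # a fixed number of rounds, for illustration
    token_ids = one_insertion_round(token_ids)
print(tokenizer.decode(token_ids))
```

Note that a plain `bert-large-uncased` checkpoint will never predict the freshly added `[NOI]` token; the sketch only shows how the vocabulary extension and the iterative decoding loop could be wired together.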

Open source status

dreasysnail commented 3 years ago

Thanks @patrickvonplaten for taking this. It's nice to work with you again :)

stefan-it commented 3 years ago

Really interesting approach :hugs:

@dreasysnail Do you think it is possible to pre-train a model from scratch on one GPU in a reasonable amount of time? Could you say something about the hardware setup you used and the training time for the pre-training phase :thinking:

dreasysnail commented 3 years ago

Thanks @stefan-it ! Regarding your question:

> @dreasysnail Do you think it is possible to pre-train a model from scratch on one GPU in a reasonable amount of time? Could you say something about the hardware setup you used and the training time for the pre-training phase 🤔

The speed advantage of this algorithm is mostly on the decoding side. For training time, you can expect it to take roughly the same amount of time as, say, fine-tuning a BERT. One GPU is possible, but if your dataset is large the training could be slow. So I would recommend fine-tuning from what we have already pretrained, for faster convergence and better quality.

For your reference, we used 8 to 16 V100 GPUs to pretrain and fine-tune the models. Pretraining takes roughly one week and fine-tuning takes 1-2 days.
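For anyone following the "fine-tune from the released weights" advice, a rough sketch with the standard Trainer API might look like this. The checkpoint path and data file are placeholders (the released POINTER weights are not on the model hub), and the masked-LM collator is only a stand-in for POINTER's own data pipeline.

```python
# Rough sketch of fine-tuning from an existing checkpoint instead of
# pretraining from scratch. Paths below are placeholders.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("path/to/pointer-checkpoint")  # placeholder
model = BertForMaskedLM.from_pretrained("path/to/pointer-checkpoint")        # placeholder

dataset = load_dataset("text", data_files={"train": "train.txt"})            # placeholder data
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pointer-finetuned", num_train_epochs=3),
    train_dataset=tokenized["train"],
    # Standard masked-LM collator as a stand-in for POINTER's insertion objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```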

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.