huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

BART for generating sequence of length more than 1024 tokens #10451

Closed: silentghoul-spec closed this issue 3 years ago

silentghoul-spec commented 3 years ago

I was using the fine-tuning code in transformers/examples/seq2seq to fine-tune BART on my custom dataset, which contains texts and summaries longer than 1024 tokens, but I am getting an index-out-of-bounds error. Is it possible to fine-tune BART to generate summaries of more than 1024 tokens? I have attached the log file for reference.

v100job.txt
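
For context, this kind of indexing failure can be reproduced outside the training script whenever the encoder receives more tokens than BART's learned positional embeddings cover. The snippet below is a hypothetical minimal reproduction, not taken from the attached log; the checkpoint name and dummy text are illustrative:

```python
# Hypothetical reproduction: feeding BART more tokens than its learned
# positional embeddings (1024 positions) cover raises an "index out of range" error.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = "word " * 2000  # well over 1024 tokens once tokenized
inputs = tokenizer(text, return_tensors="pt", truncation=False)

# Without truncation=True, max_length=1024 the position-embedding lookup
# indexes past the embedding table and the forward pass fails.
outputs = model(**inputs)
```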

patil-suraj commented 3 years ago

Hi @silentghoul-spec

For BART the maximum sequence length is 1024 tokens, so it can't process sequences longer than that. You could use the LED model for long-document summarization; here's a notebook which demonstrates how to fine-tune LED: https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb
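
For reference, here's a minimal usage sketch of LED (assuming the allenai/led-base-16384 checkpoint; the generation settings are illustrative, and the linked notebook covers the full fine-tuning workflow):

```python
# Minimal sketch: long-document summarization with LED, which accepts inputs
# of up to 16384 tokens (checkpoint and lengths are assumptions for illustration).
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_text = "..."  # a document far longer than 1024 tokens
inputs = tokenizer(long_text, max_length=16384, truncation=True, return_tensors="pt")

# LED combines local attention with global attention; putting global attention
# on the first token is the usual choice for summarization.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=512,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```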

Also, please ask such questions on the forum https://discuss.huggingface.co/ first.

silentghoul-spec commented 3 years ago

Thanks, @patil-suraj. I wondered why it has a 1024-token limit, since the original paper (https://arxiv.org/pdf/1910.13461.pdf) doesn't mention such a limit. I guess it's because the BART checkpoints currently available were trained with an encoder whose learned positional embeddings cover only 1024 positions. Btw, thanks for pointing me to the discussion forums; I will use them for further questions.
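
That matches what the released configs report; a quick check (a sketch assuming the facebook/bart-large checkpoint):

```python
# The 1024-token limit comes from the checkpoint's learned positional
# embeddings, not from the Transformer architecture itself.
from transformers import BartConfig

config = BartConfig.from_pretrained("facebook/bart-large")
print(config.max_position_embeddings)  # 1024
```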