Hi @silentghoul-spec
For BART the maximum sequence length is 1024 tokens, so it can't process inputs longer than that.
You could use the LED model for long-document summarization; here's a notebook that demonstrates how to fine-tune LED for summarization: https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb
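For reference, a minimal inference sketch with a pretrained LED checkpoint (this assumes the `allenai/led-large-16384-arxiv` checkpoint and a placeholder `long_document` string; adapt it to your own data or fine-tuned model):

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Assumption: using the arXiv-finetuned LED checkpoint; swap in your own fine-tuned model as needed.
model_name = "allenai/led-large-16384-arxiv"
tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

long_document = "..."  # placeholder for your >1024-token input text

# LED accepts inputs up to 16384 tokens, far beyond BART's 1024 limit.
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    num_beams=4,
    max_length=512,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```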
Also please use the forum https://discuss.huggingface.co/ to ask such questions first
Thanks, @patil-suraj. I wondered why it has a token limit of 1024, since the original paper https://arxiv.org/pdf/1910.13461.pdf doesn't mention any such limit. I guess it's because the BART checkpoints currently available were pretrained with a 1024-token limit on the encoder. Btw, thanks for pointing me to the discussion forums; I will use them for further questions.
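A quick way to confirm the limit on a given checkpoint is to inspect its config (a small sketch, assuming the `facebook/bart-large-cnn` checkpoint):

```python
from transformers import BartConfig

# The 1024-token limit comes from the learned position embeddings baked into the checkpoint.
config = BartConfig.from_pretrained("facebook/bart-large-cnn")
print(config.max_position_embeddings)  # 1024 for the released BART checkpoints
```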
I was using the code in transformers/examples/seq2seq to fine-tune on my custom dataset, which contains summaries of texts longer than 1024 tokens, but I am getting an index-out-of-bounds error. Is it possible to fine-tune BART to generate summaries for texts longer than 1024 tokens? I have attached the log file for reference.
v100job.txt
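For context on the index error above: BART's learned position embeddings only cover 1024 positions, so token positions beyond that index past the embedding table. A minimal sketch of the usual workaround, truncating the source text at tokenization time (assumes the `facebook/bart-large-cnn` checkpoint and a placeholder `long_text` string; the log file above is not reproduced here):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

model_name = "facebook/bart-large-cnn"  # assumption: any released BART checkpoint behaves the same way
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

long_text = "..."  # placeholder for a document longer than 1024 tokens

# Without truncation, inputs beyond 1024 tokens typically trigger an
# index-out-of-bounds error in the position embeddings during the forward pass.
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=142)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Truncation keeps fine-tuning from crashing, but it simply discards everything past the first 1024 tokens; for genuinely long inputs, a long-context model such as LED (above) is the better fit.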