Open · XinyuHua opened this issue 5 years ago
Hi,

According to the README, the recommended truncation setting for the summarization (CNN/DM) task is:

-src_seq_length_trunc 400

However, on the training data the average/median source length is 925/841 tokens, and more than 90% of the examples are longer than 400 BPE tokens. Wouldn't it be problematic to throw away the rest of the text, or is this simply an efficiency consideration? Thanks!
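For context, a minimal sketch of how such length statistics can be computed, assuming one whitespace-tokenized BPE example per line; the path `train.src.bpe` is a placeholder, and `asort` requires gawk:

```bash
# Token-length statistics over a BPE-tokenized source file (path is a placeholder).
gawk '{ n = NF; lens[NR] = n; sum += n; if (n > 400) over++ }
      END {
          asort(lens)                                   # gawk-only: sort lengths ascending
          printf "examples: %d\n", NR
          printf "mean length: %.1f\n", sum / NR
          printf "median length: %d\n", lens[int((NR + 1) / 2)]
          printf "longer than 400 tokens: %.1f%%\n", 100 * over / NR
      }' train.src.bpe
```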
Hi,

This is a preprocessing choice we inherited from previous summarization work with OpenNMT, which found that the first 400 tokens are often enough to compose a good summary. That work was largely conducted with LSTMs, though, so performance might improve measurably if the truncation length were increased.
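For anyone who wants to experiment with a longer context, here is a sketch of the preprocessing call with the truncation raised, modeled on the summarization README; the data paths are placeholders, and 800 is only an illustrative value, not a tuned recommendation:

```bash
# Re-run preprocessing with a longer source truncation (800 is illustrative).
python preprocess.py \
    -train_src data/cnndm/train.txt.src \
    -train_tgt data/cnndm/train.txt.tgt.tagged \
    -valid_src data/cnndm/val.txt.src \
    -valid_tgt data/cnndm/val.txt.tgt.tagged \
    -save_data data/cnndm/CNNDM_trunc800 \
    -src_seq_length_trunc 800 \
    -tgt_seq_length_trunc 100 \
    -dynamic_dict \
    -share_vocab \
    -shard_size 100000
```

Note that a longer truncation increases memory use and training time for attention-based models, so the efficiency trade-off mentioned above still applies.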