nlpyang / PreSumm

Code for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders"
MIT License

setting max_pos during preprocessing? #97

Open bugtig opened 4 years ago

bugtig commented 4 years ago

Hello,

Thank you for this fantastic code release! I'm currently running abstractive summarization on inputs longer than 512 tokens, and I changed the max_pos arg in train.py accordingly.

I noticed that you also mention setting the max_pos arg in both preprocessing and training, but I can't find where the max_pos argument is set in the preprocessing step.

I would really appreciate it if you could clarify what you mean by using max_pos in preprocessing!

ohmeow commented 4 years ago

Not sure if this helps, but basing my work on what is currently in the huggingface repo, I did this:

    # wrapper name and signature added here for readability; the body is the original snippet
    def fit_to_block(sequence, block_size, sep_token_id, pad_token_id):
        if len(sequence) > block_size:
            # ensure inclusion of whole sentences: keep everything up to the last
            # sentence separator that falls within block_size
            sent_sep_idxs = [idx for idx, t in enumerate(sequence) if t == sep_token_id and idx < block_size]
            last_sent_sep_idx = min(max(sent_sep_idxs) + 1 if sent_sep_idxs else block_size, block_size)
            sequence = sequence[:last_sent_sep_idx]

        if len(sequence) < block_size:
            # pad up to block_size
            sequence.extend([pad_token_id] * (block_size - len(sequence)))

        return sequence

If the sequence is longer than block_size, I first attempt to shrink it without splitting a sentence. If for whatever reason a single sentence is longer than block_size, I simply truncate the sequence to block_size.
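As a quick illustration of what the snippet does (the token ids below are made up; 102 and 0 are the usual BERT [SEP] and [PAD] ids, and fit_to_block is just the wrapper name used in the sketch above):

    # toy token-id sequence: two "sentences", each terminated by sep_token_id = 102
    seq = [101, 7, 8, 9, 102, 101, 5, 6, 102]
    # block_size = 6 falls inside the second sentence, so only the first sentence
    # is kept, then the result is padded up to block_size
    print(fit_to_block(seq, block_size=6, sep_token_id=102, pad_token_id=0))
    # -> [101, 7, 8, 9, 102, 0]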

hamidreza-ghader commented 3 years ago

> Thank you for this fantastic code release! I'm currently running abstractive summarization on inputs longer than 512 tokens, and I changed the max_pos arg in train.py accordingly.
>
> I noticed that you also mention setting the max_pos arg in both preprocessing and training, but I can't find where the max_pos argument is set in the preprocessing step.
>
> I would really appreciate it if you could clarify what you mean by using max_pos in preprocessing!

This is my question too! The updated section in the documentation says: "For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training". However, there is no max_pos in the preprocessing code. Could you please clarify what exactly it means to set max_pos during preprocessing?
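For what it's worth, on the training side max_pos appears to control how far BERT's learned position embeddings (which only go up to 512) get extended: the extra positions are initialized from a copy of the last learned embedding, as far as I can tell from model_builder.py. Below is a minimal, hypothetical sketch of that pattern; extend_position_embeddings and its arguments are illustrative names, not PreSumm's actual API, so please check the real code.

    import torch.nn as nn

    def extend_position_embeddings(old_emb: nn.Embedding, max_pos: int) -> nn.Embedding:
        # keep the original 512 learned position embeddings and initialize
        # every position beyond them with a copy of the last learned row
        hidden_size = old_emb.weight.size(1)
        new_emb = nn.Embedding(max_pos, hidden_size)
        new_emb.weight.data[:old_emb.num_embeddings] = old_emb.weight.data
        new_emb.weight.data[old_emb.num_embeddings:] = (
            old_emb.weight.data[-1][None, :].repeat(max_pos - old_emb.num_embeddings, 1)
        )
        return new_emb

    # e.g. grow BERT-base's 512 x 768 position table to 800 positions
    extended = extend_position_embeddings(nn.Embedding(512, 768), 800)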

tomql commented 3 years ago

@nlpyang Could you answer the question, please? Indeed, there is no "max_pos" argument in preprocess.py.

AyeshaSarwar commented 3 years ago

@nlpyang Kindly help in this regard. If we want to consider the whole document at its full length, do we just need to replace 512 everywhere with our desired number? Thanks in advance; help is much appreciated.

AyeshaSarwar commented 3 years ago

> Updates: For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training.

So, let's say I want to consider documents with 2500 words, can I just set max_pos to 2500?

@nlpyang
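If the illustrative sketch earlier in the thread is anything to go by, 2500 positions would simply mean roughly 2500 - 512 extra rows that all start from a copy of BERT's last learned position embedding, so it should be possible, just with the new positions starting untrained:

    # reusing the hypothetical extend_position_embeddings helper sketched above:
    # positions 512..2499 all start from a copy of the last learned embedding
    extended = extend_position_embeddings(nn.Embedding(512, 768), 2500)
    print(extended.weight.shape)  # torch.Size([2500, 768])

Note also that max_pos counts BERT subword tokens, not words, so a 2500-word document will usually need a somewhat larger max_pos to fit without truncation.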