bugtig opened this issue 4 years ago
Not sure if this helps, but basing my work on what is thus far in the huggingface repo I did this:
```python
if len(sequence) > block_size:
    # ensure inclusion of whole sentences
    sent_sep_idxs = [idx for idx, t in enumerate(sequence) if t == sep_token_id and idx < block_size]
    last_sent_sep_idx = min(max(sent_sep_idxs) + 1 if len(sent_sep_idxs) > 0 else block_size, block_size)
    sequence = sequence[:last_sent_sep_idx]
if len(sequence) < block_size:
    sequence.extend([pad_token_id] * (block_size - len(sequence)))
return sequence
```
If the sequence is longer than max_len, I first attempt to reduce the size of the sequence without including partial sentences. If for whatever reason there is a sentence that is longer than max_len, I simply truncate it to be equal to max_len.
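If it helps anyone, here is a minimal self-contained sketch of how that snippet could be wired up with a Hugging Face tokenizer. The function name truncate_and_pad, the block size of 512, and the bert-base-uncased tokenizer are just stand-ins for illustration, not anything from this repo:

```python
from transformers import BertTokenizer

def truncate_and_pad(sequence, block_size, sep_token_id, pad_token_id):
    # Same logic as the snippet above: drop any trailing partial sentence,
    # then pad the result back up to block_size.
    if len(sequence) > block_size:
        sent_sep_idxs = [idx for idx, t in enumerate(sequence)
                         if t == sep_token_id and idx < block_size]
        last_sent_sep_idx = min(max(sent_sep_idxs) + 1 if sent_sep_idxs else block_size,
                                block_size)
        sequence = sequence[:last_sent_sep_idx]
    if len(sequence) < block_size:
        sequence.extend([pad_token_id] * (block_size - len(sequence)))
    return sequence

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sents = ["First sentence of the document.", "Second sentence.", "Third one."]

# Encode each sentence as [CLS] ... [SEP] and flatten, roughly the way
# BertSum/PreSumm-style inputs are laid out.
sequence = []
for sent in sents:
    sequence.extend(tokenizer.encode(sent))  # encode() adds [CLS]/[SEP] by default

block = truncate_and_pad(sequence, 512, tokenizer.sep_token_id, tokenizer.pad_token_id)
assert len(block) == 512
```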
Hello,
Thank you for this fantastic code release! I'm currently running the abstractive summarization on input longer than 512, and changed the max_pos arg in train.py accordingly.
But I noticed that you also mentioned setting the max_pos arg in both preprocessing and training, and I can't find where the max_pos argument is set in the preprocessing step.
I would really appreciate it if you could clarify what you mean about using max_pos in preprocessing!
This is my question too! The updated section in the documentation says: "For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training". However, there is no max_pos in the preprocessing code. Could you please clarify what exactly it means to set max_pos during preprocessing?
@nlpyang Could you answer the question, please? There is no "max_pos" in preprocess.py indeed.
@nlpyang Kindly help in this regard. If we want to consider the whole document at its full length, do we just need to replace 512 everywhere with our desired number? Thanks in advance; help is much appreciated.
Updates: For encoding a text longer than 512 tokens, for example 800. Set max_pos to 800 during both preprocessing and training.
So, let's say I want to consider documents with 2500 words; can I just set max_pos to 2500?
@nlpyang
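While waiting for an answer, one thing worth noting: bert-base only ships with 512 learned positional embeddings, so training with max_pos > 512 only works if those embeddings are extended first. Below is a minimal sketch of that idea using transformers and PyTorch; it is not the exact code from this repo, and the copy-the-last-position initialization is just one common heuristic:

```python
import torch
import torch.nn as nn
from transformers import BertModel

max_pos = 800  # e.g. the value from the README example
bert = BertModel.from_pretrained("bert-base-uncased")

old_emb = bert.embeddings.position_embeddings  # nn.Embedding(512, hidden_size)
hidden_size = old_emb.weight.size(1)

new_emb = nn.Embedding(max_pos, hidden_size)
# Keep the 512 pretrained position vectors and initialize the extra positions
# by repeating the last pretrained one (random init is another option).
new_emb.weight.data[:512] = old_emb.weight.data
new_emb.weight.data[512:] = old_emb.weight.data[-1][None, :].repeat(max_pos - 512, 1)
bert.embeddings.position_embeddings = new_emb

# Keep the config and the cached buffers (present in newer transformers
# versions) consistent with the new maximum length.
bert.config.max_position_embeddings = max_pos
bert.embeddings.position_ids = torch.arange(max_pos).expand((1, -1))
if hasattr(bert.embeddings, "token_type_ids"):
    bert.embeddings.token_type_ids = torch.zeros((1, max_pos), dtype=torch.long)
```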