microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Quick question about "masked_block_start" #163

Closed Derekkk closed 4 years ago

Derekkk commented 4 years ago

Hi,

Thanks for sharing the code! I have a quick question: in "MASS-summarization/masked_dataset.py", it seems you choose multiple spans from src_item as targets:

masked_pos = []
for i in range(1, len(src_item), self.block_size):  # starts at 1, so position 0 is never masked
    block = positions[i : i + self.block_size]  # positions covered by this block
    masked_len = int(len(block) * self.mask_prob)  # length of the span masked in this block
    # pick a random start so the masked span fits inside the block
    masked_block_start = np.random.choice(block[:len(block) - int(masked_len) + 1], 1)[0]
    masked_pos.extend(positions[masked_block_start : masked_block_start + masked_len])
masked_pos = np.array(masked_pos)

and the target is the concatenation of all chosen spans, e.g., src = [1,2,3,4,mask,mask,7,8,mask,mask,11,12] and tgt = [5,6,9,10].
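To make the selection concrete, here is a minimal standalone sketch of the same loop applied to this toy input. The function name choose_masked_positions, the values block_size=4 and mask_prob=0.5, and the fixed seed are illustrative assumptions, not names or defaults from the repo (which keeps block_size and mask_prob on self):

import numpy as np

def choose_masked_positions(src_item, block_size=4, mask_prob=0.5, seed=0):
    """Pick one masked span per block and return all masked positions."""
    rng = np.random.RandomState(seed)
    positions = np.arange(len(src_item))
    masked_pos = []
    for i in range(1, len(src_item), block_size):  # position 0 is never masked
        block = positions[i : i + block_size]
        masked_len = int(len(block) * mask_prob)  # span length within this block
        # random start so the span fits inside the block
        start = rng.choice(block[: len(block) - masked_len + 1], 1)[0]
        masked_pos.extend(range(start, start + masked_len))
    return np.array(masked_pos)

src_item = list(range(1, 13))                 # [1, 2, ..., 12]
masked = choose_masked_positions(src_item)
print(masked)                                 # one short span per block
print([src_item[p] for p in masked])          # the tokens that form the target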

I want to confirm whether my understanding is correct, since in the original paper you only chose one segment for each input. Thanks a lot!

StillKeepTry commented 4 years ago

Yes, your understanding is correct. Our original paper focused only on sentence-level tasks, while some summarization tasks (like CNN/DM) are at the document level. To handle the longer context (e.g., 512 tokens or more), we adopt a multi-span masking strategy.
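For contrast, here is a hedged sketch of the sentence-level setup the paper describes (one contiguous fragment per input, covering roughly 50% of the tokens); the name single_span_positions and the parameter fragment_ratio are illustrative assumptions, not repo code:

import numpy as np

def single_span_positions(seq_len, fragment_ratio=0.5, seed=0):
    """Sentence-level MASS: mask one contiguous fragment of the input."""
    rng = np.random.RandomState(seed)
    masked_len = int(seq_len * fragment_ratio)
    # start at 1 so position 0 is never masked, matching the loop above
    start = rng.randint(1, seq_len - masked_len + 1)
    return np.arange(start, start + masked_len)

print(single_span_positions(512))  # one long 256-token run

On a 512-token document this masks a single long run, whereas the per-block loop above scatters many short spans across the document.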