Open p0p4k opened 10 months ago
I think these are good questions :) The current code makes sure that a fixed percentage (`mask_prob`) of tokens is masked in every sequence, whereas the original paper seems to mask each token independently with probability `mask_prob`, i.e. it masks `mask_prob` of the tokens only in expectation (so the number of masked tokens can vary from sequence to sequence).
My hunch is that the latter is a more general augmentation than the former, so it may be more robust, but I haven't tested this hypothesis.
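A minimal sketch of the distinction, purely for illustration (this is not the repository's code; the helper names `fixed_count_mask` and `bernoulli_mask` are made up here):

```python
import torch

def fixed_count_mask(batch: int, seq: int, mask_prob: float) -> torch.Tensor:
    # Mask exactly int(seq * mask_prob) positions per sequence (fixed-count scheme).
    num_mask = min(int(seq * mask_prob), seq - 1)
    rand = torch.rand(batch, seq)                          # iid noise per position
    indices = rand.topk(num_mask, dim=-1).indices          # num_mask random positions per row
    return torch.zeros(batch, seq, dtype=torch.long).scatter(1, indices, 1).bool()

def bernoulli_mask(batch: int, seq: int, mask_prob: float) -> torch.Tensor:
    # Mask each position independently with probability mask_prob (paper-style, in expectation).
    return torch.rand(batch, seq) < mask_prob

print(fixed_count_mask(4, 16, 0.15).sum(dim=-1))  # same count in every row: int(16 * 0.15) = 2
print(bernoulli_mask(4, 16, 0.15).sum(dim=-1))    # varies row to row, mean ≈ 16 * 0.15 = 2.4
```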
From the paper
https://github.com/lucidrains/x-transformers/blob/90cef69e272f74756a0e5aa1ddd4523c0a23e49a/x_transformers/autoregressive_wrapper.py#L274-L280
I am still trying to understand the code, and I have two questions:
(1) Will there ever be a case where `seq - 1` is bigger than `int(seq * self.mask_prob)`, given that we have already asserted earlier in the code that `mask_prob` is always < 1?
(2) We are masking based on a probability value, so doesn't that mean the model should sometimes get to see more than `(1 - mask_prob)` of the tokens? But here we force the exact ratio throughout? And does using normal vs. uniform noise make any big difference? Thanks!
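For reference, here is a rough standalone sketch of the masking logic being discussed (this mirrors, but is not, the linked wrapper code; the empirical check at the end is only an illustration under these assumptions):

```python
import torch

def random_subset_mask(batch: int, seq: int, mask_prob: float, noise: str = "normal") -> torch.Tensor:
    # Select a fixed number of positions per sequence by taking the top-k of iid noise.
    # Note: for mask_prob < 1, int(seq * mask_prob) <= seq - 1 already, so the min() acts as a safety cap.
    num_mask = min(int(seq * mask_prob), seq - 1)
    rand = torch.randn(batch, seq) if noise == "normal" else torch.rand(batch, seq)
    indices = rand.topk(num_mask, dim=-1).indices
    return torch.zeros(batch, seq, dtype=torch.long).scatter(1, indices, 1).bool()

# The top-k of iid continuous noise is a uniformly random size-k subset, so normal vs. uniform
# noise should give statistically identical masks: each position ends up masked with
# frequency num_mask / seq under either choice.
for noise in ("normal", "uniform"):
    masks = torch.stack([random_subset_mask(1, 20, 0.15, noise)[0] for _ in range(5000)])
    print(noise, masks.float().mean(dim=0))   # every entry ≈ 3 / 20 = 0.15
```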