inspirit opened 2 years ago
@inspirit but you need at least one token that outputs a logit
yeah, or maybe the other way is to randomly curtail the prefix during training, in which case it should generalize to being conditioned on anywhere from zero prefix length up to the maximum
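a minimal sketch of what that curtailment could look like in the collate step, assuming sequences arrive at a fixed length and the model's forward accepts a `prefix_len` argument (both of those are assumptions, not necessarily this repo's actual API):

```python
import torch

def collate_with_random_prefix(batch, max_prefix_len):
    # batch: list of equal-length 1-d LongTensors of token ids
    seq = torch.stack(batch)

    # sample a prefix length uniformly in [0, max_prefix_len],
    # but always leave at least one non-prefix token so the model
    # still has at least one position that outputs a logit
    upper = min(max_prefix_len, seq.shape[-1] - 1)
    prefix_len = torch.randint(0, upper + 1, ()).item()
    return seq, prefix_len
```

the training loop would then pass `prefix_len` through to the model, so over the course of training it sees everything from zero-length conditioning up to the maximum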
feel like the paper should have addressed this, especially if book-level autoregressive generation is the goal here
i think having the prefix and query dynamically sized is best, both for robustness and for inference usage
@inspirit yeah, maybe i'll just have to push this responsibility to the dataloader
@lucidrains I searched the paper and don't see a prefix length mentioned. I am confused about this prefix length issue: wouldn't you want the prefix length to be the full size of the context window? (I guess you meant the length of the latents)
I think we can get away with having learnable null latents in case we don't have a prefix initially
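a hedged sketch of that idea, assuming the prefix gets encoded into a `(batch, num_latents, dim)` latent tensor; `NullLatentFallback` and its names are purely illustrative:

```python
import torch
from torch import nn

class NullLatentFallback(nn.Module):
    def __init__(self, num_latents, dim):
        super().__init__()
        # learnable stand-in latents, used when no prefix is available
        self.null_latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)

    def forward(self, prefix_latents=None, batch_size=1):
        if prefix_latents is None:
            # broadcast the learned null latents across the batch
            return self.null_latents.unsqueeze(0).expand(batch_size, -1, -1)
        return prefix_latents
```

since the null latents are parameters, they get trained whenever a batch runs without a prefix, which lines up nicely with the randomly-curtailed-prefix idea above when the sampled prefix length is zero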