In theory, it shouldn't be necessary to prevent sampling of gap characters. In practice, we find that we sometimes get better generations with the restriction. Feel free to play around with it though!
In the generation code for the OADM models, we are explicitly preventing the sampling of gap characters (e.g., https://github.com/microsoft/evodiff/blob/main/evodiff/generate_msa.py#L211).
Is this necessary, or is it an explicit choice to only sample fixed-length sequences given a sub-sampled MSA? Given that `MASK` and `GAP` are distinct tokens, and the ground-truth alignment of the query sequence can contain gap tokens, it seems that allowing diffusion to `GAP` would be desirable.

I noticed that in the conditional MSA generation this check isn't present (https://github.com/microsoft/evodiff/blob/main/evodiff/conditional_generation_msa.py#L605C1-L605C102), despite the comment's claim. I would like to better understand what the difference is here.
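To make the question concrete, here is a minimal sketch of what "preventing the sampling of gap characters" could look like when unmasking a single position. It is not the EvoDiff code: the toy vocabulary, the token indices, and the `sample_token` helper with its `allow_gap` flag are all hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical toy vocabulary; the real tokenizer uses different symbols and indices.
VOCAB = ["A", "C", "D", "-", "#"]  # "-" stands for GAP, "#" for MASK
GAP_ID = VOCAB.index("-")
MASK_ID = VOCAB.index("#")


def sample_token(logits, rng, allow_gap=False):
    """Sample one token id from per-token logits.

    MASK is never a valid output. GAP is excluded unless allow_gap is True.
    """
    masked = list(logits)  # copy so the caller's logits are untouched
    masked[MASK_ID] = -math.inf
    if not allow_gap:
        masked[GAP_ID] = -math.inf
    top = max(masked)
    # Unnormalized softmax weights; excluded tokens get weight exp(-inf) = 0.
    weights = [math.exp(x - top) for x in masked]
    return rng.choices(range(len(masked)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [0.1, 0.2, 0.3, 5.0, 1.0]  # the model strongly prefers GAP here

no_gap = [sample_token(logits, rng) for _ in range(200)]
with_gap = [sample_token(logits, rng, allow_gap=True) for _ in range(200)]
```

With the restriction on, the position is always filled with a residue even when the model puts most of its probability on `GAP`. With `allow_gap=True`, the same logits mostly produce `GAP`, so the generated query row can end up shorter than the alignment width once gaps are stripped.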