microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models
MIT License

Allow diffusion to gap character in OADM models? #22

Closed: elibixby closed this issue 1 year ago

elibixby commented 1 year ago

The generation code for the OADM models explicitly prevents sampling of gap characters (e.g., https://github.com/microsoft/evodiff/blob/main/evodiff/generate_msa.py#L211 ).

Is this necessary, or is it an explicit choice to sample only fixed-length sequences given a sub-sampled MSA? Given that MASK and GAP are distinct tokens, and that the ground-truth alignment of the query sequence can contain gap tokens, it seems that allowing diffusion to GAP would be desirable.

I noticed that in the conditional MSA generation this check isn't present, despite the comment's claim: https://github.com/microsoft/evodiff/blob/main/evodiff/conditional_generation_msa.py#L605C1-L605C102. I would like to better understand what the difference is here.

yangkky commented 1 year ago

In theory, it shouldn't be necessary to prevent sampling of gap characters. In practice, we find that the restriction sometimes gives better generations. Feel free to play around with it though!
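
For anyone who wants to experiment with the restriction, here is a minimal sketch of the general idea: zero out the gap token's probability before sampling, with a flag to turn the restriction off. This is not the code in `generate_msa.py`. The alphabet, `GAP_INDEX`, `masked_softmax`, and `sample_token` are hypothetical names made up for illustration, and it uses only the Python standard library rather than the repo's PyTorch tensors and tokenizer.

```python
import math
import random

# Hypothetical alphabet: 20 amino acids plus a gap character.
# The real EvoDiff tokenizer has its own alphabet and indices.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY") + ["-"]
GAP_INDEX = ALPHABET.index("-")


def masked_softmax(logits, banned=()):
    """Softmax over logits, giving exactly zero probability to banned indices."""
    banned = set(banned)
    kept = [x for i, x in enumerate(logits) if i not in banned]
    m = max(kept)  # subtract the max for numerical stability
    exps = [0.0 if i in banned else math.exp(x - m) for i, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, allow_gap=False, rng=random):
    """Sample one token index; the gap token is excluded unless allow_gap is True."""
    banned = () if allow_gap else (GAP_INDEX,)
    probs = masked_softmax(logits, banned)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Logits that strongly favor the gap token.
    logits = [0.0] * 20 + [5.0]
    restricted = [ALPHABET[sample_token(logits, allow_gap=False, rng=rng)] for _ in range(10)]
    free = [ALPHABET[sample_token(logits, allow_gap=True, rng=rng)] for _ in range(10)]
    print("gap banned: ", "".join(restricted))
    print("gap allowed:", "".join(free))
```

With `allow_gap=False` the sampled position is always a residue, so the generated query keeps a fixed length. With `allow_gap=True` the model's own gap probability is used, which is the behavior the question asks about.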