microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models
MIT License

Using generate_msa() #19

Closed mcale6 closed 12 months ago

mcale6 commented 1 year ago

When using MSA_OA_DM_MAXSUB() and generate_msa() I get the following error:

 in A3MMSADataset.__init__(self, selection_type, n_sequences, max_seq_len, data_dir, min_depth)
    351     _lengths = np.load(data_dir+'openfold_lengths.npz')['ells']
    352     lengths = np.array(_lengths)[keep_idx]
--> 353     all_files = np.array(all_files)[keep_idx]
    354     print("filter MSA depth > 64", len(all_files))
    356 # Re-filter based on high gap-contining rows

IndexError: index 1 is out of bounds for axis 0 with size 1

Do I need to have the OpenFold dataset (not only the weights but also the MSAs) in the evodiff/data/openfold folder?

sarahalamdari commented 12 months ago

The generate_msa() function generates a new query MSA from an aligned MSA sampled from OpenFold, so in this case, yes, it needs the OpenFold dataset.
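The IndexError in the traceback is the kind of error NumPy raises when an array of file names is shorter than the index array applied to it, which would fit a data directory that has the lengths file but not the MSA files. The sketch below reproduces that pattern with made-up values; the file name and lengths are hypothetical and it does not use EvoDiff's actual loading code.

```python
import numpy as np

# Hypothetical values: the lengths file describes three MSAs,
# but only one entry was found when listing the data directory.
lengths = np.array([120, 200, 80])
all_files = np.array(["only_one_msa.a3m"])

# The indices to keep are computed from the lengths,
# not from the files that are actually present.
keep_idx = np.where(lengths > 100)[0]

try:
    all_files[keep_idx]  # indexes past the end of the one-element array
    error_message = None
except IndexError as err:
    error_message = str(err)

print(error_message)
```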

If you want to generate using a custom MSA, I would suggest the generate_query_oadm_msa_simple() function.

brycejoh16 commented 11 months ago

Does your comment mean we can "generate custom MSAs" from a given query sequence using the generate_query_oadm_msa_simple() function? I looked through the code for this function, and to the best of my knowledge it generates a query sequence given a custom MSA, not the other way around.

Additional questions:

(1) Can we generate a custom MSA from a query sequence using any existing functions in the code, or will we have to write our own?

(2) If one wanted to generate 1,000 different query sequences given a custom MSA, how would one go about doing that? I'm assuming you just call generate_query_oadm_msa_simple() many times, correct? And each time you sample 64 different sequences from the custom MSA, correct?

The paper was also awesome; thanks for doing this, EvoDiff team!

sarahalamdari commented 11 months ago

It's certainly possible to generate an MSA given a query sequence, although as a caveat we don't evaluate this in our preprint. There is an example of doing this in generate_msa(start_query=True). Although I have not added the functionality to do this from a custom MSA, it is fairly straightforward.

To do this, initialize an MSA with all-mask tokens and put the query sequence of interest in the first row: src[i][0][:seq_len] = query_sequences[i]. Then you can generate over the non-query rows: x_indices = np.arange(0,1).
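The initialization step described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the alphabet, the mask token id, and the helper init_masked_msa are hypothetical and do not reflect EvoDiff's tokenizer or API. Note that np.arange(0, 1) evaluates to [0]; the sketch assumes the non-query rows of an MSA with n_sequences rows are np.arange(1, n_sequences), which is worth checking against generate_msa(start_query=True).

```python
import numpy as np

# Hypothetical integer encoding; EvoDiff's real tokenizer differs.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = len(ALPHABET)  # assumed id for the mask token


def init_masked_msa(query, n_sequences):
    """Return an (n_sequences, len(query)) array: row 0 holds the query, all other rows are masked."""
    token_ids = np.array([ALPHABET.index(aa) for aa in query])
    msa = np.full((n_sequences, len(query)), MASK_ID, dtype=np.int64)
    msa[0, :len(query)] = token_ids  # the query occupies the first row
    return msa


msa = init_masked_msa("MKTAYIAK", n_sequences=4)

# Rows that remain fully masked and would be generated (assumption, see above).
non_query_rows = np.arange(1, msa.shape[0])
```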

For your second question: correct, the best way is to call the function many times. Each call should sample a unique subset of sequences from the MSA.
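One way to draw a unique subset per call is sketched below. The helper sample_unique_subsets is hypothetical and only returns row indices; passing each subset to generate_query_oadm_msa_simple() is left out because its arguments are not given in this thread. The default subset size of 64 follows the number mentioned in the question.

```python
import math

import numpy as np


def sample_unique_subsets(n_rows, n_subsets, subset_size=64, seed=0):
    """Return n_subsets distinct lists of row indices, each of length subset_size."""
    if subset_size > n_rows:
        raise ValueError("subset_size cannot exceed the number of MSA rows")
    if n_subsets > math.comb(n_rows, subset_size):
        raise ValueError("not enough distinct subsets exist")

    rng = np.random.default_rng(seed)
    seen = set()
    subsets = []
    while len(subsets) < n_subsets:
        # Sample rows without replacement so no sequence repeats within a subset.
        idx = rng.choice(n_rows, size=subset_size, replace=False)
        key = tuple(sorted(int(i) for i in idx))
        if key not in seen:  # skip a subset that was already drawn
            seen.add(key)
            subsets.append(list(key))
    return subsets


# Example: 10 subsets of 64 rows from a custom MSA with 500 sequences.
subsets = sample_unique_subsets(n_rows=500, n_subsets=10)
```

Each returned index list can then be used to select rows from the custom MSA before one generation call.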

Thank you for using evodiff and the feedback!