microsoft / evodiff

Generation of protein sequences and evolutionary alignments via discrete diffusion models
MIT License

KeyError: '!' during conditional generation #17

Closed: cmwilson252 closed this issue 1 month ago

cmwilson252 commented 10 months ago

Hello, I am getting a KeyError when running the conditional generation code copied verbatim from the example notebook: https://github.com/microsoft/evodiff/blob/main/examples/evodiff.ipynb

from evodiff.pretrained import MSA_OA_DM_MAXSUB
from evodiff.generate_msa import generate_query_oadm_msa_simple
import re

checkpoint = MSA_OA_DM_MAXSUB()
model, collater, tokenizer, scheme = checkpoint

path_to_msa = 'bfd_uniclust_hits.a3m'
n_sequences = 64  # number of sequences in MSA to subsample
seq_length = 256  # maximum sequence length to subsample
selection_type = 'random'  # or 'MaxHamming'; MSA subsampling scheme

tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)
print("New sequence (no gaps, pad tokens)", re.sub('[!-]', '', generated_sequence[0][0]))

The error can be traced back to:

evodiff/utils.py, line 247:
    return np.array([self.a_to_i[a] for a in seq[0]])  # for nested lists

The alphabet does not seem to include '!', which should be the padding token. This alphabet appears to be imported from sequence_models.constants as MSA_ALPHABET.
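As a minimal sketch of the failure mode described above: a tokenizer that maps characters to indices through a plain dictionary raises KeyError for any character outside its alphabet. The alphabet string and the ToyTokenizer class below are hypothetical stand-ins for illustration, not the actual contents of sequence_models.constants or evodiff/utils.py.

```python
# Hypothetical alphabet for illustration only; it deliberately omits '!'.
TOY_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
PAD = "!"


class ToyTokenizer:
    """Dictionary-based tokenizer, loosely mirroring the lookup in the traceback."""

    def __init__(self, alphabet):
        self.a_to_i = {a: i for i, a in enumerate(alphabet)}

    def tokenize(self, seq):
        # Same pattern as the failing line: a direct dict lookup per character.
        return [self.a_to_i[a] for a in seq]


def try_tokenize(tokenizer, seq):
    """Return the token list, or the offending character if the lookup fails."""
    try:
        return tokenizer.tokenize(seq)
    except KeyError as err:
        return err.args[0]


without_pad = ToyTokenizer(TOY_ALPHABET)
with_pad = ToyTokenizer(TOY_ALPHABET + PAD)

failed_on = try_tokenize(without_pad, "AC!")  # lookup fails on '!'
tokens = try_tokenize(with_pad, "AC!")        # succeeds once '!' is in the alphabet
```

In this toy setup the error disappears either when the pad character is added to the alphabet or when no pad characters reach the tokenizer in the first place.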

Also, this is much less important, but I noticed there are three instances of "tokeinzed_sample" as a variable name in the example notebook that are almost certainly meant to be "tokenized_sample".

sherryliu987 commented 7 months ago

If you're struggling to install EvoDiff locally, feel free to try https://www.tamarind.bio/evodiff, a website which offers a no-code interface for bioinformatics tools including protein design with EvoDiff for free.

btroppo commented 3 months ago

@cmwilson252 did you end up finding a solution? I am experiencing the same problem now.

btroppo commented 3 months ago

Note: this is fixed by reducing n_sequences to a number <= the number of sequences in your MSA.
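The fix above can be sketched as a small pre-check: count the sequences in the .a3m file and cap n_sequences at that count before calling generate_query_oadm_msa_simple. The helper names count_a3m_sequences and cap_n_sequences are hypothetical and not part of EvoDiff. The sketch assumes every sequence record starts with a '>' header line.

```python
def count_a3m_sequences(path):
    """Count records in an A3M/FASTA-style file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def cap_n_sequences(requested, path):
    """Return the requested subsample size, capped at the MSA's sequence count."""
    available = count_a3m_sequences(path)
    if available == 0:
        raise ValueError(f"No sequences found in {path}")
    return min(requested, available)
```

With the variables from the original post, this would be used as n_sequences = cap_n_sequences(64, path_to_msa).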

Hlunlun commented 2 months ago

So does the input have to be an .a3m file?

Hlunlun commented 2 months ago

Yes, it must be an .a3m file!
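For anyone unfamiliar with the format, here is a minimal sketch of reading an .a3m file. It assumes the usual A3M convention: FASTA-style '>' headers, with lowercase letters marking insertions relative to the query, so removing them leaves every row the same length. The function read_a3m is a hypothetical helper, not EvoDiff's own parser.

```python
import string

# Translation table that deletes lowercase letters (A3M insertion columns).
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def read_a3m(path):
    """Return a list of (header, aligned_sequence) pairs from an A3M file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence line: drop insertions so columns stay aligned.
                chunks.append(line.translate(_DROP_LOWERCASE))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The length of the returned list is the number of sequences in the MSA, which is the upper bound for n_sequences mentioned earlier in the thread.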