sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License
460 stars, 65 forks

It may be better to let the user decide whether redundant (identical) sequences are allowed #212

Closed: wenyuhaokikika closed this issue 1 year ago

wenyuhaokikika commented 1 year ago

bio_embeddings raises the following exception:

bio_embeddings.utilities.exceptions.MD5ClashException: There is at least one MD5 hash clash.
This most likely indicates there are multiple identical sequences in your FASTA file.
MD5 hashes are used to remap sequence identifiers from the input FASTA.
This error exists to prevent wasting resources (computing the same embedding twice).
There's a (very) low probability of this indicating a real MD5 clash.

I think the user should be able to choose this in the YAML config file. I only have a few redundant sequences, and it is very difficult to track down and record them.

sacdallago commented 1 year ago

Hi, there is already an option for that: https://github.com/sacdallago/bio_embeddings/issues/42#issuecomment-670849033
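
For anyone who hits this error and wants to see which entries are duplicated before changing any settings, the check described in the exception message can be approximated with the standard library alone. This is a minimal sketch, not the code bio_embeddings uses: it assumes the hash is taken over the raw sequence string, and the helper names `read_fasta` and `find_duplicate_sequences` are made up for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_duplicate_sequences(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each MD5 digest shared by more than one record to its headers.

    Assumption: the digest is computed over the sequence string only,
    which is how the exception message describes the remapping.
    """
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for header, sequence in read_fasta(lines):
        digest = hashlib.md5(sequence.encode("utf-8")).hexdigest()
        by_hash[digest].append(header)
    return {d: h for d, h in by_hash.items() if len(h) > 1}


# Hypothetical input: seq1 and seq2 carry the same sequence.
example = """>seq1
MKTAYIAK
>seq2
MKTAYIAK
>seq3
GAVLIPFW
""".splitlines()

duplicates = find_duplicate_sequences(example)
for digest, headers in duplicates.items():
    print(digest, headers)
```

To check a real file, pass an open file handle instead of the example lines, e.g. `find_duplicate_sequences(open("my_sequences.fasta"))`, where the file name is a placeholder.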