soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

[Question] Why are ^@ characters in the representative fasta #674

Open jolespin opened 1 year ago

jolespin commented 1 year ago

Expected Behavior

Current Behavior

Here's the rep_seq.fasta file:

>protein_A
MKNNSQDQQKLLKLLLQKKGISFKKVNTIPKRQSSNSELIPISLTQLELWFFAQFYPENC
IYNLPCIYRIEGLLNVPALEESLREIVKRHESLRTTFTCIDGKVFQKITDDPVFDFSILD
LQRLSEVEQKKETQRLLSAEINRSFDLEKDSLFCSKIIKFAENNYLLIMTIHHIIADGWS
LNILTKELGILYEAFCLNKPSPLPELPIQYGDFSLWQWQSIKDNSWQSQLNFWKKHIGIN
PPILKLPTDYPRTTAQPVEAAIHHFLISQNLTDALKSLSHQEKATLYMTLLAALKILLFY
YTEQTEIIIGGVAANRNQPETQDLIGLMVQFLPLYTHVEGQANFRDLLHQVKEMSLEFDA
YQEIPFITLVEKLKPIRDTQYALFHQVIFLFQNAPKQDLKFINLTVTDELLKMDIEPHSA
ENDLTISLEETEQGMKGAFVYKADLFTAETITQMEQYFQQLLKNIVTDPNKTISDLCPFS
ESVSFSDNTSPLSLSINEIKSNDLPHNLTRETMVNIWQEVLELEEISVNDNFFDLGGDSQ
KVIQVIDKIGKILKINCTLRDFFENPTVAR^@MTNHLFDLTGKVAIITGAARGIGRVLAQGLAQAGAKVVIGDINQVGAEQTVQLIQEAGGE
AIAIQTDVRQRQACQNLINQTVANYGQLDIMVCNAGVEILKNTDELEEFEWDQVINVDLK
GYFNCAQLATKQMIKQGTKGSIIMNSSICAFVAVPKSSGAYSAAKGGVNQLVKSLAVELA
SHKIRVNAFAPGYMNNMMEGTEGLRSTSDEMDELYTRIPMKRTGDLEELIGPVVFLASEA
SSYVTGAILMVDGGY^@MSKMNHNSQDKQKLLKLLLQKKGIGVKTNTIPTRNPSQLVPLSFSQERLWFLYQLEANGY
TYNMPFRFQIDGNLDVNIFRKALETIMQRHELLRTCFQEVDKTPRQIIKLKIQLNLPLLD
LLQELSHLYEAFDKNNLILDQISLFNTVIFPFGKDNSYFAYFTKSGFLT^@MFNLKHQLYLLISKFLEKQKRIEKHKKLSDNATFHSSVKFIGKCINYREDKTLIQIGENT
VIIGELAIFPFGGKIEIGRNCYIGEGTRIRSATSIKIGNEVIISDDVSIYDTDAHSLNYV
LRQKEFMEVLILNNLIKDAKDVDIQSAPVVIEDHVWIGFNVAILKGVTIGKGAIIGAGSV
VTKDVEPFTIVAGNPAKIIK^@MNVAEKTSDSINLKEDVRTFNQSLQEFLRFIQVYWYPKEHQDRAFSQIIRSWGMLVLLFL
LLVGLVGLNAFNSFVFRDLITFTEARDAEKLTHLVIIYAITLGSMTFFGGLSKFLKKLIA
LDWYQWINSSILQKYFKNRAYYQINFKGDIENPDQRLSQEIQPIARTTMDFLTTCVEKVM
EMLVFIAILWSISKTISIVLLVYTIIGNIVATYITQQLNKVNKQQLEIEGTYKYAMTHVR
THAESIAFFRGEEKELNIIQRKFNQVIKIMIERINWERTQEFFNRGYQSIVQIFPFLIVS
PLYISGEIEFGQVNQVSYCCYFFSTALSVLVDEFGRSREFINYIERLEEFYQALEGVSEQ
TNPVNTIKVIEDNNLAFDDVTLQTPDAAKVIVEHLSLSVEPGEGLLIVGPSGRGKSSLLR
AISGLWNTGTGHLVRPPLDDLLFLPQRPYIILGNLREQLIYPQTTTEMSESQLKEILQEV
NLQDVLNRIKNFDEEVDWENILSLGEQQRLAFARLFVNQPDFVILDEATSALDLKNEDHL
YKQLQQTGKTFISVGHRESLFNYHQRVLELSEDSTWRLMRMADYHPSTAIATHSNNEQTV
DETIEIVSEINNQDNFTHQEMQKLTNYSLSTIKNKARRGQTIVANDGMSYRYDK

Steps to Reproduce (for bugs)

mmseqs easy-linclust mmseqs_100/mmseqs2_rep_seq.gt11.fasta mmseqs_90/mmseqs2 tmp --min-seq-id 0.9 -c 0.8 --cov-mode 1 --dbtype 1

MMseqs Output (for bugs)


No errors

Context

I'm confused about what the ^@ characters are doing. It looks like they are concatenating the proteins?

Your Environment


milot-mirdita commented 1 year ago

That's not supposed to happen. These are null bytes that separate entries in MMseqs2 databases. For some reason MMseqs2 read past an entry boundary and included the following entries too. Can you please send us (an excerpt of) the input FASTA file so we can try to debug this?

jolespin commented 1 year ago

Thanks for looking into this!

Here's the clustered file that produced the null bytes. It's relatively small. mmseqs2_rep_seq.gt11.fasta.gz

Not sure if it helps, but I've been searching for null bytes with grep -Pa '\x00' [filename]
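
Not every grep build supports -P, so a more portable check can help when reproducing this on another machine. The following is a minimal sketch using only tr, wc, grep and cut. The function names count_nul_bytes and nul_line_numbers are hypothetical helpers, not part of MMseqs2, and the sketch assumes the FASTA file contains no 0x01 control bytes.

```shell
# Hypothetical helper: print how many NUL (0x00) bytes a file contains.
count_nul_bytes() {
    # Delete every byte except NUL, then count the bytes that remain.
    LC_ALL=C tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: print the 1-based numbers of lines containing a NUL byte.
nul_line_numbers() {
    # Map NUL to 0x01 (assumed absent from the input) so that line-oriented
    # tools treat the stream as text, then report the matching line numbers.
    LC_ALL=C tr '\000' '\001' < "$1" \
        | LC_ALL=C grep -n "$(printf '\001')" \
        | cut -d: -f1
}
```

Calling count_nul_bytes on a clean FASTA file should print 0; any other value means at least one entry boundary leaked into the output, and nul_line_numbers shows which lines to inspect.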

milot-mirdita commented 1 year ago

You can add the --createdb-mode 0 parameter as a workaround.

Edit: A space-saving optimization is going wrong. The check that decides whether the optimization works correctly depends on --dbtype not being set. The check should not depend on this parameter, as it's unrelated. Leaving out --dbtype should also fix the problem.

milot-mirdita commented 1 year ago

This should now be fixed in 6b93884.

jolespin commented 1 year ago

Awesome! Thank you for such a quick turnaround. What is the best way to update my installation?

milot-mirdita commented 1 year ago

You don't need to. Either drop the --dbtype parameter or add --createdb-mode 0. Either should fix your issue.
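
For reference, the two workarounds applied to the command from the report might look like the sketch below. The wrapper function names are hypothetical; the flags are exactly those quoted in this thread, and the three positional arguments are the input FASTA, the output prefix, and the temporary directory.

```shell
# Workaround 1 (hypothetical wrapper): the reported command without --dbtype.
cluster_without_dbtype() {
    mmseqs easy-linclust "$1" "$2" "$3" \
        --min-seq-id 0.9 -c 0.8 --cov-mode 1
}

# Workaround 2 (hypothetical wrapper): keep --dbtype 1 and add --createdb-mode 0.
cluster_with_createdb_mode0() {
    mmseqs easy-linclust "$1" "$2" "$3" \
        --min-seq-id 0.9 -c 0.8 --cov-mode 1 --dbtype 1 --createdb-mode 0
}
```

For example, cluster_without_dbtype mmseqs_100/mmseqs2_rep_seq.gt11.fasta mmseqs_90/mmseqs2 tmp reruns the original clustering with the first workaround. Only one of the two is needed.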