[Question] Why are there ^@ characters in the representative FASTA?

jolespin commented 1 year ago

Expected Behavior

Current Behavior

Here's the rep_seq.fasta file:

>protein_A
MKNNSQDQQKLLKLLLQKKGISFKKVNTIPKRQSSNSELIPISLTQLELWFFAQFYPENC
IYNLPCIYRIEGLLNVPALEESLREIVKRHESLRTTFTCIDGKVFQKITDDPVFDFSILD
LQRLSEVEQKKETQRLLSAEINRSFDLEKDSLFCSKIIKFAENNYLLIMTIHHIIADGWS
LNILTKELGILYEAFCLNKPSPLPELPIQYGDFSLWQWQSIKDNSWQSQLNFWKKHIGIN
PPILKLPTDYPRTTAQPVEAAIHHFLISQNLTDALKSLSHQEKATLYMTLLAALKILLFY
YTEQTEIIIGGVAANRNQPETQDLIGLMVQFLPLYTHVEGQANFRDLLHQVKEMSLEFDA
YQEIPFITLVEKLKPIRDTQYALFHQVIFLFQNAPKQDLKFINLTVTDELLKMDIEPHSA
ENDLTISLEETEQGMKGAFVYKADLFTAETITQMEQYFQQLLKNIVTDPNKTISDLCPFS
ESVSFSDNTSPLSLSINEIKSNDLPHNLTRETMVNIWQEVLELEEISVNDNFFDLGGDSQ
KVIQVIDKIGKILKINCTLRDFFENPTVAR^@MTNHLFDLTGKVAIITGAARGIGRVLAQGLAQAGAKVVIGDINQVGAEQTVQLIQEAGGE
AIAIQTDVRQRQACQNLINQTVANYGQLDIMVCNAGVEILKNTDELEEFEWDQVINVDLK
GYFNCAQLATKQMIKQGTKGSIIMNSSICAFVAVPKSSGAYSAAKGGVNQLVKSLAVELA
SHKIRVNAFAPGYMNNMMEGTEGLRSTSDEMDELYTRIPMKRTGDLEELIGPVVFLASEA
SSYVTGAILMVDGGY^@MSKMNHNSQDKQKLLKLLLQKKGIGVKTNTIPTRNPSQLVPLSFSQERLWFLYQLEANGY
TYNMPFRFQIDGNLDVNIFRKALETIMQRHELLRTCFQEVDKTPRQIIKLKIQLNLPLLD
LLQELSHLYEAFDKNNLILDQISLFNTVIFPFGKDNSYFAYFTKSGFLT^@MFNLKHQLYLLISKFLEKQKRIEKHKKLSDNATFHSSVKFIGKCINYREDKTLIQIGENT
VIIGELAIFPFGGKIEIGRNCYIGEGTRIRSATSIKIGNEVIISDDVSIYDTDAHSLNYV
LRQKEFMEVLILNNLIKDAKDVDIQSAPVVIEDHVWIGFNVAILKGVTIGKGAIIGAGSV
VTKDVEPFTIVAGNPAKIIK^@MNVAEKTSDSINLKEDVRTFNQSLQEFLRFIQVYWYPKEHQDRAFSQIIRSWGMLVLLFL
LLVGLVGLNAFNSFVFRDLITFTEARDAEKLTHLVIIYAITLGSMTFFGGLSKFLKKLIA
LDWYQWINSSILQKYFKNRAYYQINFKGDIENPDQRLSQEIQPIARTTMDFLTTCVEKVM
EMLVFIAILWSISKTISIVLLVYTIIGNIVATYITQQLNKVNKQQLEIEGTYKYAMTHVR
THAESIAFFRGEEKELNIIQRKFNQVIKIMIERINWERTQEFFNRGYQSIVQIFPFLIVS
PLYISGEIEFGQVNQVSYCCYFFSTALSVLVDEFGRSREFINYIERLEEFYQALEGVSEQ
TNPVNTIKVIEDNNLAFDDVTLQTPDAAKVIVEHLSLSVEPGEGLLIVGPSGRGKSSLLR
AISGLWNTGTGHLVRPPLDDLLFLPQRPYIILGNLREQLIYPQTTTEMSESQLKEILQEV
NLQDVLNRIKNFDEEVDWENILSLGEQQRLAFARLFVNQPDFVILDEATSALDLKNEDHL
YKQLQQTGKTFISVGHRESLFNYHQRVLELSEDSTWRLMRMADYHPSTAIATHSNNEQTV
DETIEIVSEINNQDNFTHQEMQKLTNYSLSTIKNKARRGQTIVANDGMSYRYDK

Steps to Reproduce (for bugs)

mmseqs easy-linclust mmseqs_100/mmseqs2_rep_seq.gt11.fasta mmseqs_90/mmseqs2 tmp --min-seq-id 0.9 -c 0.8 --cov-mode 1 --dbtype 1

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

No errors

Context

I'm confused about what the ^@ characters are doing. It looks like they are concatenating the proteins?

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
Operating system and version:
```
MMseqs2 Version: 14.7e284
```

milot-mirdita commented 1 year ago

That's not supposed to happen. These are null bytes that separate entries in MMseqs2 databases. For some reason MMseqs2 read past an entry boundary and included the following entries too. Can you send us the input FASTA file (or an excerpt of it) so we can try to debug, please?
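To make the symptom concrete, here is a minimal Python sketch of how a merged record can be pulled apart again. It is not part of MMseqs2: `split_merged_record` is a hypothetical helper, and it assumes the null bytes (shown as ^@ in many viewers) sit exactly at the boundaries between the entries that were run together. The headers of the swallowed entries are not in the FASTA output, so only the sequences can be recovered this way, not their identifiers.

```python
def split_merged_record(sequence: str) -> list[str]:
    """Split a FASTA sequence body on NUL characters.

    Hypothetical helper for inspection only; it assumes each NUL marks
    the boundary between two entries that were merged into one record.
    """
    # Join the wrapped FASTA lines so each piece is one contiguous sequence.
    flat = sequence.replace("\r", "").replace("\n", "")
    # Split at the NUL separators and drop empty pieces.
    return [piece for piece in flat.split("\x00") if piece]


# Toy input: three short sequences merged into one record body.
merged = "MKV\nAR\x00MTN\nGY\x00MSK"
pieces = split_merged_record(merged)
print(len(pieces), pieces)
```

The first piece belongs to the header the record was written under; the remaining pieces are entries that should not have been part of that record.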

jolespin commented 1 year ago

Thanks for looking into this!

Here's the clustered file that produced the null bytes. It's relatively small. mmseqs2_rep_seq.gt11.fasta.gz

Not sure if it helps, but I've been searching for null bytes with grep -Pa '\x00' [filename]
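If it's useful to know which records are affected rather than only which lines match, the same check can be sketched in Python. This is a rough illustration, not an MMseqs2 tool: `records_with_null_bytes` is a hypothetical name, and it assumes a plain (uncompressed) FASTA file in which every header line starts with `>`.

```python
def records_with_null_bytes(path):
    """Return the headers of FASTA records whose sequence contains a NUL byte.

    Hypothetical helper; reads the file in binary mode so that NUL bytes
    are preserved exactly as they appear on disk.
    """
    affected = []
    header = None
    has_null = False
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                # A new record starts: report the previous one if it was affected.
                if header is not None and has_null:
                    affected.append(header)
                header = line[1:].strip().decode("ascii", errors="replace")
                has_null = False
            elif b"\x00" in line:
                has_null = True
    # Do not forget the last record in the file.
    if header is not None and has_null:
        affected.append(header)
    return affected
```

Calling `records_with_null_bytes("mmseqs2_rep_seq.fasta")` (the file name here is only an example) returns a list of header strings, which is empty when no record contains a null byte.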

milot-mirdita commented 1 year ago

You can add the --createdb-mode 0 parameter as a workaround.

Edit: A space-saving optimization is going wrong. The check for the optimization to work correctly depends on --dbtype not being set. The check should not depend on this parameter, as it's unrelated. Leaving out --dbtype should also fix the problem.

milot-mirdita commented 1 year ago

This should now be fixed in 6b93884.

jolespin commented 1 year ago

Awesome! Thank you for such a quick turnaround. What is the best way to update my installation?

milot-mirdita commented 1 year ago

You don't need to. Either drop the --dbtype parameter or add --createdb-mode 0. Either should fix your issue.
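For anyone calling mmseqs from a Python pipeline, the two workarounds can be sketched as small changes to the argument list. This is only an illustration under assumptions: `drop_dbtype` and `add_createdb_mode_0` are hypothetical helper names, the option is assumed to be passed as two separate arguments (`--dbtype 1`), and the paths are the ones from the command in this issue.

```python
def drop_dbtype(argv):
    """Return a copy of argv without --dbtype and the value that follows it."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            # This is the value that belonged to --dbtype; drop it too.
            skip_next = False
            continue
        if arg == "--dbtype":
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


def add_createdb_mode_0(argv):
    """Return a copy of argv with --createdb-mode 0 appended."""
    return list(argv) + ["--createdb-mode", "0"]


# The command from this issue, as an argument list.
original = [
    "mmseqs", "easy-linclust",
    "mmseqs_100/mmseqs2_rep_seq.gt11.fasta", "mmseqs_90/mmseqs2", "tmp",
    "--min-seq-id", "0.9", "-c", "0.8", "--cov-mode", "1", "--dbtype", "1",
]

print(" ".join(drop_dbtype(original)))
print(" ".join(add_createdb_mode_0(original)))
```

Either resulting list could then be handed to `subprocess.run(..., check=True)`. Only one of the two changes is needed.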
