soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Fasta entry 0 is invalid #446

Open Xacfran opened 3 years ago

Xacfran commented 3 years ago

Hi,

I'm trying to create a database before running a cluster analysis, but I'm getting a "Fasta entry 0 is invalid" error. I have created databases from other FASTA files without any problem.

I'm using:

mmseqs createdb sequence.fasta sequenceDB

The database I'm having issues with contains over 5 million sequences.

I will greatly appreciate any help!

milot-mirdita commented 3 years ago

Could you post the first few lines of the FASTA file (e.g. head -n 5 sequence.fasta)?

Xacfran commented 3 years ago

Thanks for such a quick reply. Sure, I'm pasting it below:

> SpFus_HiC_scaffold_22:42949255-42950482
CTCCACCAATGCACACTCCAATCCACTCTAACAAACACTAAGTCTGCAAAAAATTATTTG
ATGATTGAGTCAACCAGCAAATAGATGAatgaagaaaaccaaacaaacaaaaacctcagg
AATGAGAAAACTGTAATTGATGGTGGGCAGGAAAtagattcaaatataaaataatgtacc
tGTGACAAAATAGAAGCCAGAAGCTAAATATTAGGAAGGCTAGCAATGTAAGCATGATGG

milot-mirdita commented 3 years ago

The problem seems to be that MMseqs2 doesn't like the space between > and the rest of the name SpFus....

I am a bit surprised that this is the first time we are encountering this issue. Not quite sure what the correct behavior should be.

Xacfran commented 3 years ago

I didn't want to reply until I got the mmseqs cluster program running, and now I have. After I deleted the space you mentioned after ">", I was able to create my database. Thank you so much.
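For a file with millions of records, the same fix can be scripted rather than done by hand. Below is a minimal sketch in Python, assuming the only problem is whitespace between ">" and the identifier. The names `strip_header_space` and `fix_fasta_file` are illustrative helpers, not part of MMseqs2.

```python
import re

# A header marker followed by one or more spaces or tabs.
_HEADER_GAP = re.compile(r"^>[ \t]+")


def strip_header_space(lines):
    """Yield FASTA lines with any whitespace right after '>' removed.

    Sequence lines do not start with '>' and pass through unchanged.
    """
    for line in lines:
        yield _HEADER_GAP.sub(">", line)


def fix_fasta_file(src_path, dst_path):
    """Write a copy of src_path with normalized headers to dst_path.

    Streams line by line, so memory use stays flat on large files.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_header_space(src))
```

Calling `fix_fasta_file("sequence.fasta", "sequence.fixed.fasta")` (both file names are placeholders) and then passing the fixed file to `mmseqs createdb` should avoid the error, provided the header spacing was the only issue.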

milot-mirdita commented 3 years ago

Leaving this open as we still have to decide what to do about the parsing issue.

cjprybol commented 1 month ago

I just hit this as well. The error message doesn't make the formatting problem obvious, but after finding this thread it was a quick and easy fix to manually correct the FASTA header lines that start with "> " (a ">" followed by a space) so that they begin with ">" immediately followed by the name.

If writing the logic to handle parsing both "> " and ">" is more trouble than it is worth, revising the error message to indicate that the user may only need to fix the spacing issue and then try again would be very helpful! The file that caused the issue for me was downloaded from a small database maintained by an individual research lab, so I assume the FASTA file was manually updated at some point rather than machine-generated.

That's just a suggestion though. I was able to find this thread and the solution easily enough, and there seem to be a handful of existing FASTA validator tools that could also help users identify the root cause of their issue without building that logic into this suite of tools as well.
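As an illustration of how small such a check can be, here is a rough Python sketch that lists header lines where ">" is followed by whitespace. It covers only the spacing problem discussed in this thread, not full FASTA validation, and `find_spaced_headers` is a made-up helper name rather than an existing tool.

```python
def find_spaced_headers(lines):
    """Return (line_number, line) pairs for headers with whitespace after '>'.

    Line numbers are 1-based; trailing newlines are stripped from the
    reported text. Only this one formatting problem is checked.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">") and line[1:2] in (" ", "\t"):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file handle, for example `find_spaced_headers(open("sequence.fasta"))` with a placeholder file name, reports every affected header so it can be corrected before running `mmseqs createdb`.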

P.S. thanks for MMSeqs!