Open Xacfran opened 3 years ago
Could you post the first few lines of the FASTA file (e.g. head -n 5 sequence.fasta)?
Thanks for such a quick reply. Sure, I'm pasting it below:
> SpFus_HiC_scaffold_22:42949255-42950482
CTCCACCAATGCACACTCCAATCCACTCTAACAAACACTAAGTCTGCAAAAAATTATTTG
ATGATTGAGTCAACCAGCAAATAGATGAatgaagaaaaccaaacaaacaaaaacctcagg
AATGAGAAAACTGTAATTGATGGTGGGCAGGAAAtagattcaaatataaaataatgtacc
tGTGACAAAATAGAAGCCAGAAGCTAAATATTAGGAAGGCTAGCAATGTAAGCATGATGG
The problem seems to be that MMseqs2 doesn't like the space between > and the rest of the name (SpFus...).
I am a bit surprised that this is the first time we are encountering this issue. Not quite sure what the correct behavior should be.
I didn't want to reply until I had the mmseqs cluster program running, and now I do. After I deleted the space you mentioned after ">", I was able to create my database. Thank you so much.
Leaving this open as we still have to decide what to do about the parsing issue.
I just hit this as well. The error message doesn't make the formatting problem obvious, but after finding this thread it was a quick and easy fix to manually correct the FASTA lines that start with "> " (a ">" followed by a space) so that they begin with ">" directly.
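For anyone whose file is too large to fix by hand, the same correction could be scripted. Below is a minimal Python sketch, not part of MMseqs2. The names normalize_headers and fix_fasta and the output file name are made up for illustration, and it assumes every line starting with ">" is a header line.

```python
import re
from typing import Iterable, Iterator

# Matches ">" followed by one or more spaces or tabs at the start of a line.
_SPACED_HEADER = re.compile(r"^>[ \t]+")


def normalize_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines unchanged, except headers lose the whitespace right after '>'."""
    for line in lines:
        if line.startswith(">"):
            yield _SPACED_HEADER.sub(">", line, count=1)
        else:
            yield line


def fix_fasta(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path line by line, so large files are never fully in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(normalize_headers(src))
```

Calling fix_fasta("sequence.fasta", "sequence.fixed.fasta") writes a corrected copy and leaves the original untouched; createdb can then be pointed at the corrected file.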
If writing the logic to handle parsing both "> " and ">" is more trouble than it is worth, revising the error message to indicate that the user may only need to fix the spacing issue and then try again would be very helpful! The file that caused the issue for me was downloaded from a small database maintained by an individual research lab, so I assume the FASTA file was manually updated at some point rather than machine-generated.
That's just a suggestion though. I was able to find this thread and its solution easily enough, and there seem to be a handful of existing FASTA validator tools that could also help users identify the root cause of their issue without that logic having to be built into this suite of tools as well.
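To illustrate what such a check might look like for this particular problem, here is a small Python sketch that lists header lines where ">" is followed by a space or tab. It is not an existing validator and not part of MMseqs2; find_spaced_headers and its limit parameter are hypothetical names, and it only looks for this one formatting issue.

```python
from typing import Iterable, List, Tuple


def find_spaced_headers(lines: Iterable[str], limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (line_number, header) pairs where '>' is followed by a space or tab."""
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("> ") or line.startswith(">\t"):
            hits.append((number, line.rstrip("\n")))
            if len(hits) >= limit:
                break  # stop early; a few examples are enough to diagnose the file
    return hits
```

Passing an open file handle, e.g. find_spaced_headers(open("sequence.fasta")), reads the file lazily and stops at the first few offending headers.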
P.S. thanks for MMSeqs!
Hi,
I'm trying to create a database before running a cluster analysis, but I'm getting a "Fasta entry 0 is invalid" error. I have tried creating databases from other FASTA files and there was no problem whatsoever.
I'm using:
mmseqs createdb sequence.fasta sequenceDB
The FASTA file I'm having issues with contains over 5 million sequences.
I would greatly appreciate any help!