soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs result2repseq does not make a *_h file #146

Closed: nick-youngblut closed this issue 5 years ago

nick-youngblut commented 5 years ago

I'm trying to get the abundance of gene clusters generated by linclust. My method maps the post-QC Illumina reads to the post-linclust cluster representatives via mmseqs map. To get the representative-sequence database, I'm using mmseqs result2repseq. I ran mmseqs map (actually mmseqs search --alignment-mode 4, because of Issue #144), but after many hours of processing the job died with an error saying that no "*_h" file exists for the database.

Do I have to convert the rep-seq database to a fasta and then re-create the database with mmseqs createdb just so that I can generate the *_h file? Is there a more efficient way?

Why doesn't mmseqs search check for the necessary files at the start of the job instead of in the middle of the run (possibly after many hours of processing)?
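A wrapper script can do this kind of up-front check itself before launching a long job. The following is only a minimal sketch: it assumes the usual MMseqs2 layout in which the headers of a database DB are stored in DB_h and DB_h.index, and the function name require_header_db and the database names in the usage comment are hypothetical.

```shell
# Hypothetical helper: fail fast if a database lacks its header files.
# Assumption: headers for database "DB" live in "DB_h" and "DB_h.index".
require_header_db() {
    db=$1
    for f in "${db}_h" "${db}_h.index"; do
        if [ ! -f "$f" ]; then
            echo "missing header file: $f" >&2
            return 1
        fi
    done
    return 0
}

# Usage (database names are placeholders):
#   require_header_db repSeqDb || exit 1
#   mmseqs search readsDb repSeqDb resultDb tmp --alignment-mode 4
```

Running the check before the search means a missing *_h file is reported in seconds rather than hours into the run.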

martin-steinegger commented 5 years ago

@nick-youngblut yes, this is indeed not great default behavior. We will change it. Your computation time is not wasted: you can restart the job by calling the same command again. Before you do, you need to create the header file (the *_h file) by calling mmseqs createsubdb repSeqDb seqDb_h repSeqDb_h. Sorry for the inconvenience.
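The recovery step in this comment can be wrapped in a small helper so that it is safe to call repeatedly. This is a sketch, not part of MMseqs2: the createsubdb call is the one quoted in the comment, while the helper name ensure_header_db and the MMSEQS variable are hypothetical.

```shell
# Path to the mmseqs binary; overridable (hypothetical variable).
MMSEQS=${MMSEQS:-mmseqs}

# Create <repDb>_h from <seqDb>_h unless it already exists.
ensure_header_db() {
    rep_db=$1   # e.g. repSeqDb, the output of result2repseq
    seq_db=$2   # e.g. seqDb, the database the clusters came from
    if [ -f "${rep_db}_h" ]; then
        return 0   # header file already present, nothing to do
    fi
    "$MMSEQS" createsubdb "$rep_db" "${seq_db}_h" "${rep_db}_h"
}

# Usage: ensure_header_db repSeqDb seqDb
# Then rerun the original search command unchanged to restart the job.
```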

nick-youngblut commented 5 years ago

The restart behavior is quite nice. Thank you for explaining how to create the *_h file using mmseqs createsubdb. I didn't understand what createsubdb does; I must have missed it in the wiki docs.