Closed etowahadams closed 3 years ago
Sorry I didn't update the documentation. The _taxonomy
file is something new, it contains all the .dmp
files in a binary format that is instantly loadable (no more 5-7s delay every time the taxonomy database is used).
I see that makes sense. However, the resultDB from running a search against my nrDB is still an alignment database, and not a taxonomy database, so I cannot use the taxonomyreport
command with it. Maybe the search command is not using the _taxonomy
file correctly.
Edit: actually it looks like the taxonomyreport
command is still looking for the .dmp
files
taxonomyreport resultDB ../nrDB report.html --report-mode 1
MMseqs Version: 6672bbc9de55e89b011c8a055982a2644d31a467
Report mode 1
Threads 20
Verbosity 3
Input taxonomy database "resultDB" is missing files:
- resultDB_mapping
- resultDB_nodes.dmp
- resultDB_names.dmp
- resultDB_merged.dmp
But no resultDB_taxonomy
is present either.
I think you are nearly there, you just have to swap the order or nrDB
and resultDB
then it should work. The difference between a taxonomy result and a taxonomy sequence database is not quite clear in the output of MMseqs2, we should improve that.
Thank you for the suggestion. I tried it and here is the log file:
taxonomyreport ../nrDB resultDB report.html --report-mode 1
MMseqs Version: 6672bbc9de55e89b011c8a055982a2644d31a467
Report mode 1
Threads 20
Verbosity 3
Input database "resultDB" has the wrong type (Alignment)
Allowed input:
- Taxonomy
Edit: also the usage for taxonomyreport
says the following:
usage: mmseqs taxonomyreport <i:targetDB> <i:taxDB> <o:taxonomyReport> [options]
So according to this, I believe that nrDB should go second?
What command did you use to create the resultDB
? Could you print the first few lines of one of the resultDB
(head -n 5 resultDB.0
)?
targetDB
is a (taxonomy) sequence database and taxDB
is a taxonomy result database. The terminology used in the code is confusing and should definitely be improved by us.
I used the following command, where queryDB
is a sequence database created using a fasta file with 25 sequences.
mmseqs search test/query/queryDB ../nrDB resultDB tmp --num-iterations 3 --start-sens 1 --sens-steps 3 -s 7
Only three files were created as the result:
1.5M Jan 30 11:52 resultDB
4 Jan 30 11:52 resultDB.dbtype
392 Jan 30 11:52 resultDB.index
If I do head -n 5 resultDB
I get the following:
329161555 773 1.00 2.658E-249 0 382 383 0 382 383 383M
121000224 437 0.538 5.873E-133 0 382 383 0 389 390 217M7D166M
187343874 431 0.520 6.546E-131 0 382 383 0 389 390 217M7D166M
190748379 428 0.517 5.905E-130 0 382 383 0 389 390 217M7D166M
100067444 424 0.513 1.872E-128 0 382 383 0 392 393 219M8D80M2D84M
If I do head -n resultDB.index
I get the following:
0 0 46663
1 46663 71880
2 118543 58035
3 176578 50784
4 227362 80826
I also tried searching against my nrDB
using a profileDB, and similarly got only those three files (resultDB
, resultDB.dbtype
, and resultDB.index
) as the result. Here is the command I used there:
mmseqs search ../profileDB /home/scratch60/new_ndDB/nrDB resultDB tmp --start-sens 1 --sens-steps 3 -s 7 -e 0.005 -a 1
I see that makes sense about taxDB
!
Edit: also I don't know if I mentioned that I am using MMseqs2 Version: 6672bbc9de55e89b011c8a055982a2644d31a467
which was acquired and installed in mid-January. I am running it on my university's computer cluster (Red Hat Enterprise Linux Server 7.7 (Maipo)) which I ssh into.
Usually you would combine the taxonomy
workflow with the taxonomyreport
module:
mmseqs taxonomy queryDB nrDB resultDB tmp
mmseqs taxonomyreport queryDB nrDB report.html --report-mode 1
You could also assign one taxonomic label per search result by using either the lca
or majoritylca
modules, that result you could also give to taxonomyreport
:
mmseqs search queryDB nrDB searchDB tmp
mmseqs lca nrDB searchDB taxDB # or majoritylca
mmseqs taxonomyreport nrDB taxDB report.html --report-mode 1
Doing this would however mean that you are not using our 2bLCA procedure that gives a stable label. Especially in combination with a profile search I would expect this to result in many labels very close to the root node.
EDIT: I don't you are missing anything important that happened since that commit you used.
I tried using the taxonomy
workflow, using your instructions. I want to see a taxonomy report of my hits (resultDB
) which came from searching nrDB
, a seqTaxDB
.
# usage: mmseqs taxonomy <i:queryDB> <i:targetDB> <o:taxaDB> <tmpDir> [options]
mmseqs taxonomy resultDB ../nrDB taxaDB tmp
However, I get the following warning:
Input database "resultDB" has the wrong type (Alignment)
Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid
Is it not possible to generate a taxonomy report from mmseqs search
resultDB? The documentation says that
The result database of searching sequences against a
seqTaxDB
is referred to astaxonomyResult
but my resultDB from searching against a seqTaxDB
(nrDB
) is an Alignment type.
You have to supply a sequence database (e.g. test/query/queryDB
) for the first parameter of the taxonomy
.
That sentence is also not very precise. You cannot use taxonomyreport
directly with a search result. Either you use the taxonomy
workflow or one of the the lca
approach I mentioned if you are sure that you want to include every single homologous hit found into the taxonomic label (usually not a good idea).
I see. So when the documentation says
The result database of searching sequences against a seqTaxDB is referred to as
taxonomyResult
.
"searching" does not refer to mmseqs search
but rather the process that mmseqs taxonomy
uses?
Here is the work-around I used previously to see the taxonomy of my search results. Does this look pretty normal to you, or is there a better way to do this? There seems to be no good way to go from an alignment database (search result) to a sequence database which can then can be used in the taxonomy
workflow.
resultDB
with convertalis
blastdbcmd
to get a fasta file containing sequences of all my hitseasy-taxonomy
to renerate a Kraken style reportLet me summarize: For a search result you want either a Kraken or Krona style report of all NR targets found for all queries? Do I understand this right?
Yes that is correct. I originally thought that I could use taxonomyreport
for this purpose but was having trouble, which led me to believe that there was something wrong with my nrDB
(a seqTaxDB), but I see now that I was mistaken with my understanding.
That use-case makes sense, I think I'll implement it soonish (so that you can do that directly with taxonomyreport
).
I have a different hack for you though how to achieve this in the meantime:
# you already have this resultDB
mmseqs search queryDB nrDB resultDB tmp
# generate a fake taxonomy result database using convertalis (create a database with the taxid in column 0)
mmseqs convertalis queryDB nrDB resultDB taxDB --format-output taxid --db-output 1
# fake the dbtype so this taxDB is recognized as a taxDB
awk 'BEGIN { printf("%c%c%c%c",8,0,0,0); exit; }' > taxDB.dbtype
# now we should be able to use taxonomyreport to create a Kraken report
mmseqs taxonomyreport nrDB taxDB report.tsv
# or a Krona report
mmseqs taxonomyreport nrDB taxDB report.html --report-mode 1
Okay sorry, this will not work after all, as taxonomyreport
only reads the first entry and ignores the rest.
I'll update you when I fix taxonomyreport
to support your use-case.
Wow thank you so much!! I will keep an eye out for it. And thanks for answering my questions.
I can make a pull request with changes to the documentation based on our discussion if you'd like. Otherwise feel free to close this issue.
I think this should do what you want. Especially the Krona plots might get a bit overwhelming though (due to the potentially large number of alignments contributing to it).
Anyone should be able to contribute to the wiki, I think you can just edit it.
Amazing! I would install by compiling source, but my cluster only has g++ v9.8.5.
Are the static linux versions updated? I just tried reinstalling with
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvzf mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
but the changes aren't reflected there yet.
The static binaries take about one hour to build (running the tests take some time). GCC 9 would be completely sufficient to compile, we support (and check for) down to GCC 4.9 and clang 5.
I got the latest version. The new taxonomyreport
works great, I can't thank you enough! I will close this issue now.
First of all, I must thank all the MMseqs contributers for all the excellent documentation and support! The wiki has been extremely helpful.
Problem: Expected seqTaxDB
.dmp
files not createdI followed the instructions in the wiki for creating a seqTaxDB from an existing BLAST database. I am using the NR database.
The documentation says that the following files should be created:
However, am missing several of these files. Here are the files created. Notably, all the
.dmp
files are not being created.I know that this is at least a valid aminoacid database since I an search against it and get hits. However, I cannot use the
taxonomyreport
command on the results since it tells me that the result is an alignment database and not a taxonomy database. Similarily, when I run thetaxonomyreport
command with the nrDB as the result and seqTaxDB, it tells me that nrDB is an aminoacid database.I tried copying the
.dmp
files from the downloaded taxonomy into the same folder as my database, and renaming them tonrDB_merged.dmp
,nrDB_names.dmp
, andnrDB_nodes.dmp
. My database is still not being recognized as a taxonomy database though.createdb log file
createtaxdb log file
Thanks for taking the time to look at this!