soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.4k stars 195 forks source link

Expected files not created when seqTaxDB is made from existing BLAST database #401

Closed etowahadams closed 3 years ago

etowahadams commented 3 years ago

First of all, I must thank all the MMseqs contributers for all the excellent documentation and support! The wiki has been extremely helpful.

Problem: Expected seqTaxDB .dmp files not created

I followed the instructions in the wiki for creating a seqTaxDB from an existing BLAST database. I am using the NR database.

The documentation says that the following files should be created:

seqTaxDB, seqTaxDB.index, seqTaxDB.dbtype, seqTaxDB.lookup, seqTaxDB_h, seqTaxDB_h.index, seqTaxDB_h.dbtype, seqTaxDB_mapping, seqTaxDB_nodes.dmp, seqTaxDB_names.dmp, seqTaxDB_merged.dmp

However, am missing several of these files. Here are the files created. Notably, all the .dmp files are not being created.

nrDB.dbtype    nrDB.idx.1   nrDB.idx.2  nrDB.idx.7       nrDB.index  
nrDB_h         nrDB.idx.10  nrDB.idx.3  nrDB.idx.8       nrDB.lookup
nrDB_h.dbtype  nrDB.idx.11  nrDB.idx.4  nrDB.idx.9       nrDB_mapping 
nrDB_h.index   nrDB.idx.12  nrDB.idx.5  nrDB.idx.dbtype  nrDB.source
nrDB    nrDB.idx.0     nrDB.idx.13  nrDB.idx.6  nrDB.idx.index   nrDB_taxonomy

I know that this is at least a valid aminoacid database since I an search against it and get hits. However, I cannot use the taxonomyreport command on the results since it tells me that the result is an alignment database and not a taxonomy database. Similarily, when I run the taxonomyreport command with the nrDB as the result and seqTaxDB, it tells me that nrDB is an aminoacid database.

taxonomyreport ../nrDB ../nrDB report.html --report-mode 1

MMseqs Version: 6672bbc9de55e89b011c8a055982a2644d31a467
Report mode     1
Threads         20
Verbosity       3

Input database "../nrDB" has the wrong type (Aminoacid)
Allowed input:
- Taxonomy

I tried copying the .dmp files from the downloaded taxonomy into the same folder as my database, and renaming them to nrDB_merged.dmp, nrDB_names.dmp, and nrDB_nodes.dmp. My database is still not being recognized as a taxonomy database though.

createdb log file

createdb ../test/nr.fsa nrDB

MMseqs Version:         6672bbc9de55e89b011c8a055982a2644d31a467
Database type           0
Shuffle input database  true
Createdb mode           0
Write lookup file       1
Offset of numeric ids   0
Compressed              0
Verbosity               3

Converting sequences
[===================================================================================================    1 Mio. sequences processed

===================================================================================================     340 Mio. sequences processed
==============================

Time for merging to nrDB_h: 0h 2m 37s 499ms
Time for merging to nrDB: 0h 3m 51s 292ms
Database type: Aminoacid
Time for processing: 0h 45m 44s 356ms

createtaxdb log file

createtaxdb nrDB tmp --ncbi-tax-dump ../test/taxonomy/ --tax-mapping-file ../test/nr.fsa.taxidmapping

MMseqs Version:         6672bbc9de55e89b011c8a055982a2644d31a467
NCBI tax dump directory ../test/taxonomy/
Taxonomy mapping file   ../test/nr.fsa.taxidmapping
Taxonomy mapping mode   0
Taxonomy db mode        1
Threads                 36
Verbosity               3

Loading nodes file ... Done, got 2304309 nodes
Loading merged file ... Done, added 61039 merged nodes.
Loading names file ... Done
Init RMQ ...Done

Thanks for taking the time to look at this!

milot-mirdita commented 3 years ago

Sorry I didn't update the documentation. The _taxonomy file is something new, it contains all the .dmp files in a binary format that is instantly loadable (no more 5-7s delay every time the taxonomy database is used).

etowahadams commented 3 years ago

I see that makes sense. However, the resultDB from running a search against my nrDB is still an alignment database, and not a taxonomy database, so I cannot use the taxonomyreport command with it. Maybe the search command is not using the _taxonomy file correctly.

Edit: actually it looks like the taxonomyreport command is still looking for the .dmp files

taxonomyreport resultDB ../nrDB report.html --report-mode 1

MMseqs Version: 6672bbc9de55e89b011c8a055982a2644d31a467
Report mode     1
Threads         20
Verbosity       3

Input taxonomy database "resultDB" is missing files:
- resultDB_mapping
- resultDB_nodes.dmp
- resultDB_names.dmp
- resultDB_merged.dmp

But no resultDB_taxonomy is present either.

milot-mirdita commented 3 years ago

I think you are nearly there, you just have to swap the order or nrDB and resultDB then it should work. The difference between a taxonomy result and a taxonomy sequence database is not quite clear in the output of MMseqs2, we should improve that.

etowahadams commented 3 years ago

Thank you for the suggestion. I tried it and here is the log file:

taxonomyreport ../nrDB resultDB report.html --report-mode 1

MMseqs Version: 6672bbc9de55e89b011c8a055982a2644d31a467
Report mode     1
Threads         20
Verbosity       3

Input database "resultDB" has the wrong type (Alignment)
Allowed input:
- Taxonomy

Edit: also the usage for taxonomyreport says the following:

usage: mmseqs taxonomyreport <i:targetDB> <i:taxDB> <o:taxonomyReport> [options]

So according to this, I believe that nrDB should go second?

milot-mirdita commented 3 years ago

What command did you use to create the resultDB? Could you print the first few lines of one of the resultDB (head -n 5 resultDB.0)?

targetDB is a (taxonomy) sequence database and taxDB is a taxonomy result database. The terminology used in the code is confusing and should definitely be improved by us.

etowahadams commented 3 years ago

I used the following command, where queryDB is a sequence database created using a fasta file with 25 sequences.

mmseqs search test/query/queryDB ../nrDB resultDB tmp --num-iterations 3 --start-sens 1 --sens-steps 3 -s 7

Only three files were created as the result:

 1.5M Jan 30 11:52 resultDB
    4 Jan 30 11:52 resultDB.dbtype
  392 Jan 30 11:52 resultDB.index

If I do head -n 5 resultDB I get the following:

329161555       773     1.00    2.658E-249      0       382     383     0       382     383     383M
121000224       437     0.538   5.873E-133      0       382     383     0       389     390     217M7D166M
187343874       431     0.520   6.546E-131      0       382     383     0       389     390     217M7D166M
190748379       428     0.517   5.905E-130      0       382     383     0       389     390     217M7D166M
100067444       424     0.513   1.872E-128      0       382     383     0       392     393     219M8D80M2D84M

If I do head -n resultDB.index I get the following:

0       0       46663
1       46663   71880
2       118543  58035
3       176578  50784
4       227362  80826

I also tried searching against my nrDB using a profileDB, and similarly got only those three files (resultDB, resultDB.dbtype, and resultDB.index) as the result. Here is the command I used there:

mmseqs search ../profileDB /home/scratch60/new_ndDB/nrDB resultDB tmp --start-sens 1 --sens-steps 3 -s 7 -e 0.005 -a 1

I see that makes sense about taxDB!

Edit: also I don't know if I mentioned that I am using MMseqs2 Version: 6672bbc9de55e89b011c8a055982a2644d31a467 which was acquired and installed in mid-January. I am running it on my university's computer cluster (Red Hat Enterprise Linux Server 7.7 (Maipo)) which I ssh into.

milot-mirdita commented 3 years ago

Usually you would combine the taxonomy workflow with the taxonomyreport module:

mmseqs taxonomy queryDB nrDB resultDB tmp
mmseqs taxonomyreport queryDB nrDB report.html --report-mode 1

You could also assign one taxonomic label per search result by using either the lca or majoritylca modules, that result you could also give to taxonomyreport:

mmseqs search queryDB nrDB searchDB tmp
mmseqs lca nrDB searchDB taxDB # or majoritylca
mmseqs taxonomyreport nrDB taxDB report.html --report-mode 1

Doing this would however mean that you are not using our 2bLCA procedure that gives a stable label. Especially in combination with a profile search I would expect this to result in many labels very close to the root node.

EDIT: I don't you are missing anything important that happened since that commit you used.

etowahadams commented 3 years ago

I tried using the taxonomy workflow, using your instructions. I want to see a taxonomy report of my hits (resultDB) which came from searching nrDB, a seqTaxDB.

# usage: mmseqs taxonomy <i:queryDB> <i:targetDB> <o:taxaDB> <tmpDir> [options]
mmseqs taxonomy resultDB ../nrDB taxaDB tmp

However, I get the following warning:

Input database "resultDB" has the wrong type (Alignment)
Allowed input:
- Index
- Nucleotide
- Profile
- Aminoacid

Is it not possible to generate a taxonomy report from mmseqs search resultDB? The documentation says that

The result database of searching sequences against a seqTaxDB is referred to as taxonomyResult

but my resultDB from searching against a seqTaxDB (nrDB) is an Alignment type.

milot-mirdita commented 3 years ago

You have to supply a sequence database (e.g. test/query/queryDB) for the first parameter of the taxonomy.

That sentence is also not very precise. You cannot use taxonomyreport directly with a search result. Either you use the taxonomy workflow or one of the the lca approach I mentioned if you are sure that you want to include every single homologous hit found into the taxonomic label (usually not a good idea).

etowahadams commented 3 years ago

I see. So when the documentation says

The result database of searching sequences against a seqTaxDB is referred to as taxonomyResult.

"searching" does not refer to mmseqs search but rather the process that mmseqs taxonomy uses?

Here is the work-around I used previously to see the taxonomy of my search results. Does this look pretty normal to you, or is there a better way to do this? There seems to be no good way to go from an alignment database (search result) to a sequence database which can then can be used in the taxonomy workflow.

  1. get all the accession IDs of all my hits in resultDB with convertalis
  2. use blastdbcmd to get a fasta file containing sequences of all my hits
  3. then use easy-taxonomy to renerate a Kraken style report
milot-mirdita commented 3 years ago

Let me summarize: For a search result you want either a Kraken or Krona style report of all NR targets found for all queries? Do I understand this right?

etowahadams commented 3 years ago

Yes that is correct. I originally thought that I could use taxonomyreport for this purpose but was having trouble, which led me to believe that there was something wrong with my nrDB (a seqTaxDB), but I see now that I was mistaken with my understanding.

milot-mirdita commented 3 years ago

That use-case makes sense, I think I'll implement it soonish (so that you can do that directly with taxonomyreport). I have a different hack for you though how to achieve this in the meantime:

# you already have this resultDB
mmseqs search queryDB nrDB resultDB tmp
# generate a fake taxonomy result database using convertalis (create a database with the taxid in column 0)
mmseqs convertalis queryDB nrDB resultDB taxDB --format-output taxid --db-output 1
# fake the dbtype so this taxDB is recognized as a taxDB
awk 'BEGIN { printf("%c%c%c%c",8,0,0,0); exit; }' > taxDB.dbtype
# now we should be able to use taxonomyreport to create a Kraken report
mmseqs taxonomyreport nrDB taxDB report.tsv
# or a Krona report
mmseqs taxonomyreport nrDB taxDB report.html --report-mode 1
milot-mirdita commented 3 years ago

Okay sorry, this will not work after all, as taxonomyreport only reads the first entry and ignores the rest. I'll update you when I fix taxonomyreport to support your use-case.

etowahadams commented 3 years ago

Wow thank you so much!! I will keep an eye out for it. And thanks for answering my questions.

etowahadams commented 3 years ago

I can make a pull request with changes to the documentation based on our discussion if you'd like. Otherwise feel free to close this issue.

milot-mirdita commented 3 years ago

I think this should do what you want. Especially the Krona plots might get a bit overwhelming though (due to the potentially large number of alignments contributing to it).

Anyone should be able to contribute to the wiki, I think you can just edit it.

etowahadams commented 3 years ago

Amazing! I would install by compiling source, but my cluster only has g++ v9.8.5.

Are the static linux versions updated? I just tried reinstalling with

wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz 
tar xvzf mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH

but the changes aren't reflected there yet.

milot-mirdita commented 3 years ago

The static binaries take about one hour to build (running the tests take some time). GCC 9 would be completely sufficient to compile, we support (and check for) down to GCC 4.9 and clang 5.

etowahadams commented 3 years ago

I got the latest version. The new taxonomyreport works great, I can't thank you enough! I will close this issue now.