soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

is mmseqs latest version available to use? #387

Open Leran10 opened 3 years ago

Leran10 commented 3 years ago

Hey, I'm reading this paper: https://www.biorxiv.org/content/10.1101/2020.11.27.401018v1.full.pdf I'm trying to get the latest version of mmseqs with the --majority feature, but I cannot see where to download it. Is it available now? Thank you!

milot-mirdita commented 3 years ago

No official release quite yet, but the latest git version already contains the new algorithm. You can either compile from source or use the precompiled static binaries at mmseqs.com/latest. We'll also make a release soonish.

Leran10 commented 3 years ago

thanks for your reply!!

Leran10 commented 3 years ago

Hi @milot-mirdita,

I have successfully installed the precompiled version of the latest mmseqs. But when I run my script, it keeps throwing an error that says:

Input database "./results/contigs/mmseqs_aa_out/taxonomyResult" is wrong!
Current input: Taxonomy. Allowed input: Alignment

when I switch to the old version, I don't see this error.

Is there any advice on that?

Thanks very much!! Leran

milot-mirdita commented 3 years ago

I assume you are using easy-taxonomy? I think it's currently broken; I am in the process of fixing it. I'll update you once everything works again.

Leran10 commented 3 years ago

Hi @milot-mirdita,

Actually I'm not using the easy-taxonomy module. This is what I used in my previous script:

/home/leranwang/mmseqs/bin/mmseqs lca $DB $OUT/taxonomyResult $OUT/lcaDB --tax-lineage true \
--lca-ranks "superkingdom:phylum:class:order:family:genus:species";

And here is what I modified:

/home/leranwang/mmseqs/bin/mmseqs majoritylca $DB $OUT/taxonomyResult $OUT/lcaDB --tax-lineage true \
--lca-ranks "superkingdom:phylum:class:order:family:genus:species";

As you can see, I only switched from "lca" to "majoritylca", and then I got this error. Are there other changes required in my previous steps before using the majoritylca module?

For your reference, the modules I have been using in my script are (in order):

mmseqs createdb
mmseqs taxonomy
mmseqs convertalis
mmseqs majoritylca
mmseqs createtsv
mmseqs taxonomyreport

Thank you! Leran

milot-mirdita commented 3 years ago

I just pushed my recent changes. Could you try again with the latest git version? I think there were a couple of issues, which should hopefully be resolved now.

majoritylca is still kind of new, and I am not sure it's applicable to your use case. It does indeed need an alignment result as input, not an LCA result.

Leran10 commented 3 years ago

Thanks @milot-mirdita! After reinstalling the updated version from git, that error is gone. But I got a new one that says:

majoritylca /scratch/ref/hecatomb_databases/virus_uniprot/targetDB ./results/contigs/mmseqs_aa_out/taxonomyResult ./results/contigs/mmseqs_aa_out/lcaDB --tax-lineage 1 --lca-ranks superkingdom,phylum,class,order,family,genus,species 

MMseqs Version:                 6672bbc9de55e89b011c8a055982a2644d31a467
Majority threshold              0.5
Vote mode                       1
LCA ranks                       superkingdom,phylum,class,order,family,genus,species
Taxon blacklist                 12908,28384
Column with taxonomic lineage   1
Compressed                      0
Threads                         24
Verbosity                       3

Loading NCBI taxonomy
Loading nodes file ... Done, got 2227973 nodes
Loading merged file ... Done, added 56613 merged nodes.
Loading names file ... Done
Init RMQ ...Done
Computing LCA
[====taxonid: 1046081 does not match a legal taxonomy node.
================================taxonid: 1046081 does not match a legal taxonomy node.
===============taxonid: 1046081 does not match a legal taxonomy node.
=============

So, do you have any suggestions on this one?

Thank you!! Leran

Leran10 commented 3 years ago

Hi @milot-mirdita ,

I found this 1046081 in our targetDB.lookup file, which looks like this:

>grep 1046081 targetDB.lookup
1046081 A0A0U2C0X9  0

I wonder if the new version of mmseqs expects this file to be in a different format? For example, no row number, just "A0A0U2C0X9"? Or what format does it expect for a "legal taxonomy node"?

Thank you! Leran

milot-mirdita commented 3 years ago

This looks about right. Each .lookup line contains three fields: mmseqs database key, accession, set id. The set id refers to the .source file, which lists each set id together with its filename (the filename that was given to createdb).
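To make the three columns concrete, here is a minimal shell sketch that splits a .lookup entry into its fields. It assumes the fields are separated by tabs or spaces and that accessions contain no whitespace; the helper name `show_lookup_entry` is made up for illustration, and the demo file holds only the line quoted in this thread.

```shell
#!/bin/sh
# Print the fields of the .lookup entry whose database key (column 1) equals $1.
# Column meaning, per the description above: database key, accession, set id.
show_lookup_entry() {
    key="$1"
    lookup_file="$2"
    awk -v key="$key" '$1 == key {
        printf "key=%s accession=%s set_id=%s\n", $1, $2, $3
    }' "$lookup_file"
}

# Demo on a throwaway file containing the entry quoted in this thread.
demo_lookup="$(mktemp)"
printf '1046081\tA0A0U2C0X9\t0\n' > "$demo_lookup"
show_lookup_entry 1046081 "$demo_lookup"
rm -f "$demo_lookup"
```

Matching on the first column only avoids hits where the same number appears in another field.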

How does the corresponding _mapping entry look? I would expect it to look like this (mmseqs database key tab taxonomy id):

>grep 1046081 targetDB_mapping
1046081\t2560625
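One caveat: a plain grep matches the number anywhere on the line, so it also returns entries whose taxonomy id (second column) happens to be 1046081. A minimal sketch that restricts the match to the key column, assuming a two-column file separated by tabs or spaces; `mapping_for_key` is a hypothetical helper name, and the demo lines are illustrative rather than taken from a real database.

```shell
#!/bin/sh
# Print the taxonomy id (column 2) stored for the database key $1 (column 1).
mapping_for_key() {
    key="$1"
    mapping_file="$2"
    awk -v key="$key" '$1 == key { print $2 }' "$mapping_file"
}

# Demo: the first line has the expected shape shown above; the second line
# carries the same number in the taxonomy id column and must not match.
demo_mapping="$(mktemp)"
printf '1046081\t2560625\n' >  "$demo_mapping"
printf '14390\t1046081\n'   >> "$demo_mapping"
mapping_for_key 1046081 "$demo_mapping"   # prints only the taxonomy id for key 1046081
rm -f "$demo_mapping"
```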

(Sorry for the slow answers, it was quite a busy start to the new year.)

Leran10 commented 3 years ago

Hi @milot-mirdita, sorry, I was tied up with something else. I'll definitely check this and see what we have. Thanks so much!

Leran10 commented 3 years ago

Hi @milot-mirdita, I tried your suggestion, but what I got was not as we expected:

grep 1046081 targetDB_mapping
14390 1046081
14791 1046081
54727 1046081
54821 1046081
67878 1046081
69099 1046081
108245 1046081
109607 1046081
110046 1046081
122069 1046081
122322 1046081
162216 1046081
162665 1046081
163469 1046081
176042 1046081
176923 1046081
216864 1046081
217968 1046081
229254 1046081
270710 1046081
271836 1046081
283663 1046081
284074 1046081
285161 1046081
285169 1046081
391183 1046081
436065 1046081
445787 1046081
460918 1046081
500264 1046081
500504 1046081
541797 1046081
553721 1046081
594524 1046081
595557 1046081
607272 1046081
608982 1046081
661008 1046081
661307 1046081
662949 1046081
716910 1046081
756016 1046081
810381 1046081
810751 1046081
811520 1046081
822152 1046081
823757 1046081
838700 1046081
838702 1046081
865061 1046081
867821 1046081
876434 1046081
877320 1046081
878516 1046081
878706 1046081
930957 1046081
972997 1046081
975592 1046081
985653 1046081
985782 1046081
1027284 1046081
1038026 1046081
1039227 1046081
1046081 1647282
1079684 1046081
1080241 1046081
1080934 1046081
1081272 1046081
1083388 1046081
1093920 1046081
1094260 1046081
1134096 1046081
1135238 1046081
1145518 1046081
1147729 1046081
1148216 1046081
1187614 1046081
1188425 1046081
1201219 1046081
1201699 1046081
1241735 1046081
1242143 1046081
1242394 1046081
1243384 1046081
1243671 1046081
1270443 1046081
1295835 1046081
1308040 1046081
1308929 1046081
1349672 1046081
1351320 1046081
1403774 1046081
1405289 1046081
1470650 1046081
1511580 1046081
1511688 1046081
1513432 1046081
1524930 1046081
1525999 1046081
1526004 1046081
1526442 1046081
1566745 1046081
1577418 1046081
1579970 1046081
1580211 1046081
1623216 1046081
1631386 1046081
1631675 1046081
1632547 1046081
1633189 1046081
1634365 1046081
1648224 1046081
1675129 1046081
1687564 1046081

Could this result give us any clue about what went wrong?

Thank you very much and sorry for my late response!

Leran

milot-mirdita commented 3 years ago

Could you summarize what exactly your targetDB is and how you built it? I'm not really sure what's going on there just from the output you pasted. It seems like something went wrong during its construction.
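One way to narrow this down is to check whether every taxonomy id used in the mapping is actually present in the taxonomy being loaded. A rough sketch under stated assumptions: the mapping is a two-column file (database key, taxonomy id), and the nodes file follows the NCBI nodes.dmp layout, with the taxon id as the first field. The helper name `missing_taxids` and the toy data are invented for illustration and say nothing about the real taxonomy dump.

```shell
#!/bin/sh
# List taxonomy ids that occur in the mapping (column 2) but have no entry
# in the nodes file, together with the number of mapping lines using each.
# Assumes the nodes file is non-empty (required by the NR == FNR idiom).
missing_taxids() {
    mapping_file="$1"
    nodes_file="$2"
    awk '
        # First file (nodes): remember every taxon id (first field).
        NR == FNR { known[$1] = 1; next }
        # Second file (mapping): count taxon ids that have no node.
        !($2 in known) { missing[$2]++ }
        END { for (t in missing) print t, missing[t] }
    ' "$nodes_file" "$mapping_file" | sort -n
}

# Toy demo: nodes 1 and 10 exist; taxon 20 is referenced twice but is absent.
demo_nodes="$(mktemp)"
demo_map="$(mktemp)"
printf '1\t|\t1\t|\tno rank\t|\n10\t|\t1\t|\tspecies\t|\n' > "$demo_nodes"
printf '100\t10\n101\t20\n102\t20\n' > "$demo_map"
missing_taxids "$demo_map" "$demo_nodes"
rm -f "$demo_nodes" "$demo_map"
```

With real data this would be called with targetDB_mapping as the first argument and, as the second, whichever nodes.dmp the database was built against.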

Leran10 commented 3 years ago

Hi @milot-mirdita,

I'm not sure if this information would help to find the issue:

We have a mixture of headers: one kind is from UniRef50 and the other is from a viral database. The latter look like this:

tr|Q7Y3F5|Q7Y3F5_9CAUD DNA-directed DNA polymerase OS=Streptococcus virus C1 OX=230871 GN=orf7 PE=4 SV=1
tr|A9D9W5|A9D9W5_9CAUD Uncharacterized protein OS=Lactobacillus prophage Lj771 OX=139871 GN=LJ771_044 PE=4 SV=1
tr|A9D9U1|A9D9U1_9CAUD Uncharacterized protein OS=Lactobacillus prophage Lj771 OX=139871 GN=LJ771_036 PE=4 SV=1
tr|K4NWE0|K4NWE0_9PICO Genome polyprotein OS=WUHARV Sapelovirus 1 OX=1245562 PE=4 SV=1

And they are UniProt, not UniRef.
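The OX= field in these headers is the NCBI taxonomy id of the source organism, so each header already carries the information for an accession-to-taxon table. A minimal sketch that extracts it, assuming headers of the form `db|ACCESSION|ENTRY_NAME ... OX=<digits> ...` with or without a leading `>`; `header_to_taxid` is a hypothetical helper name, and how such a table should be supplied to MMseqs2 is not settled in this thread.

```shell
#!/bin/sh
# Read UniProt-style header lines on stdin and print "accession<TAB>taxon id".
# Lines without an accession or an OX= field (e.g. sequence lines) are skipped.
header_to_taxid() {
    awk '
        {
            # The accession is the second "|"-delimited part of the first word.
            split($1, id, "|")
            taxid = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^OX=[0-9]+$/) { taxid = substr($i, 4) }
            }
            if (id[2] != "" && taxid != "") { printf "%s\t%s\n", id[2], taxid }
        }'
}

# Demo on two of the headers quoted above.
printf '%s\n' \
    'tr|Q7Y3F5|Q7Y3F5_9CAUD DNA-directed DNA polymerase OS=Streptococcus virus C1 OX=230871 GN=orf7 PE=4 SV=1' \
    'tr|K4NWE0|K4NWE0_9PICO Genome polyprotein OS=WUHARV Sapelovirus 1 OX=1245562 PE=4 SV=1' \
    | header_to_taxid
```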

Could this be what caused the problems? Could you let me know what information we would need to add to make it work?

Thanks so much!

milot-mirdita commented 3 years ago

What exact steps did you take to turn this FASTA file into an MMseqs2 database?

It would be helpful to know so we can find out where the construction might have initially gone wrong.

Leran10 commented 3 years ago

We downloaded all of the UniProt entries for viruses from the UniProt website and clustered them with CD-HIT using -c 0.99. Then we ran mmseqs createdb on that file.