soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

easy-taxonomy for contigs #851

Closed TimothyStephens closed 3 months ago

TimothyStephens commented 3 months ago

Not sure if this is a bug or if I am missing a flag that would make this all work as expected.

Expected Behavior

I wish to taxonomically annotate contigs using the mmseqs easy-taxonomy workflow. I see from your documentation (https://github.com/soedinglab/MMseqs2/wiki#taxonomy-output-and-tsv) that it is possible to calculate the LCA of a contig's predicted ORFs, with the output file listing the contig_name along with the total number of predicted ORFs and the number of those ORFs whose top hits agree with the assigned LCA of the contig.

Current Behavior

When I run the following command:

mmseqs easy-taxonomy contigs.fasta swissprotDB tax tmp

I get the following results files:

tax_lca.tsv
tax_report
tax_tophit_aln
tax_tophit_report

None of these files contain the expected output described in the documentation.

I have had a look at using the aggregatetax command, but ran into a problem with the createtsv command not reassigning the contig names correctly.

mmseqs createdb contigs.fasta contigsDb
mmseqs extractorfs contigsDb orfsAaDb --translate 1
mmseqs taxonomy orfsAaDb swissprotDB taxPerOrf tmp --tax-output-mode 2
mmseqs aggregatetaxweights swissprotDB orfsAaDb_h taxPerOrf taxPerOrf_aln taxPerContig --majority 0.5
mmseqs createtsv orfsAaDb contigsDb taxPerContig aggregatetaxResult.tsv

Your Environment

MMseqs2 Version: 113e3212c137d026e297c7540e1fcd039f6812b1
Pre-compiled binary

Thanks for your help in advance.

Cheers, Tim.

milot-mirdita commented 3 months ago

easy-taxonomy already does what you want.

Here is the tax_lca.tsv output for one contig, with an added header:

CONTIG_NAME TAX_ID TAX_RANK NAME                    TOTAL_FRAGS ASSIGNED_FRAGS FRAGS_AGREEMENT  AGREEMENT_RATIO
query       3702   species  Arabidopsis thaliana    2           2              2                1.000

From what I understand, the last few numbers are what you want, right?
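
For readers who want the same labelled view, the header can be prepended to the file after the run. This is only a sketch: the column names are the labels used in the comment above (mmseqs does not write them), the file is assumed to be tab-separated, and the sample row and temporary directory stand in for a real easy-taxonomy result.

```shell
# Stand-in for a real result: one row copied from the comment above.
workdir=$(mktemp -d)
printf 'query\t3702\tspecies\tArabidopsis thaliana\t2\t2\t2\t1.000\n' > "$workdir/tax_lca.tsv"

# Column labels taken from the comment above; they are not produced by mmseqs.
header='CONTIG_NAME\tTAX_ID\tTAX_RANK\tNAME\tTOTAL_FRAGS\tASSIGNED_FRAGS\tFRAGS_AGREEMENT\tAGREEMENT_RATIO'

# Write the header first, then the untouched rows, into a new file.
{ printf '%b\n' "$header"; cat "$workdir/tax_lca.tsv"; } > "$workdir/tax_lca.with_header.tsv"

cat "$workdir/tax_lca.with_header.tsv"
```

Writing to a new file rather than editing in place keeps the original output usable by any downstream step that expects no header.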

TimothyStephens commented 3 months ago

Your results are exactly what I am looking for. However, when I run mmseqs easy-taxonomy bin.1.fa NR bin.1.tax tmp --threads 24, I get the following output:

$ head -n 4 bin.1.tax_lca.tsv
scaffold00022005    2590670 species Aestuariibacter sp. GS-14
scaffold00022061    1046    family  Chromatiaceae
scaffold00022138    2590670 species Aestuariibacter sp. GS-14
scaffold00022216    2590670 species Aestuariibacter sp. GS-14

For some reason, the 'FRAGS' columns are missing from my results file.
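
A quick way to confirm which layout a results file has is to count the fields per line. The sketch below assumes the file is tab-separated and uses two rows copied from the output above as stand-in data; the eight-column layout it is compared against is the one shown in the earlier reply.

```shell
# Stand-in data: two rows from the output above, written to a temporary directory.
workdir=$(mktemp -d)
printf 'scaffold00022005\t2590670\tspecies\tAestuariibacter sp. GS-14\n' > "$workdir/bin.1.tax_lca.tsv"
printf 'scaffold00022061\t1046\tfamily\tChromatiaceae\n' >> "$workdir/bin.1.tax_lca.tsv"

# Tally how many lines have each number of tab-separated fields.
field_counts=$(awk -F'\t' '{ n[NF]++ } END { for (k in n) print k " fields: " n[k] " lines" }' "$workdir/bin.1.tax_lca.tsv")
echo "$field_counts"
```

For these rows the tally reports four fields on every line, whereas a file with the fragment columns would report eight.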

milot-mirdita commented 3 months ago

Can you post the first few lines of bin.1.fa please?

TimothyStephens commented 3 months ago

>scaffold00000068
CTGACGATCAGGCCGGCCGGATCATCGGCGCCGCGGTTGATCCGCAGGCCGGTCGCGAGG
CGCTGGAGGCGGGTGGAGAGGTCGGTGTTCGAGCGGTTCAGGTTGTTCTGCGCAATGAGC
GACGGAACATTCGTATTGATGCGAGCCATGGCAATCCTCCTTGATTGAGAGGCCCTTCGC
GGGAGCAGTCGTCAGGCGGTCTCGCGCCTGGAAGTCGGATTCGCCGCAAGTCGCGCGGCT
GCAGGCAGGACCGTCCCGCCCGGAGTGGCCTTCGCGCCGCGCGAACTCCCGCGCGGGCGG
AGTGAACCAAGCTCGCCACGGAACGCCGCGGCCTGCTGGTTCTCGTTCCTGATGGCGTCG
TACACCTCCTTGCGGTGCACCGCCACGGTTGCCGGCGCCTTGATGCCGATTCGCACCTTG
TCGCCGCGGATATCGACGATCGTGAGTTCGACGTCGTCGCCGATCATGATCGTCTCGTCG
ATCTGTCGTGAGAGCACCAGCATGTGCGGGCTCCTTTCTTGCGGGCATCCATGCCCGGTG

milot-mirdita commented 3 months ago

I just looked up when the commit you listed (113e3212c137d026e297c7540e1fcd039f6812b1) is from, and it's ancient. Please update to the latest release and everything should work fine.

TimothyStephens commented 3 months ago

Updating to the most recent version fixed the issue. My apologies, I didn't realize how old my version had gotten.