shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License
176 stars 13 forks source link

Metaphlan format profile report file seems to lack NCBI taxonomy IDs. #4

Closed ilnamkang closed 2 years ago

ilnamkang commented 2 years ago

Thank you for a great tool!

I'd like to use Metaphlan format profile report file obtained from "kmcp profile" as an input of Pavian, which draws Sankey plots. ( https://github.com/fbreitwieser/pavian ) ( https://fbreitwieser.shinyapps.io/pavian/ )

But, when I tried, the Metaphlan format output file of "kmcp profile" is not recognized by Pavian shiny app, which claims that it can recognize Kraken, Centrifuge and MetaPhlAn report files.

When I compared my output file to the original MetaPhlAn3 output format (https://github.com/biobakery/biobakery/wiki/metaphlan3#output-files), it seems that kmcp output file does not have "NCBI_tax_id" column, which seems to be required by Pavian.

Are there any ways to solve this problem?

Thanks in advance.

shenwei356 commented 2 years ago

Hi, thanks for using KMCP. I've added a flag --metaphlan-report-version to choose the MetaPhlAn format version, and the default is 3, so you don't need change the commands.

The format should be OK now:

#SampleID   
#clade_name NCBI_tax_id relative_abundance  additional_species
k__Bacteria 2   100.000000  
k__Bacteria|p__Proteobacteria   1224    99.530189   
k__Bacteria|p__Verrucomicrobia  74201   0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria    1236    99.530189   
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae  203494  0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales    91347   99.530189   
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales    48461   0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae  543 99.530189   
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae 1647988 0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia   561 99.530189   
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia  239934  0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli   562 99.530189   
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia|s__Akkermansia muciniphila   239935  0.469811    
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli SE15  431946  48.321535   
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli K-12  83333   46.194629   
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli|t__Escherichia coli O157:H7 str. Sakai    386585  5.014025    
k__Bacteria|p__Verrucomicrobia|c__Verrucomicrobiae|o__Verrucomicrobiales|f__Akkermansiaceae|g__Akkermansia|s__Akkermansia muciniphila|t__Akkermansia muciniphila ATCC BAA-835   349741  0.469811    
ilnamkang commented 2 years ago

Thank you for a quick update!!

I've rerun "kmcp profile" using the updated binary. Now, the result file contains "NCBI_tax_id".

But, this updated output file was still not recognized by Pavian. Therefore, I looked the Pavian repository to see if there are any hints for the solution.

In my opinion, Pavian seems to require specific strings in the first line of input files of MetaPhlAn format.

I found these lines in one of R script files of the Pavian repository. (https://github.com/fbreitwieser/pavian/blob/master/R/datainput-read_report.R) ---- is_metaphlan3_fmt <- grepl("^#mpa_v3", first.line) is_metaphlan_fmt <- grepl("Metaphlan2_Analysis$", first.line) ----

So, I guessed that input files of MetaPhlAn3 format should start with "#mpa_v3" and those of MetaPhlAn2 format should have the first line ending with with "Metaphlan2_Analysis".

Therefore, I tried inserting "#mpa_v3" in the first line of the kmcp output file as below. This edited file is recognized by Pavian without any errors.

----

mpa_v3

SampleID

clade_name NCBI_tax_id relative_abundance additional_species

kBacteria 2 100.000000
k
Bacteria|pProteobacteria 1224 57.368930
k
Bacteria|pActinobacteria 201174 32.664842
k
Bacteria|p__Bacteroidetes 976 4.010597
----

Thanks for your help.

shenwei356 commented 2 years ago

I see, the MetaPhlAn3 outputs the database info as the first line in the profile:

#mpa_v30_CHOCOPhlAn_201901