How to profile results only identify 1 species per reference

nvhphuc1206 commented 7 months ago

Dear Shenwei,

Thank you very much for providing a good tool, I am trying to adjust the output table as desired according to KMCP format.

Currently I use search with all the databases you provide and the output table contains more than one ref per species, so I want the results to only return one ref per species based on the number of reads or based on the score, is that possible?
I want the output table to contain results according to each desired rank, for example, is it possible to output only Genus ranks or only Species ranks?

Thank you so much,

Best,

Phuc

shenwei356 commented 7 months ago

Hi Phuc,

Thanks for using KMCP.

I want the output table to contain results according to each desired rank, for example, is it possible to output only Genus ranks or only Species ranks?

You might add the output in CAMI format, where the abundance of each rank is provided.

-C, --cami-report string                ► Save extra CAMI-like report.

Or MetaPhlan format, which might have more downstream analysis tools.

-M, --metaphlan-report string           ► Save extra metaphlan-like report.

I want the results to only return one ref per species based on the number of reads or based on the score, is that possible?

You can add --level strain.

  --level string                      ► Level to estimate abundance at. Available values: species,
                                      strain/assembly. (default "species")

nvhphuc1206 commented 7 months ago

Thanks Shenwei, but it seems my intended use is a bit more complicated

You might add the output in CAMI format or MetaPhlan format ...

I want to further use taxpasta after profiling with KMCP but currently taxpasta only supports main output of KMCP and I have tried using CAMI and MetaPhlan format output from KMCP but taxpasta gives an error so I am aiming to use KMCP's main.profile. Therefore, I need to use the main output of KMCP, do you have any solution?

You can add --level strain.

I tried using --level strain, but it seems that when I add this argument, the target number of reads is lost. Specifically, I used KMCP for a sample with 7500 simulated reads of Schaalia odontolytica and here are the results:

The results when I run with --level species:

ref percentage  coverage    score   chunksFrac  chunksRelDepth  chunksRelDepthStd   reads   ureads  hicureads   refsize refname taxid   rank    taxname taxpath taxpathsn
GCF_009730335.1 3.767191    0.391959    86.15   1   0.91;0.81;1.01;0.89;1.12;1.19;1.24;0.84;0.90;1.09   0.14    6357    6324    1869    2432635     1660    species Schaalia odontolytica   Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica   2;201174;1760;2037;2049;2529408;1660
GCF_005696695.1 0.395847    0.041186    79.23   0.8 1.01;1.04;1.12;1.46;1.05;1.03;0.90;0.63;1.05;0.71   0.22    648 644 67  2360133     1660    species Schaalia odontolytica   Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica   2;201174;1760;2037;2049;2529408;1660

-> There are 2 refs per Schaalia odontolytica with approximately the expected number of reads

The results when I run with --level strain:

ref percentage  coverage    score   chunksFrac  chunksRelDepth  chunksRelDepthStd   reads   ureads  hicureads   refsize refname taxid   rank    taxname taxpath taxpathsn
GCF_009730335.1 4.040299    0.405034    86.15   1   0.91;0.81;1.02;0.88;1.12;1.18;1.25;0.84;0.90;1.09   0.15    6569    6563    1975    2432635     1660    species Schaalia odontolytica   Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica   2;201174;1760;2037;2049;2529408;1660

-> There is only 1 ref per Schaalia odontolytica but the number of target reads has been decreased

Is there a way to make the output table have only 1 ref per species like --level strain but still keep the number of reads like --level species?

shenwei356 commented 7 months ago

I think the fastest way is to write a script to add the read count information of minor strains to the dominant strain for each species.

You can write it by yourself, or wait for me for a few days cause I'm busy recently. It would be easy, because all records are sorted in descending order of abundance.

use a map to store all the data, the key is the species name, the value is the list of all column data (elements) of one record.
read each line,
    split it into a list of strings. The column data (reads number) could be converted to "int"
    if this is the first record of one species:
        save it to the map. 
    else:
        add read number to the existing record.
optionally sort
iterate values of the map, and write to a new file.

nvhphuc1206 commented 7 months ago

Thank you very much,

I'll try to complete it on my own, but if you have the time, I think it will be really helpful for others who have the same purpose as me. I will also feel more secure when using that script because it is highly credible since it was written by you.

Thank you for your great support Shenwei

shenwei356 commented 7 months ago

I assume that you only need the accumulated sum of reads for each species.

Please download the latest versions of csvtk and taxonkit. taxdump.tar.gz is available on the kmcp download page.

We need the speciess name (not the column refname) for all references

tar -zxvf taxdump.tar.gz

# taxid2species
taxonkit  list --ids 1 --data-dir taxdump/ -I "" \
    | taxonkit reformat --data-dir taxdump/ -I 1 -f '{s}' -o taxid2species.tsv

Add species names

cat main.profile \
    | csvtk mutate -t -n species -f taxid \
    | csvtk replace -t -k taxid2species.tsv -f species -p '(.+)' -r '{kv}' \
    > main.profile2

Add the reads numbers for all species

cat main.profile2 \
    | csvtk summary -t -g species -f reads:sum -w 0 \
    > species2reads.tsv

Keep one reference for each species and update the reads number

cat main.profile2 \
    | csvtk uniq -t -f species \
    | csvtk replace -t -f species -k species2reads.tsv -p '(.+)' -r '{kv}' \
    | csvtk cut -t -f 1-7,18,9-17 \
    | csvtk rename -t -f 8 -n reads \
    > main.profile3

csvtk transpose -t main.profile3 | csvtk pretty -Ht -Z -W 50 -x ';'

1    ref                 GCF_009730335.1                                   
2    percentage          3.767191                                          
3    coverage            0.391959                                          
4    score               86.15                                             
5    chunksFrac          1                                                 
6    chunksRelDepth      0.91;0.81;1.01;0.89;1.12;1.19;1.24;0.84;0.90;1.09 
7    chunksRelDepthStd   0.14                                              
8    reads               7005                                              
9    ureads              6324                                              
10   hicureads           1869                                              
11   refsize             2432635                                           
12   refname                                                               
13   taxid               1660                                              
14   rank                species                                           
15   taxname             Schaalia odontolytica                             
16   taxpath             Bacteria;Actinomycetota;Actinomycetes;            
                         Actinomycetales;Actinomycetaceae;Schaalia;        
                         Schaalia odontolytica                             
17   taxpathsn           2;201174;1760;2037;2049;2529408;1660

nvhphuc1206 commented 7 months ago

Thank you very much Shenwei,

I will try using the method you instructed to process the output table from KMCP and perform the downstream analysis steps to see if everything is okay.

I sincerely appreciate your passionate support.

shenwei356 / kmcp

How to profile results only identify 1 species per reference #44