Open nvhphuc1206 opened 7 months ago
Hi Phuc,
Thanks for using KMCP.
I want the output table to contain results according to each desired rank, for example, is it possible to output only Genus ranks or only Species ranks?
You might add the output in CAMI format, where the abundance of each rank is provided.
-C, --cami-report string ► Save extra CAMI-like report.
Or MetaPhlan format, which might have more downstream analysis tools.
-M, --metaphlan-report string ► Save extra metaphlan-like report.
I want the results to only return one ref per species based on the number of reads or based on the score, is that possible?
You can add --level strain
.
--level string ► Level to estimate abundance at. Available values: species,
strain/assembly. (default "species")
Thanks Shenwei, but it seems my intended use is a bit more complicated
You might add the output in CAMI format or MetaPhlan format ...
I want to further use taxpasta after profiling with KMCP but currently taxpasta only supports main output of KMCP and I have tried using CAMI and MetaPhlan format output from KMCP but taxpasta gives an error so I am aiming to use KMCP's main.profile
. Therefore, I need to use the main output of KMCP, do you have any solution?
You can add --level strain.
I tried using --level strain
, but it seems that when I add this argument, the target number of reads is lost. Specifically, I used KMCP for a sample with 7500 simulated reads of Schaalia odontolytica and here are the results:
The results when I run with --level species
:
ref percentage coverage score chunksFrac chunksRelDepth chunksRelDepthStd reads ureads hicureads refsize refname taxid rank taxname taxpath taxpathsn
GCF_009730335.1 3.767191 0.391959 86.15 1 0.91;0.81;1.01;0.89;1.12;1.19;1.24;0.84;0.90;1.09 0.14 6357 6324 1869 2432635 1660 species Schaalia odontolytica Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica 2;201174;1760;2037;2049;2529408;1660
GCF_005696695.1 0.395847 0.041186 79.23 0.8 1.01;1.04;1.12;1.46;1.05;1.03;0.90;0.63;1.05;0.71 0.22 648 644 67 2360133 1660 species Schaalia odontolytica Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica 2;201174;1760;2037;2049;2529408;1660
-> There are 2 refs per Schaalia odontolytica with approximately the expected number of reads
The results when I run with --level strain
:
ref percentage coverage score chunksFrac chunksRelDepth chunksRelDepthStd reads ureads hicureads refsize refname taxid rank taxname taxpath taxpathsn
GCF_009730335.1 4.040299 0.405034 86.15 1 0.91;0.81;1.02;0.88;1.12;1.18;1.25;0.84;0.90;1.09 0.15 6569 6563 1975 2432635 1660 species Schaalia odontolytica Bacteria;Actinomycetota;Actinomycetes;Actinomycetales;Actinomycetaceae;Schaalia;Schaalia odontolytica 2;201174;1760;2037;2049;2529408;1660
-> There is only 1 ref per Schaalia odontolytica but the number of target reads has been decreased
--level strain
but still keep the number of reads like --level species
?I think the fastest way is to write a script to add the read count information of minor strains to the dominant strain for each species.
You can write it by yourself, or wait for me for a few days cause I'm busy recently. It would be easy, because all records are sorted in descending order of abundance.
use a map to store all the data, the key is the species name, the value is the list of all column data (elements) of one record.
read each line,
split it into a list of strings. The column data (reads number) could be converted to "int"
if this is the first record of one species:
save it to the map.
else:
add read number to the existing record.
optionally sort
iterate values of the map, and write to a new file.
Thank you very much,
I'll try to complete it on my own, but if you have the time, I think it will be really helpful for others who have the same purpose as me. I will also feel more secure when using that script because it is highly credible since it was written by you.
Thank you for your great support Shenwei
I assume that you only need the accumulated sum of reads for each species.
Please download the latest versions of csvtk and taxonkit. taxdump.tar.gz
is available on the kmcp download page.
We need the speciess name (not the column refname
) for all references
tar -zxvf taxdump.tar.gz
# taxid2species
taxonkit list --ids 1 --data-dir taxdump/ -I "" \
| taxonkit reformat --data-dir taxdump/ -I 1 -f '{s}' -o taxid2species.tsv
Add species names
cat main.profile \
| csvtk mutate -t -n species -f taxid \
| csvtk replace -t -k taxid2species.tsv -f species -p '(.+)' -r '{kv}' \
> main.profile2
Add the reads numbers for all species
cat main.profile2 \
| csvtk summary -t -g species -f reads:sum -w 0 \
> species2reads.tsv
Keep one reference for each species and update the reads number
cat main.profile2 \
| csvtk uniq -t -f species \
| csvtk replace -t -f species -k species2reads.tsv -p '(.+)' -r '{kv}' \
| csvtk cut -t -f 1-7,18,9-17 \
| csvtk rename -t -f 8 -n reads \
> main.profile3
csvtk transpose -t main.profile3 | csvtk pretty -Ht -Z -W 50 -x ';'
1 ref GCF_009730335.1
2 percentage 3.767191
3 coverage 0.391959
4 score 86.15
5 chunksFrac 1
6 chunksRelDepth 0.91;0.81;1.01;0.89;1.12;1.19;1.24;0.84;0.90;1.09
7 chunksRelDepthStd 0.14
8 reads 7005
9 ureads 6324
10 hicureads 1869
11 refsize 2432635
12 refname
13 taxid 1660
14 rank species
15 taxname Schaalia odontolytica
16 taxpath Bacteria;Actinomycetota;Actinomycetes;
Actinomycetales;Actinomycetaceae;Schaalia;
Schaalia odontolytica
17 taxpathsn 2;201174;1760;2037;2049;2529408;1660
Thank you very much Shenwei,
I will try using the method you instructed to process the output table from KMCP and perform the downstream analysis steps to see if everything is okay.
I sincerely appreciate your passionate support.
Dear Shenwei,
Thank you very much for providing a good tool, I am trying to adjust the output table as desired according to KMCP format.
Thank you so much,
Best,
Phuc