Closed alvanuffelen closed 1 year ago
Hi!
Following the KMA-way of naming individual and PE reads: two (paired) sequence reads will form a 'fragment' (which is also called 'number of PE reads' in some studies).
I have no experience with the combination of PE reads and KMA, so I can't speak on that. But in the case of SE reads, that formula doesn't hold up. I have a simple test case (see attached) that contains a reference sequence and 3 query sequences that map to the reference sequence at 100% identity (adapted from an issue at KMA). The output for mapstat is:
## method KMA
## version 1.3.28
## database db
## fragmentCount 3
## date 2023-03-21
## command kma -i "query.fna" -a -o "mapped/results" -t_db db -1t1 -mem_mode -ef
# refSequence readCount fragmentCount mapScoreSum refCoveredPositions refConsensusSum bpTotal depthVariance nucHighDepthVariance depthMax snpSum insertSum deletionSum readCountAln fragmentCountAln
ref1A 3 3 240 240 240 240 0.057476 240 1 0 0 0 3 3
Using the formula, the total read count is 6 and the perc_map is 50%, which is incorrect. ref.fna query.fna
Hi,
Could you tell me the commands that you used to obtain the %50 fraction from CCMetagen and send me your .res and .mapstat files for this test?
Thanks!
The command I used is:
CCMetagen.py -i mapped/results.res -o results -ef y -map mapped/results.mapstat -d 0.01 -c 5
This results in:
| # refSequence | readCount | fragmentCount | mapScoreSum | refCoveredPositions | refConsensusSum | bpTotal | depthVariance | nucHighDepthVariance | depthMax | snpSum | insertSum | deletionSum | readCountAln | fragmentCountAln | perc_map |
|---------------|-----------|---------------|-------------|---------------------|-----------------|---------|---------------|----------------------|----------|--------|-----------|-------------|--------------|------------------|----------|
| 562|ref1A | 3 | 3 | 240 | 240 | 240 | 240 | 0.057476 | 240 | 1 | 0 | 0 | 0 | 3 | 3 | 50.0 |
Additionally, I noticed a second bug while using KMA v 1.3.28.
The .res
file is read in through:
df = pd.read_csv(f, sep='\t', index_col=0, encoding='latin1')
The problem is that the .res
file is not a properly tab-separated file when using this version of KMA:
So the index becomes 562|ref1A
.
This is a problem when using this index to filter df_stats
on line 363.
df_stats
is a DataFrame of .mapstat
which is properly tab-separated.
Therefore, df_stats_filt
will always be empty because the indices of df
and df_stats
are never the same due to the extra whitespaces.
kma_1_3_28.results.zip
Hi! Sorry for yet another question - which version of CCMetagen are you using? I wasn't able to reproduce the 50% issue - when running here I got that 100% of the reads were mapped to the database. We did fix a bug we had when calculating % coverage a while ago, which from memory only affected the displayed % but not the abundance table. If you do have an older version, could you try updating it?
And thanks for pointing out the issue with KMA v1.3.28, that is good to know! Please use version > 1.4 for these analyses.
Thanks!
Hi! I am using CCMetagen version 1.4.0.
I use the command:
CCMetagen.py -i mapped/results.res -o results -ef y -map mapped/results.mapstat -d 0.01 -c 5
which generates following output:
Reading file mapped/results.res
csv file saved as results.ccm.csv
Writing results.html...
krona file saved as results.html
calculating read mapping stats...
Stats file saved as results_stats.csv
Proportion of reads mapped to the database: 50.000000%
It is strange that you get 100%.
The formula for the total number of reads is total_reads = 2 * int(total_frags)
.
The total_frags
is taken from the .mapstat
file at line 4 (## fragmentCount 3
).
So the total_reads
is 6 according to the formula, but my query.fna
only contains 3 threads.
Therefore, the proportion of mapped reads is 50% instead of 100%.
Hi!
Ha, it turns out that I have an older version of CCMetagen! (new computer, something went wrong when syncing) So it seems we introduced that error in one of the updates. I'll check that in more detail (but probably only next week, sorry).
Meanwhile, if you use version 1.2.3 (the one I have atm) you should get the correct fraction. Let me know if that works.
Thanks for spotting this!
Indeed, using version 1.2.3 gives 100% (although in this version, it is mapped fragments instead of mapped reads). Thank you!
Hi again!
So - the problem you detected was introduced in our last CCMetagen version (the edit was made to fix an issue when the user does not force paired-end reads to map to the same template, but I didn't realise at the time that this would affect analyses using single-end reads).
I've now added a step to check whether the analyses were based on single or paired end reads - the updated version is 1.4.1.
Sorry for the inconvenience, and thanks a lot for reporting and helping me identify the issue.
Cheers
While calculating the read mapping stats, following formula is applied:
total_reads = 2 * int(total_frags)
How did you arrive at this formula?