vrmarcelino / CCMetagen

Microbiome classification pipeline
GNU General Public License v3.0
63 stars 19 forks source link

Total reads equal to twice the number of fragments? #47

Closed alvanuffelen closed 1 year ago

alvanuffelen commented 1 year ago

While calculating the read mapping stats, following formula is applied: total_reads = 2 * int(total_frags) How did you arrive at this formula?

vrmarcelino commented 1 year ago

Hi!

Following the KMA-way of naming individual and PE reads: two (paired) sequence reads will form a 'fragment' (which is also called 'number of PE reads' in some studies).

alvanuffelen commented 1 year ago

I have no experience with the combination of PE reads and KMA, so I can't speak on that. But in the case of SE reads, that formula doesn't hold up. I have a simple test case (see attached) that contains a reference sequence and 3 query sequences that map to the reference sequence at 100% identity (adapted from an issue at KMA). The output for mapstat is:

## method   KMA
## version  1.3.28
## database db
## fragmentCount    3
## date 2023-03-21
## command  kma -i "query.fna" -a -o "mapped/results" -t_db db -1t1 -mem_mode -ef 
# refSequence   readCount   fragmentCount   mapScoreSum refCoveredPositions refConsensusSum bpTotal depthVariance   nucHighDepthVariance    depthMax    snpSum  insertSum   deletionSum readCountAln    fragmentCountAln
ref1A   3   3   240 240 240 240 0.057476    240 1   0   0   0   3   3

Using the formula, the total read count is 6 and the perc_map is 50%, which is incorrect. ref.fna query.fna

vrmarcelino commented 1 year ago

Hi,

Could you tell me the commands that you used to obtain the %50 fraction from CCMetagen and send me your .res and .mapstat files for this test?

Thanks!

alvanuffelen commented 1 year ago

The command I used is: CCMetagen.py -i mapped/results.res -o results -ef y -map mapped/results.mapstat -d 0.01 -c 5

This results in:

| # refSequence | readCount | fragmentCount | mapScoreSum | refCoveredPositions | refConsensusSum | bpTotal | depthVariance | nucHighDepthVariance | depthMax | snpSum | insertSum | deletionSum | readCountAln | fragmentCountAln | perc_map |
|---------------|-----------|---------------|-------------|---------------------|-----------------|---------|---------------|----------------------|----------|--------|-----------|-------------|--------------|------------------|----------|
| 562|ref1A    | 3         | 3             | 240         | 240                 | 240             | 240     | 0.057476      | 240                  | 1        | 0      | 0         | 0           | 3            | 3                | 50.0     |

kma_1_4_11.results.zip

Additionally, I noticed a second bug while using KMA v 1.3.28. The .res file is read in through: df = pd.read_csv(f, sep='\t', index_col=0, encoding='latin1') The problem is that the .res file is not a properly tab-separated file when using this version of KMA: image So the index becomes 562|ref1A. This is a problem when using this index to filter df_stats on line 363. df_stats is a DataFrame of .mapstat which is properly tab-separated. Therefore, df_stats_filt will always be empty because the indices of df and df_stats are never the same due to the extra whitespaces. kma_1_3_28.results.zip

vrmarcelino commented 1 year ago

Hi! Sorry for yet another question - which version of CCMetagen are you using? I wasn't able to reproduce the 50% issue - when running here I got that 100% of the reads were mapped to the database. We did fix a bug we had when calculating % coverage a while ago, which from memory only affected the displayed % but not the abundance table. If you do have an older version, could you try updating it?

And thanks for pointing out the issue with KMA v1.3.28, that is good to know! Please use version > 1.4 for these analyses.

Thanks!

alvanuffelen commented 1 year ago

Hi! I am using CCMetagen version 1.4.0.

I use the command: CCMetagen.py -i mapped/results.res -o results -ef y -map mapped/results.mapstat -d 0.01 -c 5 which generates following output:

Reading file mapped/results.res
csv file saved as results.ccm.csv

Writing results.html...
krona file saved as results.html

calculating read mapping stats...

Stats file saved as results_stats.csv

Proportion of reads mapped to the database: 50.000000%

It is strange that you get 100%. The formula for the total number of reads is total_reads = 2 * int(total_frags). The total_frags is taken from the .mapstat file at line 4 (## fragmentCount 3). So the total_reads is 6 according to the formula, but my query.fna only contains 3 threads. Therefore, the proportion of mapped reads is 50% instead of 100%.

vrmarcelino commented 1 year ago

Hi!

Ha, it turns out that I have an older version of CCMetagen! (new computer, something went wrong when syncing) So it seems we introduced that error in one of the updates. I'll check that in more detail (but probably only next week, sorry).

Meanwhile, if you use version 1.2.3 (the one I have atm) you should get the correct fraction. Let me know if that works.

Thanks for spotting this!

alvanuffelen commented 1 year ago

Indeed, using version 1.2.3 gives 100% (although in this version, it is mapped fragments instead of mapped reads). Thank you!

vrmarcelino commented 1 year ago

Hi again!

So - the problem you detected was introduced in our last CCMetagen version (the edit was made to fix an issue when the user does not force paired-end reads to map to the same template, but I didn't realise at the time that this would affect analyses using single-end reads).

I've now added a step to check whether the analyses were based on single or paired end reads - the updated version is 1.4.1.

Sorry for the inconvenience, and thanks a lot for reporting and helping me identify the issue.

Cheers