haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
187 stars · 20 forks

Understanding the terminologies from stderr and summary #167

Closed dbrg77 closed 3 weeks ago

dbrg77 commented 1 month ago

Hello there,

Thanks for adding the --summary feature mentioned in #107, which really helps with troubleshooting. I'm currently trying to understand what the reported numbers mean, because they do not add up to what is shown in the log (stderr) during the run.

I ran a 10x scATAC-seq experiment, and the data was processed with 0.2.6-r490 using the -q 1 --preset atac flags for the sake of understanding the numbers (otherwise, we normally use the default -q 30), like this:

chromap \
    -t 40 \
    -x genome.index \
    -r genome.fa \
    -q 1 \
    --preset atac --summary chromap_out/summary.csv \
    -1 fastq/mCortex_ATAC_S1_L001_R1_001.fastq.gz \
    -2 fastq/mCortex_ATAC_S1_L001_R2_001.fastq.gz \
    -b fastq/mCortex_ATAC_S1_L001_I2_001.fastq.gz \
    --barcode-whitelist 737K-cratac-v1_rc.txt -o chromap_out/fragments.tsv \
    1> chromap.stdout 2> chromap.stderr

The last few lines from chromap.stderr are:

Number of reads: 951163184.
Number of mapped reads: 810898762.
Number of uniquely mapped reads: 792439756.
Number of reads have multi-mappings: 18459006.
Number of candidates: 16143953592.
Number of mappings: 810898762.
Number of uni-mappings: 792439756.
Number of multi-mappings: 18459006.
Number of barcodes in whitelist: 438281445.
Number of corrected barcodes: 18669661.
Sorted, deduped and outputed mappings in 268.81s.
# uni-mappings: 161581509, # multi-mappings: 6588743, total: 168170252.
Number of output mappings (passed filters): 161219326

Then I sum up all columns from the summary.csv file, and I get:

total: 475581592, duplicate: 234736148, unmapped: 70132211, lowmapq: 9493907

I can see that total × 2 = 951163184, which is consistent with the first line of the stderr output and also matches the record from our raw fastq files (475581592 reads in each fastq file).

Then, the stderr file told me that:

Number of mapped reads: 810898762.
Number of uniquely mapped reads: 792439756.
Number of reads have multi-mappings: 18459006.

which makes sense, since 810898762 = 792439756 + 18459006. Now if I check the summary.csv file, I see that (total - unmapped) * 2 = 810898762, which is consistent with the stderr file. However, I don't know what Number of uniquely mapped reads and Number of reads have multi-mappings really mean.

I think multi-mapping reads should have mapq=0. Since I set -q 1, I expect lowmapq from the summary.csv file to be just the multi-mapping reads. However, lowmapq × 2 = 18987814, which is higher than the 18459006 indicated by the stderr file. In addition, my guess is that uniquely mapped reads should be total - unmapped - lowmapq, i.e. (475581592 - 70132211 - 9493907) × 2 = 791910948, which is lower than the 792439756 indicated by the stderr file.
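For reference, the comparison above can be written out as shell arithmetic (all numbers are taken from this run; summary.csv counts read pairs while the stderr counts individual reads, hence the doubling):

```shell
# Cross-checking summary.csv column sums against the stderr counts.
total=475581592; unmapped=70132211; lowmapq=9493907
echo $(( total * 2 ))                         # 951163184  = "Number of reads"
echo $(( (total - unmapped) * 2 ))            # 810898762  = "Number of mapped reads"
echo $(( lowmapq * 2 ))                       # 18987814   > 18459006 multi-mappings
echo $(( (total - unmapped - lowmapq) * 2 ))  # 791910948  < 792439756 uniquely mapped
```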

My first question is whether there is any explanation about those discrepancies between the stderr and summary files?

Then, the stderr file has these lines:

Number of mappings: 810898762.
Number of uni-mappings: 792439756.
Number of multi-mappings: 18459006.

My second question is: what are those numbers? They seem to be the same as those in lines 2, 3 and 4 of the stderr output, respectively.

My third question is about the last two lines

# uni-mappings: 161581509, # multi-mappings: 6588743, total: 168170252.
Number of output mappings (passed filters): 161219326

My understanding is that these are the counts after deduplication. The last number is 161219326, which is basically the number of lines in the fragments.tsv file. However, the number of uni-mappings is 161581509, which is slightly higher. So, what is further removed from 161581509 to get the final 161219326? Is this related to the maximum insert size or other filters that I don't know about?
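The gap in question can be made explicit with shell arithmetic (numbers copied from the stderr quoted above):

```shell
# Post-deduplication mapping counts from the stderr above.
echo $(( 161581509 + 6588743 ))    # 168170252: uni- plus multi-mappings = total
echo $(( 161581509 - 161219326 ))  # 362183: uni-mappings dropped by the remaining filters
```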

Sorry about those long questions. We are shifting our pipeline to use chromap fully for various scATAC-seq assays from different technologies (not only 10x), and I want to make sure I understand those numbers.

Thank you in advance!

Regards, Xi

mourisl commented 1 month ago

Low mapq can have several causes. A multiple alignment will result in mapq 0. But on the other hand, if the primary alignment is bad (has many alignment errors) and the secondary alignment is only slightly worse, this can also result in mapq 0. I think this explains all three of your questions.

Is this related to the maximum insert size and other filters that I don't know?

If a fragment has an insert size longer than the specified value, it will be reported as unmapped.

Hope this helps.

dbrg77 commented 1 month ago

Okay, thanks for the explanation.

I was wondering why the numbers are not the same but quite close :-)

If I understand correctly, in my previous example, the number of multi-mapping reads is 18459006, and the number of mapq=0 (lowmapq) reads is 18987814. The difference is 18987814 - 18459006 = 528808.

These 528808 reads are considered uniquely mapped, but they had poor alignment and resulted in mapq=0 for a different reason.

It makes sense now.

Thank you very much!

Xi

dbrg77 commented 1 month ago

Oh, I just realised that I have another related question.

I see that summary.csv contains 5090765 barcodes, but the 10x whitelist 737K-cratac-v1_rc.txt only has 737280 barcodes. I guess summary.csv contains information for all barcodes, regardless of whether they are in the whitelist.

So I want to check how many fragments in my experiment come from barcodes that are in the whitelist. This is what I do:

join -1 1 -2 1 \
    <(sort ../737K-cratac-v1_rc.txt) \
    <(sed -n '2,$ p' summary.csv | sed 's/,/\t/g' | sort -k1,1) | \
    sed 's/ /,/g' > summary_in_whitelist.csv

It turns out the sum of the total column of summary_in_whitelist.csv is 456955085.
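As a side note, the same whitelist-restricted sum can be computed in a single awk pass instead of sort + join (a sketch, assuming the barcode is column 1 and the total count is column 2 of summary.csv):

```shell
# Sum the "total" column of summary.csv for barcodes that appear
# in the whitelist. NR==FNR is true only while reading the first
# file (the whitelist); FNR > 1 skips the summary.csv header row.
awk -F',' 'NR==FNR { wl[$1]; next }
           FNR > 1 && $1 in wl { sum += $2 }
           END { print sum }' 737K-cratac-v1_rc.txt summary.csv
```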

However, the stderr says:

Number of barcodes in whitelist: 438281445.
Number of corrected barcodes: 18669661.

Then 438281445 + 18669661 = 456951106, which is very close to 456955085, but not the same.

Could you help explain?

Thanks!

Xi

mourisl commented 1 month ago

That's strange: barcodes not in the whitelist should be excluded from the summary. I'll look into this issue.

mourisl commented 1 month ago

Oh, I see. The whitelist-only summary is in a separate dev branch; I'll merge the update in the future. The current version does include non-whitelist barcodes in the summary.

As for the number discrepancy, I think it is due to 'N's in the barcode sequence. Internally, an 'N' is regarded as an 'A' in some cases for simplicity. Therefore, reads from a barcode like "ACGTNN" cannot be corrected because of too many N's, but they might still be counted under the "ACGTAA" barcode in the summary. This is a bug, and it is fixed in the whitelist-only summary branch. It also reminds me that I should handle the N case specially when no whitelist is given. Thank you for scrutinizing the numbers!
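The collapse described above can be illustrated in one line (a toy example, not chromap's actual code path):

```shell
# Treating every 'N' as 'A' makes "ACGTNN" indistinguishable
# from the real barcode "ACGTAA" when tallying the summary.
echo "ACGTNN" | tr 'N' 'A'   # prints ACGTAA
```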

dbrg77 commented 1 month ago

AH I see I see! Good to know that!

Thank you for the clarification, and I really look forward to the next update :-)

Have a good weekend!

Xi

mourisl commented 1 month ago

I have added the whitelist-only summary to the li_dev7 branch. If you have time, please give it a test to see whether the result makes sense. Next, I'm working on adding a category in the summary file for uncorrectable barcodes. I have several other deadlines, so I can probably only wrap up this feature next week.

dbrg77 commented 1 month ago

Thanks! I have now tested the li_dev7 branch. There are now 454495 barcodes in the summary.csv file. The sum of the total column is 456951106, which equals 438281445 + 18669661.

I can confirm that the numbers are now consistent!

Thank you very much for the update!!!!

Xi

mourisl commented 1 month ago

Thank you for the tests! I just pushed a new update to the li_dev7 branch that adds a row "non-whitelist" to the end of the summary file, representing the number of fragments whose barcodes cannot be corrected. Could you please give it a try when convenient? If the update works, we will merge it into the master branch and draft a new release. Thank you!

dbrg77 commented 3 weeks ago

That's great! I have just tested the latest li_dev7 and can confirm it works on my previous sample. There is now an extra row that starts with 'non-whitelist'. The sums of the columns for whitelist and non-whitelist barcodes in my test are:

| barcode | total | duplicate | unmapped | lowmapq |
|---|---|---|---|---|
| whitelist | 456951106 | 234736148 | 51501725 | 9493907 |
| non-whitelist | 18630486 | 0 | 18630486 | 0 |

which matches the numbers from the stderr.
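Concretely, the two rows tie back to the earlier numbers like this (shell arithmetic over the figures quoted in this thread):

```shell
# The whitelist row's total equals in-whitelist plus corrected barcodes,
# and the earlier unmapped count splits across the two rows.
echo $(( 438281445 + 18669661 ))  # 456951106: whitelist total
echo $(( 51501725 + 18630486 ))   # 70132211: whitelist + non-whitelist unmapped
```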

Thanks for the update!

Xi

mourisl commented 3 weeks ago

Thank you for helping with the test! We have merged these updates to the master branch. Please feel free to reopen this issue if you find any other discrepancies. I'm working on another bug fix and will release a new version after that.