Closed dbrg77 closed 3 months ago
The low mapq can be due to several reasons. A multiple alignment will result in mapq 0. On the other hand, if the primary alignment is bad (has many alignment errors) and the secondary alignment is only slightly worse, that can also produce mapq 0. I think this explains all three of your questions.
Is this related to the maximum insert size and other filters that I don't know?
If a fragment has an insert size longer than the specified value, it will be treated as unmapped.
Hope this helps.
Okay, thanks for the explanation.
I was wondering why the numbers are not the same but quite close :-)
If I understand correctly, in my previous example, the number of multi-mapping reads is 18459006, and the number of mapq=0 reads is 18987814. The difference is 18987814 - 18459006 = 528808.
These 528808 reads are considered uniquely mapped, but they had poor alignments and ended up with mapq=0 for a different reason.
It makes sense now.
Thank you very much!
Xi
Oh, I just realised that I have another related question.
I see that `summary.csv` contains 5090765 barcodes, but the 10x whitelist `737K-cratac-v1_rc.txt` only has 737280 barcodes. I guess `summary.csv` contains information for all barcodes, regardless of whether they are in the whitelist or not.
So I want to check how many fragments are there in my experiments from barcodes that are in the whitelist. This is what I do:
```shell
join -1 1 -2 1 \
  <(sort ../737K-cratac-v1_rc.txt) \
  <(sed -n '2,$ p' summary.csv | sed 's/,/\t/g' | sort -k1,1) | \
  sed 's/ /,/g' > summary_in_whitelist.csv
```
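For what it's worth, the same whitelist-restricted sum can be computed in one awk pass, without the intermediate file. This is just a sketch: it assumes column 1 of `summary.csv` is the barcode and column 2 is `total`; adjust the column index if the layout differs.

```shell
# Sum the 'total' column for barcodes present in the whitelist.
# Assumes: column 1 = barcode, column 2 = total (adjust if needed).
awk -F',' 'NR==FNR { wl[$1] = 1; next }         # first file: load whitelist
           FNR > 1 && ($1 in wl) { sum += $2 }  # second file: skip header, sum
           END { print sum }' \
    ../737K-cratac-v1_rc.txt summary.csv
```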
It turns out that the sum of the total column of `summary_in_whitelist.csv` is 456955085.
However, the stderr says:
Number of barcodes in whitelist: 438281445.
Number of corrected barcodes: 18669661.
Then 438281445 + 18669661 = 456951106, which is so close to 456955085, but not the same.
Could you help explain?
Thanks!
Xi
That's strange; barcodes not in the whitelist should be excluded from the summary. I'll look into this issue.
Oh, I see. The whitelist-only summary is in another dev branch; I'll merge the update in the future. The current version does include non-whitelist barcodes in the summary.
As for the number discrepancy, I think those are due to 'N's in the barcode sequence. Internally, an 'N' is regarded as 'A' in some cases for simplicity. Therefore, reads from a barcode like "ACGTNN" cannot be corrected (too many N's), but they might still be counted under the "ACGTAA" barcode in the summary. This is a bug, and it is fixed in the whitelist-only summary branch. It also reminds me that I should handle the N case specially when no whitelist is given. Thank you for scrutinizing the numbers!
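A toy illustration of the double counting described above (this is not chromap's actual code, just a sketch of the N-to-A simplification):

```shell
# Two distinct raw barcodes collapse into one after the N -> A substitution,
# so the "ACGTAA" row absorbs the read from the uncorrectable "ACGTNN" barcode.
printf 'ACGTNN\nACGTAA\nACGTAA\n' |
  tr 'N' 'A' |
  sort | uniq -c    # prints "3 ACGTAA" instead of the 2 true ACGTAA reads
```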
AH I see I see! Good to know that!
Thank you for the clarification, and I really look forward to the next update :-)
Have a good weekend!
Xi
I have added the whitelist-only summary to the `li_dev7` branch. If you have time, please give it a test to see whether the results make sense. Next, I'm working on adding a category in the summary file for uncorrectable barcodes. I have several other deadlines, so I can probably only wrap up this feature next week.
Thanks! I have now tested the `li_dev7` branch. There are now 454495 barcodes in the `summary.csv` file. The sum of the total column is 456951106, which is equal to 438281445 + 18669661.
I can confirm that the numbers are now consistent!
Thank you very much for the update!!!!
Xi
Thank you for the tests! I just pushed a new update to the `li_dev7` branch that adds a row "non-whitelist" at the end of the summary file, representing the number of fragments whose barcodes cannot be corrected. Could you give it a try if it is convenient? If the update works, we will merge it into the master branch and draft a new release. Thank you!
That's great! I have just tested the latest `li_dev7` and can confirm it works on my previous sample. There is now an extra row that starts with 'non-whitelist'. The sums of the columns for whitelist and non-whitelist in my test are:
| barcode | total | duplicate | unmapped | lowmapq |
|---|---|---|---|---|
| whitelist | 456951106 | 234736148 | 51501725 | 9493907 |
| non-whitelist | 18630486 | 0 | 18630486 | 0 |
which matches the numbers from the stderr.
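In case it is useful to others, the two rows can be reproduced with a single awk pass over `summary.csv`. This sketch assumes the column order `barcode,total,duplicate,unmapped,lowmapq` and that the non-whitelist counts sit in the row literally named `non-whitelist`:

```shell
# Group every data row into 'whitelist' or 'non-whitelist' and sum each column.
awk -F',' 'FNR > 1 {
    grp = ($1 == "non-whitelist") ? "non-whitelist" : "whitelist"
    t[grp] += $2; d[grp] += $3; u[grp] += $4; l[grp] += $5
  }
  END {
    for (g in t)
      printf "%s total=%d duplicate=%d unmapped=%d lowmapq=%d\n",
             g, t[g], d[g], u[g], l[g]
  }' summary.csv
```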
Thanks for the update!
Xi
Thank you for helping with the test! We have merged these updates to the master branch. Please feel free to reopen this issue if you find any other discrepancies. I'm working on another bug fix and will release a new version after that.
Hello there,
Thanks for adding the `--summary` feature mentioned in #107, which really helps with the troubleshooting. I'm currently trying to understand what the numbers mean, because they do not add up to what is shown in the log (`stderr`) during the run.

I ran an experiment done with 10x scATAC-seq, and the data was processed using `0.2.6-r490` with the `-q 1 --preset atac` flags for the sake of understanding the numbers (otherwise, we normally use the default `-q 30`), like this:

The last few lines from `chromap.stderr` are:

Then I sum up all columns from the `summary.csv` file, and I get:

```
total    : 475581592
duplicate: 234736148
unmapped : 70132211
lowmapq  : 9493907
```
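Sums like these can be computed with an awk one-liner along these lines (the column order `barcode,total,duplicate,unmapped,lowmapq` is assumed from the header):

```shell
# Sum each numeric column of summary.csv, skipping the header row.
awk -F',' 'FNR > 1 { for (i = 2; i <= 5; i++) s[i] += $i }
  END {
    printf "total    : %d\n", s[2]
    printf "duplicate: %d\n", s[3]
    printf "unmapped : %d\n", s[4]
    printf "lowmapq  : %d\n", s[5]
  }' summary.csv
```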
I can see that total * 2 = 951163184, which is consistent with the first line of the stderr file, and also matches the record from our raw `fastq` files (475581592 reads in each `fastq` file).

Then, the stderr file told me that:
which makes sense, since 810898762 = 792439756 + 18459006. Now if I check the `summary.csv` file, I see that (total - unmapped) * 2 = 810898762, which is consistent with the stderr file. However, I don't know what "Number of uniquely mapped reads" and "Number of reads have multi-mappings" really mean.

I think multi-mapping reads should have mapq=0. Since I set `-q 1`, I expected the lowmapq column of the `summary.csv` file to be exactly the multi-mapping reads. However, lowmapq * 2 = 18987814, which is higher than the 18459006 indicated by the stderr file. In addition, my guess is that uniquely mapped reads should be total - unmapped - lowmapq, which gives (475581592 - 70132211 - 9493907) * 2 = 791910948, lower than the 792439756 indicated by the stderr file.

My first question is whether there is any explanation for these discrepancies between the stderr and summary files?
Then, the stderr file has these lines:
My second question is: what are those numbers? They seem to be just the same numbers as those in lines 2, 3, and 4 of the stderr file, respectively.
My third question is about the last two lines:

My understanding of these numbers is that they are the reads left after deduplication. The last number is 161219326, which is basically the number of lines in the `fragments.tsv` file. However, the number of uni-mappings is 161581509, which is slightly higher. So, what is further removed from 161581509 to get the final 161219326? Is it related to the maximum insert size and other filters that I don't know about?

Sorry about the long questions. We are shifting our pipeline to use chromap fully for various scATAC-seq assays from different technologies (not only 10x), and I want to make sure I understand these numbers.
Thank you in advance!
Regards, Xi