haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

"Number of mapped reads" from log file #144

Open BingjieZhang opened 12 months ago

BingjieZhang commented 12 months ago

Hello Chromap Team,

Thank you very much for actively maintaining Chromap!

I recently used Chromap for mapping scATAC-seq data with a barcode whitelist. I found that the log file is a bit confusing. As stated in the documentation, when barcodes and a whitelist are given as input, Chromap will, by default, estimate barcode abundance and perform barcode correction.

I am looking to understand the following QC numbers from the log file:

  1. The total number of mapped reads (regardless of whether the read has a valid cell barcode or not).
  2. The number of mapped reads that have a valid cell barcode.
  3. The number of deduplicated, uniquely mapped reads.

In relation to these questions:

For Q1, should I refer to the "Number of mapped reads" in the log file?

For Q2, what does "Number of barcodes in whitelist" represent? Does it indicate the number of barcodes, or the number of reads with whitelisted barcodes?

Number of reads: 153220788.
Number of mapped reads: 71076182.
Number of uniquely mapped reads: 66746464.
Number of reads have multi-mappings: 4329718.
Number of candidates: 842925044.
Number of mappings: 71076182.
Number of uni-mappings: 66746464.
Number of multi-mappings: 4329718.
Number of barcodes in whitelist: 37926511.
Number of corrected barcodes: 3772723.
Sorted, deduped and outputed mappings in 48.51s.
uni-mappings: 32567263, # multi-mappings: 1919563, total: 34486826.
Number of output mappings (passed filters): 30174748

These metrics are very useful for my experimental debugging, and I would greatly appreciate your clarification.
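To compare the QC numbers programmatically, the log excerpt above can be parsed into a dictionary. This is a minimal sketch: the `parse_counts` helper and its regular expression are illustrative assumptions, not part of Chromap, and they only pick up entries that begin with "Number of".

```python
import re

# Log text copied from the excerpt above (joined into one string).
LOG = (
    "Number of reads: 153220788. Number of mapped reads: 71076182. "
    "Number of uniquely mapped reads: 66746464. "
    "Number of reads have multi-mappings: 4329718. "
    "Number of candidates: 842925044. Number of mappings: 71076182. "
    "Number of uni-mappings: 66746464. Number of multi-mappings: 4329718. "
    "Number of barcodes in whitelist: 37926511. "
    "Number of corrected barcodes: 3772723. "
    "Sorted, deduped and outputed mappings in 48.51s. "
    "uni-mappings: 32567263, # multi-mappings: 1919563, total: 34486826. "
    "Number of output mappings (passed filters): 30174748"
)


def parse_counts(log_text):
    """Collect every 'Number of <label>: <integer>' pair into a dict.

    Hypothetical helper: the pattern is inferred from this one log
    excerpt and may not cover other Chromap versions or options.
    """
    pairs = re.findall(r"(Number of [^:.]+):\s*(\d+)", log_text)
    return {label: int(value) for label, value in pairs}


counts = parse_counts(LOG)
for label, value in counts.items():
    print(f"{label:45s} {value:>13,d}")
```

The same function works whether the log is on one line or split across lines, because the pattern does not depend on line breaks.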

mourisl commented 12 months ago

For the "number of mapped reads", I believe it is only from those barcode-valid (barcode in the whitelist or corrected barcode) reads.

Q1: If you need the number of unfiltered mapped reads, "number of mapped reads" is the place to look. What number do you have in mind?

Q2: That is the number of reads with whitelisted barcodes.

BingjieZhang commented 12 months ago

Thanks for your response! Sorry, but I'm not sure I fully understand. What do you mean by 'unfiltered' mapped reads? I would like to know the number of mapped reads regardless of whether the reads have a valid cell barcode. I am trying to figure out why I started with 153,220,788 reads but ended up with only 30,174,748, lol.

The reason I am confused is that I also did a bulk mapping of the same sample with Bowtie2. As you can see below, the mapping rate is okay, with an 86.75% overall alignment rate and a 56% unique mapping rate (Bowtie2 counts each paired-end fragment once, so its read count is half of Chromap's, but both tools were run on the same FASTQ files).

However, for Chromap, even before deduplication, the ratio is 37,926,511/153,220,788 = 24.7%. So I want to know at which step I am losing reads. If "Number of mapped reads: 71,076,182" already includes the valid-barcode filtering step (filtering by the whitelist), which reads are filtered out between "Number of uni-mappings: 66,746,464" and "Number of barcodes in whitelist: 37,926,511"? I initially thought "Number of mapped reads" represented the overall mapping rate, but it is far lower than the Bowtie2 result.
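The ratios in question can be laid out side by side with a few lines of Python. This is only a sketch of the arithmetic as posed in this comment: it divides every count by "Number of reads" and therefore assumes all log entries are in the same unit, which the log itself does not state.

```python
# Counts copied from the Chromap log in this thread.
total_reads = 153_220_788  # "Number of reads"

stages = {
    "Number of mapped reads": 71_076_182,
    "Number of uniquely mapped reads": 66_746_464,
    "Number of barcodes in whitelist": 37_926_511,
    "Number of output mappings (passed filters)": 30_174_748,
}

# Fraction of the input that each log entry represents, taken at face value.
fractions = {name: n / total_reads for name, n in stages.items()}

for name, frac in fractions.items():
    print(f"{name:45s} {frac:7.2%}")
```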

Hopefully, I have explained my questions clearly, and thank you very much for your help in advance.

Bulk mapping summary using Bowtie2:

bowtie2 -x /hg38/ -1 $name\_R1_val_1.fq* -2 $name\_R2_val_2.fq* --local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700 -p 5 -q

76500578 reads; of these:
  76500578 (100.00%) were paired; of these:
    10140036 (13.25%) aligned concordantly 0 times
    43298939 (56.60%) aligned concordantly exactly 1 time
    23061603 (30.15%) aligned concordantly >1 times
86.75% overall alignment rate
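As a sanity check on the Bowtie2 summary above, the percentages can be recomputed from the raw pair counts, and the pair count can be converted to read ends for comparison with Chromap's "Number of reads". This sketch assumes that, with `--no-mixed --no-discordant`, the overall alignment rate is simply the share of pairs that aligned concordantly at least once.

```python
# Pair counts copied from the Bowtie2 summary above.
pairs_total = 76_500_578
aligned_zero = 10_140_036   # aligned concordantly 0 times
aligned_once = 43_298_939   # aligned concordantly exactly 1 time
aligned_multi = 23_061_603  # aligned concordantly >1 times

# The three categories should account for every pair.
assert aligned_zero + aligned_once + aligned_multi == pairs_total

unique_rate = aligned_once / pairs_total
overall_rate = (aligned_once + aligned_multi) / pairs_total

# Bowtie2 counts pairs; multiply by two to get read ends.
read_ends = 2 * pairs_total

print(f"unique mapping rate:    {unique_rate:.2%}")
print(f"overall alignment rate: {overall_rate:.2%}")
print(f"read ends seen by Bowtie2: {read_ends:,d}")
```

The read-end count obtained this way (153,001,156) is close to, but not identical to, the 153,220,788 reads Chromap reports; the thread does not explain the difference.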

mourisl commented 12 months ago

Reads with an invalid barcode are not mapped, so the mapped read count does not include them. The number 37926511 counts read fragments (a mate pair counted once), while 153220788 counts read ends (two per fragment). Still, the number of barcodes found in the whitelist is too low, which causes the low overall alignment rate. You can run Chromap without the whitelist and check the alignment rate, which may confirm that the barcode matching step is the culprit.
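Putting the two counts in the same unit makes the loss at the barcode step easier to see. The sketch below follows the explanation above (37926511 in fragments, 153220788 in read ends) and additionally assumes that "Number of mapped reads" is in read ends and that the 3772723 corrected barcodes are not part of the whitelist count; neither assumption is confirmed in this thread.

```python
# Counts copied from the Chromap log in this thread.
read_ends_total = 153_220_788        # "Number of reads" (read ends)
fragments_whitelisted = 37_926_511   # "Number of barcodes in whitelist" (fragments)
mapped_read_ends = 71_076_182        # "Number of mapped reads" (assumed read ends)

# Two read ends per fragment.
fragments_total = read_ends_total // 2

# Share of fragments whose barcode matched the whitelist directly.
whitelist_fraction = fragments_whitelisted / fragments_total

# Share of read ends from whitelisted fragments that were mapped
# (ignores corrected barcodes, so this is a rough upper estimate).
mapped_fraction_of_valid = mapped_read_ends / (2 * fragments_whitelisted)

print(f"fragments in total:            {fragments_total:,d}")
print(f"barcode in whitelist:          {whitelist_fraction:.1%}")
print(f"mapped among whitelisted ends: {mapped_fraction_of_valid:.1%}")
```

Under these assumptions, roughly half of the fragments pass the whitelist, and most of the reads that do pass are mapped, which is consistent with the barcode match being the step where reads are lost.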