haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License
192 stars 21 forks source link

Understanding the multi-mapping reads and whether they are part of the bed file #151

Closed NPSDC closed 8 months ago

NPSDC commented 8 months ago

I am trying to analyze single-cell ATAC-Seq dataset with chromap. I am attaching the screenshot below the output of stdout which is obtained after running chromap on a dataset against the human genome. I am trying to understand the output from the last 3 lines. I have the following questions

  1. The mapping information has been reported after deduplication and contains the mapping information w.r.t read pair and not the individual mates, contrary to the mapping information provided in the lines above it?
  2. The filtering criteria is applied only to the reads/fragments obtained after deduplication and not to the reads that were mapped prior to the deduplication step? Could you also specify the filtering criteria (mapq < threshold (30 for ATAC-Seq, I guess), something else)? (I have used the default mode for ATAC-Seq, and not used any whitelist for the moment.)
  3. If a read(pair) multimaps to the reference (lets say they all have same MAPQ scores), are all the instances of that hit reported or only a single instance is picked randomly and reported in the bed file. I have run Bowtie2 independently on the dataset and know for a fact there exists multimapping hits with the same mapping quality scores for a certain set of reads. image
mourisl commented 8 months ago

Sorry for the delayed reply.

  1. Yes, they correspond to read fragment.
  2. The filter is applied after deduplication. THe filter is just mapq < 30. For the duplicates (read pairs span the same interval), we use the max{mapq} among them.
  3. The multiple mapped reads always have mapq 0, so they will be filtered.

Hope this helps.

NPSDC commented 8 months ago

Thanks for getting back. Yes, it clears it. We do not keep the multi mapping reads and reads with lower mapping rate (mapq < 30).