velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
160 stars 83 forks source link

10x bamfile removes intronic reads? #102

Open davidhbrann opened 6 years ago

davidhbrann commented 6 years ago

Hi,

I recently ran the velocyto run10x command on the output folder I got from cellranger count. For the most part, things ran smoothly. However, I'm concerned about two things: whether the possorted_genome_bam.bam actually contains all the intronic reads mapped by STAR and whether all the reads it annotates as exonic are actually from reads that should be mapped to exons (and not introns or intergenic regions).

This is because I believe when cellranger has multimapped reads from STAR where one is exonic and one is non-exonic, it always considers the exonic one as truth and reports that one as being the single uniquely aligned read with a MAPQ score of 255. See here and here for more details.

Is that the correct interpretation of their documentation? If so, I think the input to velocyto.py in the possorted_genome_bam.bam file may be incorrect in two ways: it may omit intronic reads and instead return exonic reads—perhaps even if those were not the primary alignment! This seems especially problematic given the high levels of intronic reads from concordant and discordant priming using the 10x protocol that may get thrown away by cellranger's filtering procedure.

I quickly tried to quantify how frequent of an issue this was by looking at some 10x sample datasets. Of the reads that the possorted_genome_bam.bam labelled as having a MAPQ score of 255 and coming from exons (tag RE:A:E), almost 15% came from this MAPQ adjustment procedure (i.e. had the tag MM:i:1, and originally were multimapped with a MAPQ < 255). I haven't looked at the velocyto code carefully enough to see if you account for these situations, or gone back to realign/check these reads in more detail to see whether the alignment to the exonic or intronic region is more "correct", or evaluated how much the results would change if I removed these reads altogether (since maybe you only want to keep MAPQ scores with 255 to begin with).

I'm interested to hear what suggestions you have for this issue.

Best, David

gioelelm commented 6 years ago

Hi,

First of all, thank you for sharing your concerns. This was very useful since I hadn't had yet the opportunity to stop and reflect on all the implications of their pre-processing. I, maybe naively, thought that they wouldn't change the value of the NH tag (multiple mapping tag). If I understand correctly that this is what they are doing. It sounds a controversial thing to do, as it obfuscates the real output of the aligner.

I think that your concerns are well founded, in the sense that it would be preferable to run on a bam file that has not been modified in such a way. I think some adjustments should be done to address it to ensure higher accuracy of exonic and intronic estimation.

Anyways, consistently with the conservative spirit of the rest of the pipeline, what I would do to address this problem is just to exclude the multiple mapping from the counting. I wouldn't flag those mappings as intronic, as I expect that overinflating intronic read counts will introduce bigger errors velocity estimation than overinflating the exonic counts.

In that sense, your observation hints to the fact that, right now, the quantification of the exonic fraction might be inflated (upper bound ~15%) when compared to a more conservative estimate obtained by excluding multiple mappings.

My solution would be a fix to the source of cellranger pipeline to keep the NH flag to a value > 1. Then the pipeline should do its job correctly.

Best

Gioele

davidhbrann commented 6 years ago

Hi Gioele,

Thanks for following up, and I apologize if I didn't make myself clear above. In my understanding, rather than modifying the NH tag, cellranger only modifies the MAPQ score. When it makes the adjustment, which is done at the read level for multi-mapped reads if one is exonic, it switches the MAPQ score of the exonic alignment to MAPQ 255 and adds the MM:i:1 tag and changes the MAPQ scores for the other non-exonic alignments to 0.

Since it doesn't modify the NH tag, if you still filter by excluding reads with NH > 1, then you should exclude all these reads:

samtools view -q 255 -@ $SLURM_NPROCS neurons_900_possorted_genome_bam.bam | grep "MM:" | wc -l
5288191

samtools view -q 255 -@ $SLURM_NPROCS neurons_900_possorted_genome_bam.bam | grep "RE:A:E" | wc -l
34373614

5288191/34373614 * 100
15.38444866460652

samtools view -q 255 neurons_900_possorted_genome_bam.bam | grep -v "MM" | cut -f12 | sort | uniq -c
41794226 NH:i:1

samtools view -q 255 neurons_900_possorted_genome_bam.bam |  grep "MM" | cut -f12 | sort | uniq -c
3478977 NH:i:2
 977004 NH:i:3
 402723 NH:i:4
 158749 NH:i:5
 132911 NH:i:6
  85977 NH:i:7
  35022 NH:i:8
   9650 NH:i:9
   7178 NH:i:10

samtools view neurons_900_possorted_genome_bam.bam | grep "NH:i:1\b" | cut -f 5 | sort | uniq -c
41794226 255

Therefore all the reads that have the NH:i:1 tag are uniquely mapped, but not all the reads with a MAPQ of 255 are (even though in most bam files they should be). Since velocyto.py uses the NH tag, it looks like by default you're actually excluding all these "MAPQ-adjusted-unique reads", which I agree makes the most sense.

It may be true that some of these multimapped reads actually originated from introns. This is supported by the fact that I've often seen separate reads with the same cell barcode / UMI be uniquely mapped (without the MAPQ adjustment) to intronic locations, but this should still be captured after the NH tag filtering of the ambiguous multimapped reads.

The last thing I'll note is that the 10x bcfile used by velocyto.py contains a list of barcodes that are identified based on UMI counts that include these inflated number of uniquely-aligned transcripts, but I haven't thought or tested too carefully whether this procedure causes any biases in determining which cell barcodes to include. For example, cells whose reads have a higher percentage of multimapped reads may appear to have more exonic UMIs (coming either from double-counting of the same UMI to multiple genes or counting intronic UMIs as exonics) than they actually do.

Best, David