secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
111 stars 36 forks source link

Summed DNA and RNA counts in allelic_counts output file? #34

Closed freerkvandijk closed 5 years ago

freerkvandijk commented 7 years ago

Dear Stephane,

I used the phASER software using two input BAM files, one containing RNA-seq data, the other containing Whole Genome Sequencing DNA data. I used the --haplo_count_bam_exclude option to exclude the WGS BAM data from my output data.

The number of reads overlapping for example chr1, position 100547994, in my both BAM files are:

#WGS BAM
$ samtools mpileup -r 1:100547994-100547994 DNA/AC1JV9ACXX-2-20.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1   100547994   N   17  cctcTcTTCcTctCtTT   ;;D;E<EDC;D9DCDEE

#RNA BAM
samtools mpileup -r 1:100547994-100547994 RNA/AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1   100547994   N   23  >t$CTCtCtTTcCtCcccccctt^~t  D@IIJCGADGDGJHI@FJJGIIH

I noticed that the *.allelic_counts file contains the summed counts from both RNA and WGS data:

grep 1_100547994_T_C allelic_counts/AC1JV9ACXX-2-20.chr1.allelic_counts.txt 
1   100547994   1_100547994_T_C T   C   19  20  39

While in the *.haplotypic_counts data just 22 reads for this same SNP are used:

grep 1_100547994_T_C haplotypic_counts/AC1JV9ACXX-2-20.chr1.haplotypic_counts.txt 
1   100547994   100547994   1_100547994_T_C 1       0   T   C   10  12  22  0|1 1   0.394   AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded

Is this intended behaviour for *.allelic_counts output when using a BAM file containing DNA data and the --halpo_count_bam_exclude option?

Regards,

Freerk

secastel commented 7 years ago

Hi Freerk, Thanks for the very through explanation of your issue. Your assessment is correct - the *.allelic_counts file does include counts from all libraries that were used as inputs, even those that were excluded using the --haplo_count_bam_exclude option. As of now this is the intended behavior for the tool - although I can see why one would also want to exclude any reads in the haplo_count_bam_exclude bams, however in it's current implementation this would be difficult. If you are just looking for basic allelic counts, I would suggest trying out allelecounter, since this doesn't require any phasing, and is a much simpler problem. Allelecounter is very fast, and easy to run.

Also, not sure what version of phASER you are running, but I'd suggest updating to the latest release (v1.0.0) if you haven't as there are quite a few improvements in haplotype level ASE data generation.

Stephane

iranmdl commented 6 years ago

Hello! I was reading this issue and it is not clear to me what is going on. I have a VCF file and I want to phase it using both DNA and RNA-Seq data (2 bam files). Once it is phased, I would like to know the ASE using only the RNA-Seq bam file. Is this possible?

Another question is, is it possible to convert read counts to TPM? Or this does not make any sense at all? (I'm a beginner in ASE, sorry about that :) )

Thank you in advance,

secastel commented 6 years ago

To generate only ASE counts using the RNA-seq bam file use the argument: "--haplo_count_bam_exclude N" where N is the index of the provided bam you want to exclude from ASE counts.

for example: "phaser --bam dna.bam,rna.bam --haplo_count_bam_exclude 1"

RE ratios - the values do not need to be normalized by depth because they are ratios of ref/alt counts. If you are new to ASE analysis I suggest you read this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0762-6

iranmdl commented 6 years ago

Hello again, Thank you for your fast and comprehensive answer. Just one more "easy" (probably not) fast question. If I have a set of somatic variants and I want to calculate mutant transcript specific expression, can I do it using phaser?

secastel commented 6 years ago

In your case you might be better to look at the expression of individual somatic variants rather than an approach which tries to calculate transcript level allelic expression (as phASER does). In this case you would measure ASE across all variants, and then look specifically at somatic variants to test whatever hypothesis you're interested in. You can use a tool like the GATK ASEReadCounter (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_rnaseq_ASEReadCounter.php) or my much more basic allelecounter (https://github.com/secastel/allelecounter). One very strong word of caution though, when analyzing somatic variants, there are often copy number changes, which can cause false positive signals of allelic imbalance, and need to be controlled for.