secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
107 stars 37 forks source link

separate read counts from multiple bam files from one individual/importance of pre-phasing? #25

Closed mce1 closed 4 years ago

mce1 commented 7 years ago

Dear Stephane,

With great interest I have read your recent publication in Genome Biology and seen phASER, which I would really like to use for my own work. So I was wondering if I could ask your advise on some points.

I'm trying to assess allele-specific gene expression in single-cell RNA-Seq data from human blood cells. From each individual, I have a fairly large number of single-cell transcriptomes (approximately 200-300). In addition, I have one exome per individual. My aim is to obtain the most accurate ASE counts possible from each individual single-cell transcriptome.

My strategy so far is:

1) Call variants from the exome. (I have used a bcbio pipeline for this [https://bcbio-nextgen.readthedocs.io/en/latest/contents/testing.html#exome-with-validation-against-reference-materials].)

2) Map the sc-RNA-Seq reads to the reference genome. (I have used STAR for the mapping and then WASP to eliminate mapping biases.)

3) Next, I would like to use phASER to obtain gene-level haplotypic read counts for each individual transcriptome using the variant file from step 1 and the bam file(s) generated in step 2.

In step 3, of course, I would like to use the reads from all the sc-transcriptomes of a given individual for the read backed phasing but obtain separate read counts for each individual single cell. Do I understand you correctly that this in principle can be achieved using --bam <list of all the bam files> --haplo_count_bam 1 to output haplotypic read counts for bam file 1 only, while using the information from all the bam files for phasing? If yes, is it possible to obtain separate read counts for multiple bam files from the same run (e.g. specifying --haplo_count_bam 1, 2, ...) or would I have to run the script repeatedly, once for each bam file?

I also meant to ask another question: In your tutorial you emphasize that the variant file should be pre-phased using a method like population phasing. I was wondering how important that is in the above situation (i.e. with a vcf that has been constructed from a single exome and with a really large number of RNA-Seq reads that can be used for read backed phasing). Do you think pre-phasing would still improve the quality of allelic counts here? I'm asking this because the various servers for phasing and imputation all require GRCh37-based coordinates, while I would prefer to work with hg38. (I have noticed that the bed files you provide are also based on hg19 but I suppose these could be lifted over to hg38 pretty easily?) Are you aware of any resources for population-based (pre-)phasing based on hg38 coordinates?

Would be great if you could have a look at this and give me some feedback.

Thanks a lot in advance,

Maik

secastel commented 7 years ago

Hi Maik, I'm glad that you have found our publications and software useful! I would say that your use case is spot on, you are thinking about everything the right way. I have a few tips for you though that will answer your questions.

First off, population pre-phasing isn't necessary, but it does help substantially. This is because it allows phASER to combine data across variants that are not read-back phased using the population phasing. However, in the case of exome-seq data it may actually not make a huge difference, since it is more beneficial for array or WGS genotype calls. You are correct in that all of the population phasing servers that I know of run on HG19 (GRCH37) and have not yet moved over to hg38. I haven't myself moved over either so I can't be of help there.

For running phASER, in your case you have two options.

Option 1) Population phase your input VCF. Then run phASER in two steps. In the first step combine all of the data you have for each individual (exome-seq reads + all of the scRNA-seq reads). It would probably be easier to merge all of the scRNA-seq BAMs into a single file using e.g. samtools merge rather than provide phASER with hundreds of files as input. What this run will do is to produce the best phasing possible using all of the data available. Next, take the resulting outputted VCF from that run and use that to run phASER individually for each of the scRNA-seq libraries. You can use the outputs per library to generate the gene level haplotype counts.

Option 2) For this option you do not need to population phase the VCF. As before, create a merged BAM of all of the scRNA-seq libraries and provide these with the exome-seq BAM as inputs to phASER. It's kind of inefficient, but run this for every scRNA-seq library, and provide a third BAM, which will be the single scRNA-seq library you would like to generate haplotype counts for. Make sure to set --haplo_count_bam to whichever bam is the single scRNA-seq library.

With respect to your question about obtaining separate read counts for each BAM input. Unfortunately, at this point I don't have any plans to implement such a feature, since it can be addressed in these specific use cases as I have written above, but it may be something I will consider in future.

I have been meaning to put together a tutorial on how to best use phASER for variant phasing, similar to the tutorial on using it to generate ASE data, that would go over this kind of use case, but have not yet had time to do so. I hope what I have written above makes sense. Please let me know.

mce1 commented 7 years ago

Dear Stephane,

thanks a lot for your comprehensive reply and apologies for being so slow getting back to you. What you say seems very clear to me. (The only point I am not sure I completely understand is why option 1 requires/benefits from population phasing while option 2 does not.)

In the meantime, I have given option 1 a try as it seems to be the more efficient approach as you said. In principle, all worked fine. However, I noticed that I was left with very few haplotypic counts in the end, maybe not even enough to carry on with my downstream analysis. While I am sure this primarily reflects a problem with the quality/depth of my sequencing libraries, I also meant to exclude any issues on the bioinformatics side.

To do so, I tried to assess the number of variants in my final vcf file that are covered by sequencing reads in a given single-cell bam file using bedtools intersect. With the example files below this gave me 280 sites:

$ bedtools intersect -a phaser_round1.vcf.gz -b 1098.wasp.bam | sort | uniq -c | wc -l

***** WARNING: File 1098.wasp.bam has inconsistent naming convention for record:
GL000192.1  408151  408245  SRR2972013.1067934.1/2  255 +

***** WARNING: File 1098.wasp.bam has inconsistent naming convention for record:
GL000192.1  408151  408245  SRR2972013.1067934.1/2  255 +

280

When I run phaser on the same bam file and vcf file, the output haplotypic counts file, however, contains only 120 lines: 10 of these haplotypes are based on two variants each, making 130 variants in total.

python $PathToPhaser/phaser/phaser.py --bam 1098.wasp.bam --paired_end 1 --mapq 255 --baseq 10 --vcf phaser_round1.vcf.gz --sample person-1299-exome --pass_only 0 --threads $NC --unphased_vars 1 --o test

I seem to observe such a difference (only sligthly less than half of the variants overlapped by sequencing reads according to bedtools intersect appear in the haplotypic counts file generated by phaser) consistently for all the bam files I have checked so far. As you may have noticed, I run phaser without any filters (specifically the mapq 255 does not make any difference as these are all uniquely mapped STAR reads to start with and reducing -baseq does not have any effect either). Do you have any idea why that could be and how I could increase the number of sites considered by phaser? (Unfortunately, the bam file is too large to be uploaded here, so I have attached the results of the two commands above instead: bedtools.test.txt, test.haplotypic_counts.txt.)

Your help would be much appreciated and thanks a lot again,

Cheers,

Maik

secastel commented 7 years ago

HI Maik, Barring MAPQ and BASEQ being the culprits here I can think of two other things in phASER off the top of my head that could be lowering read counts.

1) phASER removes duplicate reads by default. If you have marked duplicates then this may be causing the reduction, if you haven't marked duplicates then ignore this.

2) When run in paired end mode phASER requires that reads have the "read mapped in proper pair" flag set (2). To see how much of an effect this has manually, what you can try doing is filtering your bam file using e.g. "samtools view -bh -f 2 input.bam > input.filtered.bam" and then intersect this filtered bam with your VCF to estimate variant coverage. I suspect requiring this flag is what is causing the reduction in read count.

Please get back to me and let me know. We can then figure out how to move forward from there.

Stephane

secastel commented 7 years ago

Hi Maik, I'm not sure if you identified what was causing your reduced read counts versus what you expected, but I recently identified a bug, that in one circumstance would explain it. I don't think you were based on your posts, but if you happened to have been using the --haplo_count_blacklist argument there were problems associated with that. These problems have been fixed as of v0.9.9.4. Note that you will also need to download a new "hg19_haplo_count_blacklist.bed" file.

Stephane

mce1 commented 7 years ago

Hi Stephane,

thanks a lot for your mail.

I've been quite busy doing lots of other things but will be back to
this hopefully by the end of the week. I'll make sure to upgrade to
the latest version and let you know how it goes.

Cheers,

Maik

Zitat von Stephane Castel notifications@github.com:

Hi Maik, I'm not sure if you identified what was causing your reduced read
counts versus what you expected, but I recently identified a bug,
that in one circumstance would explain it. I don't think you were
based on your posts, but if you happened to have been using the
--haplo_count_blacklist argument there were problems associated with
that. These problems have been fixed as of v0.9.9.4. Note that you
will also need to download a new "hg19_haplo_count_blacklist.bed"
file.

Stephane

-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/secastel/phaser/issues/25#issuecomment-310201893

secastel commented 7 years ago

One other thing I was going to mention, the next update of phASER should improve generally the quality and accuracy of gene level quantifications of haplotypic expression, and I expect this to be out sometime within the next week. In addition, there will be a new feature that allows for haplotypic expression to outputted from multiple BAMs in a single run (which I think was a feature you were initially asking for in this thread). I think for your use case this will be very useful, as you can do a single run of phASER and output counts for each input BAM.