secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
105 stars 36 forks source link

Advice on phasging protocole with RNASeq only #60

Closed mariabernard closed 3 years ago

mariabernard commented 4 years ago

Dear Stephane,

I am working on a RNASeq project with multiple chicken populations and multiple tissues per population for a total of around 700 samples. Our wish is to do ASE analysis per population and tissu combination (POP_tissue).

I called variants at the POP_tissue level on my RNASeq data. But I easy can call variant at POP level. For now I run phaser.py and phaser_gene_ae.py command for each sample individually (if samples is sequences multiple times, I merged the bam, so I have one sample = one bam), and then I want to launch the phaser_expr_matrix.py.

My understand is that this is only possible if my input VCF file is phased, but is this not partially the goal of phaser ? Wat choices do I have to use phaser in this context ? Would it be better to :

I am not sure to well understand how everything interact and what is the impact of doing it on unphased input VCF file.

Kind regards

Maria

fredJehl commented 3 years ago

As a follow up to Maria's question (we are in the same research team), we would like to know what exactly means "phased VCF" :

We are currently working on a small chicken population (< 20 individuals), without DNA-seq data. The SNP calling was done using RNA-seq data, and we owuld like to do the ASE analyssi with these same RNA-seq data.

So in this context (small population, only RNA-seq data) our questions are :

  1. can we use phaser in this contextto obtain reliable haplotypes computed for each individuals ? Do we agree that these haplotypes come from the overlap of the reads ?
  2. can we use phaser_gene_ae to gene-level ASE values in these indivuals ? 2.1. by the way, when a gene has several haplotypes in an indiviudal, how do you chose the one used for the counts ? (It appears not to be the longest nor the one with the most variants in it. We observed in our data that in ~90% of the cases it is the one with the most total counts)
  3. even better, can we use phaser_pop to obtain gene-level ASE values in this population ?

Many thanks for your time,

Frédéric

secastel commented 3 years ago

Hi Both, Sorry for the delayed response, and hope I can be of help.

There were quite a few questions, so I will try to simplify rather than answer all of them. If I understand correctly, you are starting with RNA-seq data only, across multiple tissues, for multiple individuals? You would like to get gene-level ASE values for each individual?

To start, you will need to perform genotype calling from the RNA-seq reads with GATK or an equivalent software. phaser does not genotype calling. Next, per individual, run run phaser with the bam from each tissue as an input, e.g. phaser.py --bam tissue1.bam,tissue2.bam,tissue3.bam,etc... Finally, run phaser_gene_ae using the output from phaser. At this point you will have an estimate of gene-level ASE for each individual.

If you want to combine these estimates into a single file across all individuals you can use phaser_expr_matrix.py. Note, this will not change any of the estimates, it will simply combine them into a single easy to use file where each row is a gene, and each column are samples (individual * tissue).

A few additional points. Phaser works best if you can perform population phasing on your genotype data before running phaser. This provides a genome-wide phase scaffold from linkage-disequilibrium that phaser can improve upon using the information contained within sequencing reads. Starting with a genome-wide phased VCF will also increase the amount of gene-level ASE data because read counts from different haplotype blocks can be combined. You are correct in your observation that without this pre-phasing, phaser will report the read counts from the haplotype block with the highest coverage, since it can't combine data across blocks that are not phased with respect to one another.

I think overall, you are on the right track. Please let me know if you have any other specific questions.

Best, Stephane

mariabernard commented 3 years ago

Hi Stephane,

Thanks for your reply, it confirmed what we have in mind.

Regards

Maria