Advice on phasging protocole with RNASeq only

mariabernard commented 4 years ago

Dear Stephane,

I am working on a RNASeq project with multiple chicken populations and multiple tissues per population for a total of around 700 samples. Our wish is to do ASE analysis per population and tissu combination (POP_tissue).

I called variants at the POP_tissue level on my RNASeq data. But I easy can call variant at POP level. For now I run phaser.py and phaser_gene_ae.py command for each sample individually (if samples is sequences multiple times, I merged the bam, so I have one sample = one bam), and then I want to launch the phaser_expr_matrix.py.

My understand is that this is only possible if my input VCF file is phased, but is this not partially the goal of phaser ? Wat choices do I have to use phaser in this context ? Would it be better to :

run phaser 1 time with all the tissus corresponding to the same individuals (so on one VCF called at the POP level). Will this VCF file contained all my tissue samples?
and use this (tissue ? ) phased VCF file to generate the allelic_count, then to run gene_ae.py for each sample individually and finally pop_expr_matrix on all sample from the same tissue?

I am not sure to well understand how everything interact and what is the impact of doing it on unphased input VCF file.

Kind regards

Maria

fredJehl commented 3 years ago

As a follow up to Maria's question (we are in the same research team), we would like to know what exactly means "phased VCF" :

a VCF outputed from GATK, with small phases (most likely SNPs from a same read), or
a VCF from other tools used by quantitative geneticists (e.g. FImpute).

We are currently working on a small chicken population (< 20 individuals), without DNA-seq data. The SNP calling was done using RNA-seq data, and we owuld like to do the ASE analyssi with these same RNA-seq data.

So in this context (small population, only RNA-seq data) our questions are :

can we use phaser in this contextto obtain reliable haplotypes computed for each individuals ? Do we agree that these haplotypes come from the overlap of the reads ?
can we use phaser_gene_ae to gene-level ASE values in these indivuals ? 2.1. by the way, when a gene has several haplotypes in an indiviudal, how do you chose the one used for the counts ? (It appears not to be the longest nor the one with the most variants in it. We observed in our data that in ~90% of the cases it is the one with the most total counts)
even better, can we use phaser_pop to obtain gene-level ASE values in this population ?

Many thanks for your time,

Frédéric

secastel commented 3 years ago

Hi Both, Sorry for the delayed response, and hope I can be of help.

There were quite a few questions, so I will try to simplify rather than answer all of them. If I understand correctly, you are starting with RNA-seq data only, across multiple tissues, for multiple individuals? You would like to get gene-level ASE values for each individual?

To start, you will need to perform genotype calling from the RNA-seq reads with GATK or an equivalent software. phaser does not genotype calling. Next, per individual, run run phaser with the bam from each tissue as an input, e.g. phaser.py --bam tissue1.bam,tissue2.bam,tissue3.bam,etc... Finally, run phaser_gene_ae using the output from phaser. At this point you will have an estimate of gene-level ASE for each individual.

If you want to combine these estimates into a single file across all individuals you can use phaser_expr_matrix.py. Note, this will not change any of the estimates, it will simply combine them into a single easy to use file where each row is a gene, and each column are samples (individual * tissue).

A few additional points. Phaser works best if you can perform population phasing on your genotype data before running phaser. This provides a genome-wide phase scaffold from linkage-disequilibrium that phaser can improve upon using the information contained within sequencing reads. Starting with a genome-wide phased VCF will also increase the amount of gene-level ASE data because read counts from different haplotype blocks can be combined. You are correct in your observation that without this pre-phasing, phaser will report the read counts from the haplotype block with the highest coverage, since it can't combine data across blocks that are not phased with respect to one another.

I think overall, you are on the right track. Please let me know if you have any other specific questions.

Best, Stephane

mariabernard commented 3 years ago

Hi Stephane,

Thanks for your reply, it confirmed what we have in mind.

Regards

Maria

secastel / phaser

Advice on phasging protocole with RNASeq only #60