Closed mariabernard closed 3 years ago
As a follow up to Maria's question (we are in the same research team), we would like to know what exactly means "phased VCF" :
a VCF outputed from GATK, with small phases (most likely SNPs from a same read), or
a VCF from other tools used by quantitative geneticists (e.g. FImpute).
We are currently working on a small chicken population (< 20 individuals), without DNA-seq data. The SNP calling was done using RNA-seq data, and we owuld like to do the ASE analyssi with these same RNA-seq data.
So in this context (small population, only RNA-seq data) our questions are :
Many thanks for your time,
Frédéric
Hi Both, Sorry for the delayed response, and hope I can be of help.
There were quite a few questions, so I will try to simplify rather than answer all of them. If I understand correctly, you are starting with RNA-seq data only, across multiple tissues, for multiple individuals? You would like to get gene-level ASE values for each individual?
To start, you will need to perform genotype calling from the RNA-seq reads with GATK or an equivalent software. phaser does not genotype calling. Next, per individual, run run phaser with the bam from each tissue as an input, e.g. phaser.py --bam tissue1.bam,tissue2.bam,tissue3.bam,etc... Finally, run phaser_gene_ae using the output from phaser. At this point you will have an estimate of gene-level ASE for each individual.
If you want to combine these estimates into a single file across all individuals you can use phaser_expr_matrix.py. Note, this will not change any of the estimates, it will simply combine them into a single easy to use file where each row is a gene, and each column are samples (individual * tissue).
A few additional points. Phaser works best if you can perform population phasing on your genotype data before running phaser. This provides a genome-wide phase scaffold from linkage-disequilibrium that phaser can improve upon using the information contained within sequencing reads. Starting with a genome-wide phased VCF will also increase the amount of gene-level ASE data because read counts from different haplotype blocks can be combined. You are correct in your observation that without this pre-phasing, phaser will report the read counts from the haplotype block with the highest coverage, since it can't combine data across blocks that are not phased with respect to one another.
I think overall, you are on the right track. Please let me know if you have any other specific questions.
Best, Stephane
Hi Stephane,
Thanks for your reply, it confirmed what we have in mind.
Regards
Maria
Dear Stephane,
I am working on a RNASeq project with multiple chicken populations and multiple tissues per population for a total of around 700 samples. Our wish is to do ASE analysis per population and tissu combination (POP_tissue).
I called variants at the POP_tissue level on my RNASeq data. But I easy can call variant at POP level. For now I run phaser.py and phaser_gene_ae.py command for each sample individually (if samples is sequences multiple times, I merged the bam, so I have one sample = one bam), and then I want to launch the phaser_expr_matrix.py.
My understand is that this is only possible if my input VCF file is phased, but is this not partially the goal of phaser ? Wat choices do I have to use phaser in this context ? Would it be better to :
I am not sure to well understand how everything interact and what is the impact of doing it on unphased input VCF file.
Kind regards
Maria