secastel / phaser

phasing and Allele Specific Expression from RNA-seq
GNU General Public License v3.0
107 stars 37 forks source link

Recommendations for running phaser with just the rnaseq data #82

Closed aswathyseb closed 3 months ago

aswathyseb commented 3 months ago

Hi Stephane,

I am very new to ASE analysis and having looked around the tools that I could use, I liked phASER, especially the feature that it can nicely summarize the data at the gene level.

I have RNASeq data from two maize genomes and their F1 hybrids in two conditions-control and treatment. I do not have a genome-wide phased VCF file. Total 100 samples. Following your tutorial, so far what I have done is

  1. called variants from RNAseq data in each sample using bcftools
  2. Ran phaser to obtain gene-level haplotype counts.

However, I noticed that for a gene, the gwPhased column is given as 0 in the control sample and 1 in the treatment sample.

Do you have any recommendations for using the phaser to phase the variants and obtaining the gene-level counts?

Eventually, I would like to compare the log2AFC between the conditions.

Please let me know.

Thanks,

secastel commented 3 months ago

Hi there, glad you are interested in using phASER. In your case, because you are not starting with genome-wide phased genotype calls, I believe that neither sample should have 1 for the gwPhased column. You should check that the input VCF in both cases is not phased (GT calls are e.g. 0/1 not 0|1). Is it consistent across all the genes for each sample?

aswathyseb commented 3 months ago

Thank you Stephane for the quick reply. I checked and my input vcfs are not phased. But looking into the outputs I realized gwPhased column is 1 only when the haplotype counts are 0. So my question is, are the log2FC value for geneA in control sample and treatment sample are comparable? If not, what is your recommendation to get comparable counts since I do not have phase information?

Because the parental maize genomes are inbred lines and I have F1 hybrids. Also looking F1 hybrid bam file in IGV, most variants appear to be in phase. So is it okay to assume 0|1 ?

secastel commented 3 months ago

I'm not entirely sure of your research question, but this is a non-trivial analysis to do. It is not okay to assume 0|1. You would need to assign each haplotype in the F1s to one of the two parental lines, and you'd need to make sure it's consistent in each F1 you want to compare (e.g. haplotype A = parent 1 and haplotype B = parent 2). This would require phasing using the parental genotype data, which is beyond the scope of phASER. There are other tools for this (e.g. GATK). If you are able to do phasing using this approach, and then make sure the haplotypes are assigned the same way in the samples you want to compare, you could then provide the phased VCF to phASER as input, and you could compare the aFCs. Unfortunately I can't provide guidance beyond that, but best of luck!