natsuhiko / rasqual

Robust Allele Specific Quantification and quality controL

Using -m and -l #13

Closed gouthamatla closed 6 years ago

gouthamatla commented 6 years ago

Dear Natsuhiko,

I would like to get a clarification on using -m and -l for RNA-Seq data.

I am primarily interested in allele-specific gene expression, so I am planning to use the SNPs within exons with the "--as-only" option so that I end up with genes showing allele-specific expression. Is this the right approach?

In a hypothetical scenario, there could be 200 SNPs that span the genomic space of the gene (exons + introns), but only 10 SNPs might overlap the exons. In this case, should I specify -m 10 -l 200?

Also, sometimes I get "-nan" in column 12 (Squared correlation between prior and posterior genotypes (fSNPs)). What does this mean?

Thanks, Goutham A

natsuhiko commented 6 years ago

Dear Goutham,

Sorry for the late reply. I'm on holiday this month.

Could you explain what “allele-specific gene expression” means specifically? RASQUAL can be used to map eQTLs using the allele-specific signal, but it is not able to estimate allelic imbalance from expression data in general.

The options -l and -m are both used to allocate memory. Specifying a number smaller than the number of SNPs in the cis-window (-l) or in the feature region(s) (-m) causes a memory overflow. If you would rather not work out the exact numbers, you can specify a sufficiently large number (e.g., 1000 in your example) for both -m and -l.

RASQUAL estimates genotypes from the sequencing data. Columns 24 and 25 capture the discrepancy between the input (prior) genotypes and the estimated (posterior) genotypes.
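For anyone who prefers exact values over a generous upper bound, the two counts can be worked out from SNP positions before running RASQUAL. The following is only a minimal sketch and not part of RASQUAL: the function names, the 1-based inclusive coordinates, and the example positions are all assumptions made for illustration, and the feature regions (exons) are assumed not to overlap one another.

```python
from bisect import bisect_left, bisect_right


def count_snps(sorted_positions, start, end):
    """Count SNP positions inside [start, end] (1-based, inclusive)."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


def rasqual_snp_counts(snp_positions, cis_window, feature_regions):
    """Return (n_cis, n_feature): candidate values for -l and -m.

    snp_positions   -- SNP coordinates on one chromosome (hypothetical input)
    cis_window      -- (start, end) of the cis-window
    feature_regions -- list of (start, end) exons, assumed non-overlapping
    """
    positions = sorted(snp_positions)
    n_cis = count_snps(positions, cis_window[0], cis_window[1])
    n_feature = sum(count_snps(positions, s, e) for s, e in feature_regions)
    return n_cis, n_feature


# Hypothetical example: six SNPs, five of them in the cis-window,
# three of those inside the two exons.
snps = [100, 150, 200, 250, 300, 1000]
n_cis, n_feature = rasqual_snp_counts(snps, (1, 500), [(90, 160), (290, 310)])
print(f"-l {n_cis} -m {n_feature}")
```

Since any value below the true count causes an overflow, rounding these numbers up is the safer choice.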

Best regards, Natsuhiko

gouthamatla commented 6 years ago

Thanks for responding.

Could you explain what “allele-specific gene expression” means specifically? RASQUAL can be used to map eQTLs using the allele-specific signal, but it is not able to estimate allelic imbalance from expression data in general.

I thought that, with the "--as-only" option, the association test is done using only allelic counts (allelic ratios), and that if I restrict my SNPs to fSNPs, I would end up with fSNPs that show allelic imbalance. Sorry if I am mistaken.

natsuhiko commented 6 years ago

I think you can introduce into the VCF file a dummy rSNP at which all individuals are heterozygous (if you have multiple samples) and test whether there is an expression difference between the two alleles at the rSNP linked to fSNPs in coding regions. However, I'm not sure how well this works for (1) estimating the over-dispersion and (2) controlling the P-value distribution under the null hypothesis. I would recommend testing internally whether RASQUAL can be used to analyse allele-specific differential expression in general.
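As a rough illustration of the dummy rSNP idea, a VCF record in which every sample carries a phased heterozygous genotype could be generated as below. This is only a sketch under stated assumptions: the function name, SNP ID, alleles, and coordinates are hypothetical, and the record carries a GT field only. Any extra per-sample fields that the other records in your RASQUAL input VCF carry (such as allele-specific counts) would have to be appended to match.

```python
def dummy_het_rsnp_line(chrom, pos, n_samples, snp_id="dummy_rsnp", ref="A", alt="G"):
    """Build one tab-separated VCF record with all samples phased heterozygous.

    All argument values are placeholders chosen for illustration.
    """
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
    fixed = [chrom, str(pos), snp_id, ref, alt, ".", "PASS", ".", "GT"]
    genotypes = ["0|1"] * n_samples  # every individual heterozygous
    return "\t".join(fixed + genotypes)


# Hypothetical usage: a dummy rSNP for three samples.
print(dummy_het_rsnp_line("chr1", 12345, n_samples=3))
```

The record would need to be placed in coordinate order within the cis-window of the gene being tested, and the caveats above about over-dispersion and P-value calibration still apply.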

gouthamatla commented 6 years ago

Thanks, I will explore in that direction.