thelovelab / fishpond

Differential expression and allelic analysis, nonparametric statistics
https://thelovelab.github.io/fishpond

Extend to ordinal conditions #25

Open JosephLalli opened 2 years ago

JosephLalli commented 2 years ago

Briefly looking through the code here, it seems like the basic algorithm (correlation + bootstrapping) is extendable to more than two conditions. This would in effect turn your DE software into a potential engine for uncertainty-aware eQTL/sQTL analyses.

I’m sure I’ve missed some major hurdle - as this work is at the far end of my bioinformatics knowledge - but if not, is this a feature you are considering adding?

mikelove commented 2 years ago

Great question.

We’ve recently been implementing and then assessing correlation tests including Pearson and Spearman. I believe these would basically give the detection power of ordinal testing.
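As a rough illustration of how a correlation statistic could be combined with inferential replicates, the sketch below computes a Spearman correlation between expression and an ordinal condition on each replicate, averages it across replicates, and builds a permutation null by shuffling the condition labels. This is not fishpond's implementation: the function names (`mean_replicate_spearman`, `permutation_pvalue`), the toy data, the averaging step, and the two-sided permutation p-value are all assumptions made for the example.

```python
import random

def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant vector
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def mean_replicate_spearman(replicates, condition):
    """Average the Spearman correlation over inferential replicates (assumed design)."""
    return sum(spearman(rep, condition) for rep in replicates) / len(replicates)

def permutation_pvalue(replicates, condition, n_perm=1000, seed=0):
    """Two-sided permutation p-value from shuffling the condition labels."""
    rng = random.Random(seed)
    observed = mean_replicate_spearman(replicates, condition)
    perm = list(condition)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if abs(mean_replicate_spearman(replicates, perm)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (invented): 9 samples, ordinal condition 0/1/2, 3 inferential replicates
condition = [0, 0, 0, 1, 1, 1, 2, 2, 2]
replicates = [
    [1.0, 1.2, 0.9, 2.1, 2.4, 1.9, 3.2, 2.9, 3.5],
    [1.1, 1.0, 1.3, 2.0, 2.2, 2.3, 3.0, 3.3, 3.1],
    [0.8, 1.4, 1.0, 2.5, 1.8, 2.2, 3.4, 3.1, 2.8],
]
observed, pvalue = permutation_pvalue(replicates, condition)
print(observed, pvalue)
```

Because the statistic only uses ranks, nothing in this sketch depends on the condition having exactly two levels, which is the point being made about ordinal testing.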

I like the idea of including this in an eQTL or sQTL framework. We’d have to allow for changing covariates per row of the matrix, but it’s not out of the realm of possibility… What does your genotype matrix look like?

JosephLalli commented 2 years ago

The eQTL tools I work with will take a variety of inputs, but all of them convert the genotypes to a data frame or array of 0/1/2 values (g x n), where n = # of samples, and g = number of SNPs within 1e6 bp of the transcriptional start site of the gene of interest.
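A minimal sketch of that input layout, assuming plain Python lists: `dosages` is a g x n matrix of 0/1/2 values (read here as copies of the alternate allele, which is the usual convention but not stated above), and only SNPs within 1e6 bp of the TSS are kept. The function name `cis_genotype_matrix`, the inclusive window boundary, and the toy positions are illustrative choices, not taken from any of the tools mentioned.

```python
CIS_WINDOW_BP = 1_000_000  # window around the transcriptional start site

def cis_genotype_matrix(snp_positions, dosages, tss, window=CIS_WINDOW_BP):
    """Keep the rows (SNPs) whose position lies within `window` bp of `tss`.

    snp_positions: g base-pair positions on the gene's chromosome
    dosages: g x n list of lists with entries 0, 1, or 2
    """
    keep = [i for i, pos in enumerate(snp_positions) if abs(pos - tss) <= window]
    return [snp_positions[i] for i in keep], [dosages[i] for i in keep]

# Toy data (invented): 5 SNPs x 4 samples, gene TSS at 5 Mb
tss = 5_000_000
snp_positions = [3_900_000, 4_200_000, 5_000_500, 5_999_999, 6_000_001]
dosages = [
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [0, 0, 1, 1],
    [2, 1, 0, 0],
    [0, 2, 2, 1],
]
kept_positions, kept_dosages = cis_genotype_matrix(snp_positions, dosages, tss)
print(kept_positions)
```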

The basic algorithm employed by matrixQTL (R), fastQTL (R + multiple-comparison correction via permutation), and tensorQTL (Python, correction, GPU-based):

It seems to me like fishpond's methods would improve the accuracy of the regression step, and the rest of it is just housekeeping. I know that I've struggled with false positive associations that are due to outlier/extreme samples skewing my data, and fishpond seems well suited to addressing that problem.
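To make the outlier concern concrete, here is a small toy comparison of a Pearson correlation and a rank-based (Spearman) correlation on invented data in which a single extreme sample in the genotype-2 group drives the apparent association. It only illustrates why rank-based statistics are less sensitive to one extreme value; it is not a reproduction of fishpond's test, and the helper names and data are made up for the example.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Mean rank for ties: average of the first and last 1-based sorted positions."""
    s = sorted(values)
    return [(s.index(v) + 1 + len(s) - s[::-1].index(v)) / 2 for v in values]

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

# Invented example: expression is flat across genotypes except one extreme sample
genotype = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
expression = [5.1, 4.8, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 4.6, 50.0]

r_pearson = pearson(genotype, expression)
r_spearman = spearman(genotype, expression)
print(round(r_pearson, 2), round(r_spearman, 2))
```

On this toy data the Pearson correlation is moderately large while the Spearman correlation stays close to zero, since the extreme value contributes only its rank.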

The thing I'm unsure about is how easy it would be to adapt a method designed to perform a phe ~ 0/1 significance analysis to phe ~ 0/1/2 or phe ~ continuous analysis.

mikelove commented 2 years ago

We have already tested phenotype ~ integer and phenotype ~ continuous designs quite a lot, with time series and pseudotime analyses respectively. Things look good in simulation and the real data results look nice as well 😃

Lemme loop back here next week for more thoughts

mikelove commented 1 year ago

hi @JosephLalli

I wanted to return to this. We've been thinking a lot in the lab about different aspects of modeling QTL. We've focused on distributional questions lately, and less on uncertainty in quantification.

I think the fishpond framework is strong, but when you want to add in a lot of covariates (as we often need PCs, factors of unwanted expression variation etc.), the non-parametric framework starts to be less useful.

Happy to chat sometime, but I think we won't be extending fishpond in this direction; instead we'll be focusing on other modeling aspects in the future.

JosephLalli commented 1 year ago

Your timing is uncanny @mikelove - I've also been coming back around this idea. I'd be curious to talk more here vs email about the strengths and drawbacks of using non-parametric methods of calculating mean & var for genotype-phenotype associations.

My intuition was that using bootstraps would help address problems I've been encountering with reference bias* and outlier expression values creating false positive results (especially if looking at differential isoform usage). If your group has encountered difficulties applying this method, I'd love to talk more about it.

*Reference bias issues & associated high expression rates of pseudogenes have been a big problem for my dataset. I'm also experimenting with using a modified version of the SEESAW/g2g tools pipeline to address this problem.

mikelove commented 1 year ago

Let’s chat on zoom as I think there’s a lot to discuss, I’m at ENAR until Wednesday. What’s a good email to reach out?

JosephLalli commented 1 year ago

Hi @mikelove, you can reach me at Lalli at wisc dot edu.