jmonlong / sQTLseekeR

R package to detect splicing QTLs (sQTLs)
http://big.crg.cat/computational_biology_of_rna_processing/sqtlseeker

Taking into account covariates when searching for sQTLs #6

Open vsvinti opened 6 years ago

vsvinti commented 6 years ago

Hi there

I am wondering whether it is possible to take covariates into account with sQTLseekeR. You mention that the raw counts shouldn't be transformed in any way. Many datasets, however, have underlying structure caused by batch effects, etc., which we may want to correct for so that it doesn't influence the results.

If this functionality is not available, how do you suggest that one takes this into account? It is possible to generate a residuals matrix with PEER (which can account for covariates and other hidden structure). Would that be suitable as input to sQTLseekeR?

Can you please also comment on how other ways of transforming the data, such as between-sample normalisation and transcript length correction (not necessary for eQTLs), would affect the computations? I thought that raw counts shouldn't be compared directly between samples.

jmonlong commented 6 years ago

Hi,

All good questions and suggestions! I copied your question to the guigolab/sQTLseekeR repo. It's the most up-to-date repo and, more importantly, people in the lab are currently working on this exact question. Please follow and respond to the issue there.

I'm not involved in these developments, but I think a new version of sQTLseekeR that supports including covariates in the model will be released soon.

The reason we recommend using raw counts as input is that they are converted into transcript usage ratios. I believe these ratios would be less affected by batch effects, although some effect is always possible. If the transcript expression is normalized, the transcript ratios wouldn't represent relative usage anymore. It might still be possible to detect differential usage, but we might need to use a different distance computation. Including covariates directly in the model would be the ideal solution, I think. It would also be very useful for controlling for ethnicity or admixture.
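
To make the point about ratios concrete, here is a minimal sketch. It assumes a usage ratio is simply a transcript's expression divided by the total over all transcripts of the same gene in that sample. The function name `usage_ratios`, the counts and the transcript lengths are all made up for illustration; this is not the sQTLseekeR code. It shows that a single per-sample scaling factor cancels out of the ratios, while a per-transcript transformation such as length correction changes them.

```python
def usage_ratios(expr):
    """Relative usage of each transcript of one gene in one sample.

    Assumption: ratio = transcript expression / gene total (hypothetical helper).
    """
    total = sum(expr)
    if total == 0:
        # Gene not expressed in this sample: ratios are undefined.
        return [float("nan")] * len(expr)
    return [e / total for e in expr]


# Hypothetical expression of three transcripts of one gene in one sample.
raw = [30.0, 60.0, 10.0]

# Between-sample normalisation: every transcript in the sample is multiplied
# by the same factor (0.5 here, an arbitrary made-up size factor).
scaled = [e * 0.5 for e in raw]

# Length correction: each transcript is divided by its own length
# (made-up lengths in kb), so the factor differs between transcripts.
lengths_kb = [1.0, 3.0, 0.5]
length_corrected = [e / l for e, l in zip(raw, lengths_kb)]

r_raw = usage_ratios(raw)
r_scaled = usage_ratios(scaled)
r_len = usage_ratios(length_corrected)

print("raw:             ", [round(r, 3) for r in r_raw])
print("sample-scaled:   ", [round(r, 3) for r in r_scaled])
print("length-corrected:", [round(r, 3) for r in r_len])
```

In this toy case the sample-scaled ratios match the raw ones, whereas the length-corrected ratios still sum to one but are distributed differently across the transcripts.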