karltayeb / cafeh

MIT License
11 stars 0 forks source link

issues running CAFEH : nonsensical results. #2

Open mrotival opened 2 years ago

mrotival commented 2 years ago

Hello,

I'm trying to apply CAFEH to detect multi tissue eQTL across ~15 cell states and I'm encountering several issues at the moment.

1/ First, CAFEH seems to multithread without warning the user, which in my case can be solved by setting the following environment variables:


```
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
```

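Equivalently, the same limits can be set from within Python, as long as it happens before numpy is imported (a sketch; the `threadpoolctl` package offers more robust runtime control):

```python
# Cap BLAS/OpenMP thread pools; these variables are only read when
# numpy first loads its BLAS backend, so set them before the import.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the limits are in place
```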
2/ Second, when running CAFEH on permuted data, I find many genes for which one or more components have p_active > 0.99 (and even PIP > 0.99 in many instances), which suggests that it is not specific.

3/ For genes where a strong eQTL can be detected unambiguously by standard mapping approaches (P < 10^-30), CAFEH does NOT include the peak eQTL SNP in its 95% credible set, despite detecting a single component in some of the tissues where the eQTL is active.

I suspect that I must be doing something wrong (or the code is malfunctioning?). I'd be very happy to discuss this in order to find out what went wrong in my analyses.

Toy data to reproduce these issues can be found here.

karltayeb commented 2 years ago

Hi @mrotival

  1. I'm not sure why that might be happening; virtually everything is implemented with basic linear algebra operations in numpy. Is it the default behavior of numpy to multithread?
  2. Having access to the example you ran would be useful for me to troubleshoot. Unfortunately, I can't seem to access the toy data you provided.
  3. Are you using individual-level data or summary stats here? This type of behavior might arise from using summary stats with a mismatched reference LD matrix.
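For the summary-stats case, a quick way to probe point 3 is to compare in-sample LD against the reference panel. This is a hypothetical diagnostic, not part of CAFEH; `ld_mismatch` is an illustrative name:

```python
import numpy as np

def ld_mismatch(genotypes, reference_ld):
    """Compare in-sample LD (SNP correlation) with a reference LD matrix.

    genotypes:    (n_samples, n_snps) dosage matrix
    reference_ld: (n_snps, n_snps) correlation matrix from a reference panel

    Returns the maximum absolute elementwise difference; large values
    suggest the reference panel does not match the study population.
    """
    sample_ld = np.corrcoef(genotypes, rowvar=False)
    return float(np.abs(sample_ld - reference_ld).max())
```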
mrotival commented 2 years ago

Dear Karl.

  1. Not sure, but it might be, as using multiple cores is an easy way to speed up computation. Here it was only a problem because it used the number of cores available on the node, rather than what was available in the cluster allocation; on a local computer this would not have been an issue. In any case, this can be solved pretty easily by setting the environment variables adequately.
  2. I'll try to send the data again ASAP.
  3. Everything I've done was with individual-level data.

Best, Maxime


mrotival commented 2 years ago

Here is a WeTransfer link with the unzipped toy data:

https://we.tl/t-tJaWFjVjKd


karltayeb commented 2 years ago

Hi @mrotival

I was taking a look at your example and a few things come to mind.

First, I would recommend normalizing your expression (mean 0, variance 1) before fitting CAFEH. The default initialization of the priors was chosen to make sense in this setting. The variance of expression seems to vary quite a bit across cell types, so this may be causing some trouble. I'll update the documentation and provide a warning at runtime to encourage users to normalize their inputs; I'm considering making this the default behavior.
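The normalization suggested here is just per-study centering and scaling. A minimal numpy sketch (`standardize_expression` is an illustrative name, not a CAFEH function):

```python
import numpy as np

def standardize_expression(Y):
    """Center and scale each study's expression to mean 0, variance 1.

    Y: (n_studies, n_samples) expression matrix; each row is one
    cell type / study.
    """
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean(axis=1, keepdims=True)
    sd = Y.std(axis=1, keepdims=True)
    return (Y - mu) / sd
```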

With that in mind, I permuted the sample labels of y 50 times, fit CAFEH, and recorded the number of components CAFEH detected (p_active > 0.95 in at least one study). With normalized inputs, CAFEH does not seem to perform as poorly on the simulations, but it's still not perfect.
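The permutation check described above can be sketched generically. Here `fit_model` is purely a placeholder for whatever returns CAFEH's p_active matrix (shape: components × studies); it is not a real CAFEH call:

```python
import numpy as np

def count_permutation_components(y, fit_model, n_perm=50,
                                 threshold=0.95, seed=0):
    """Permute sample labels of y, refit, and count detected components.

    y:         (n_studies, n_samples) expression matrix
    fit_model: placeholder callable returning a p_active matrix of shape
               (n_components, n_studies) for the permuted data
    A component counts as detected if its p_active exceeds `threshold`
    in at least one study. Returns one count per permutation.
    """
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_perm):
        # shuffle sample labels, breaking any genotype-expression link
        y_perm = y[:, rng.permutation(y.shape[1])]
        p_active = fit_model(y_perm)
        counts.append(int((p_active > threshold).any(axis=1).sum()))
    return counts
```

Under the null, most counts should be zero; persistent nonzero counts point at a calibration problem like the ones discussed here.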

As far as the permutation analysis goes, and what could be driving the remaining false positives, I think we need to consider (1) the population structure in the data (as indicated by your covariates) and (2) correlated measurement error among cell types/studies.

I'm not totally sure what role (1) might play, but I wouldn't be surprised if it leads to some weird behavior. (2) is a limitation of CAFEH, as it assumes the measurement error in each study is independent.