Closed DSuveges closed 2 months ago
The fix has been prototyped. The beta distribution before the fix:
# Assumes an active SparkSession named `spark`.
from pyspark.sql import functions as f

(
    spark.read.parquet('/Users/dsuveges/project_data/gentropy/credible_set/gwas_catalog_PICSed_curated_associations')
    .filter(
        (f.col('beta').isNotNull()) &
        (f.col('beta') > -1.5) &
        (f.col('beta') < 1.5)
    )
    .select('beta')
    .toPandas()
    .hist(bins=100)
)
Which becomes:
(
    beta_harmonised_df
    .filter(
        (f.col('beta').isNotNull()) &
        (f.col('beta') > -1.5) &
        (f.col('beta') < 1.5)
    )
    .select('beta')
    .toPandas()
    .hist(bins=100)
)
The one thing I am not getting from this is why the distribution reported initially looks a bit different from the one you, @DSuveges, were able to reproduce.
@project-defiant the plots I initially sent were just from a subset of studies that we were interested in looking at rather than the full catalog, which is probably why the distribution looks different.
As reported by Jake Fremier, in the PICSed, curated GWAS Catalog association pile, betas are not following the expected distribution:
Looking into the issue, it appears that betas are harmonised but odds ratios are ignored. As the association effect is stored in the same column regardless of whether it was an OR or a beta, ORs were just propagated as is. The required logic is already in the GWAS Catalog datasource code; it just needs to be "turned on".
Once that is done, make sure sufficient testing is also added to the codebase.
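For illustration only, the missing step could look roughly like the sketch below: an odds ratio is moved onto the beta (log-odds) scale with a natural logarithm, so that OR = 1 maps to beta = 0. This is not the GWAS Catalog datasource code; the function name `effect_to_beta` and the `is_odds_ratio` flag are hypothetical, and the real harmonisation may involve further steps (such as allele-direction handling) that are not shown here. The assertions at the end hint at the kind of unit tests that could accompany the change.

```python
import math
from typing import Optional


def effect_to_beta(effect: Optional[float], is_odds_ratio: bool) -> Optional[float]:
    """Return the reported effect on the beta (log-odds) scale.

    Hypothetical helper for illustration; not part of the codebase.
    """
    if effect is None:
        return None
    if not is_odds_ratio:
        # Already a beta: pass through unchanged.
        return effect
    if effect <= 0:
        # An odds ratio must be positive; treat anything else as missing.
        return None
    return math.log(effect)


# Minimal checks of the behaviour one would want covered by tests.
assert effect_to_beta(1.0, is_odds_ratio=True) == 0.0          # no effect
assert effect_to_beta(0.25, is_odds_ratio=False) == 0.25       # betas untouched
assert math.isclose(effect_to_beta(2.0, True), -effect_to_beta(0.5, True))
assert effect_to_beta(0.0, True) is None                       # invalid OR
assert effect_to_beta(None, True) is None                      # missing effect
```

If ORs are left unconverted in a shared effect column, their values cluster around 1 rather than 0, which is one way the combined "beta" histogram could end up looking different from what is expected.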