Closed by project-defiant 1 month ago
To generate a new curation file, run
make build
from the gentropy dev version. This uploads the gentropy package and config to the GCS bucket. These files are synced to the Dataproc cluster master and worker environments, and gentropy is installed via the install_dependencies pig job.
The result of the DAG run is the gs://genetics_etl_python_playground/input/v2d/genetics_etl_python_playground/input/v2d/GWAS_Catalog_study_curation_2024-08-12.tsv file. I have checked the file and found ~130 duplicated study IDs:
GWAS_catalog_study_curation_cuplidates_2024-08-12.txt.
The duplicates come from the published and unpublished summary statistics lists (from the GWAS Catalog metadata files): the same sumstats are listed both as published and as not published.
After removing the duplicates, the complete list contains 45173 unique study IDs. The list can be found in gs://genetics_etl_python_playground/input/v2d/genetics_etl_python_playground/input/v2d/GWAS_Catalog_study_curation_dedup_2024-08-12.tsv. I have also updated the curation list on Google Docs.
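For reference, the duplicate check described above can be sketched roughly as follows. This is only an illustration, not the code that produced the dedup file: the column names (studyId, published) are assumptions about the TSV layout, and keeping the first row per study ID is one possible rule among several.

```python
import csv
import io
from collections import Counter


def dedup_study_ids(tsv_text, id_column="studyId"):
    """Return (unique_rows, duplicated_ids), keeping the first row seen per study ID.

    `id_column` is a hypothetical column name; adjust it to the real header.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

    # Count how often each study ID occurs to report the duplicated ones.
    counts = Counter(row[id_column] for row in rows)
    duplicated_ids = sorted(sid for sid, n in counts.items() if n > 1)

    # Keep only the first occurrence of every study ID, preserving file order.
    seen = set()
    unique_rows = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])
            unique_rows.append(row)
    return unique_rows, duplicated_ids


# Toy input: GCST001 is listed once as published and once as unpublished.
example = "studyId\tpublished\nGCST001\tTrue\nGCST002\tTrue\nGCST001\tFalse\n"
unique_rows, duplicated_ids = dedup_study_ids(example)
print(len(unique_rows), duplicated_ids)
```

Which of the two rows to keep (published or unpublished) is a curation decision; the sketch simply keeps whichever comes first in the file.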
@xyg123 the prediction step was successful. You can check the results at
gs://genetics_etl_python_playground/releases/24.08+szsz/
During result validation, @xyg123 pointed out that all of the PICS credible sets have PosteriorProbability equal to 1. The genetics team tried to track down the cause of this PP distribution. Rerunning the Gwas_catalog_processing DAG fixed the issue.
However, a new issue with the eCAVIAR colocalisation step was raised when the recalculated credible sets were submitted.
After discussion with the genetics team, the release is postponed.
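The PICS symptom noted above (every PosteriorProbability equal to 1) could be caught early with a small sanity check along these lines. This is a minimal sketch on plain Python lists; the function name and tolerance are hypothetical, and in practice the values would first be read from the credible set output.

```python
def pp_all_one(posterior_probabilities, tol=1e-9):
    """Return True if every posterior probability is (numerically) equal to 1.

    An empty input returns False, since there is nothing to flag.
    """
    values = list(posterior_probabilities)
    if not values:
        return False
    return all(abs(pp - 1.0) <= tol for pp in values)


# Toy values only, not taken from the release.
healthy = [0.62, 0.25, 0.08, 1.0]
suspicious = [1.0, 1.0, 1.0]

print(pp_all_one(healthy), pp_all_one(suspicious))
```

A True result on a whole release would be a signal to inspect the fine-mapping inputs before moving on to colocalisation.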
As a developer I want to describe what the internal data release 2024.08 should cover
Background
Release intentions for 2024.08 genetics data release
Tasks
@addramir @tskir @Daniel-Considine @xyg123 FYI