Open d0choa opened 1 month ago
This data has been superseded by:
gs://ot_orchestration/releases/24.10_freeze3/
(generated by @project-defiant)
Changelog compared to 24.10_freeze2:
variant_index
- variants_to_vcf was failing due to heap space

A probably incomplete list of things that we know are pending in this dataset from a semantic perspective:
- Some finngen and eQTL catalogue credible sets contain duplicated variants inside the locus column
- PICS results have an issue in which variantId is not shown inside the locus object
- A fraction of QTLs (incl. iPS cells) are dropped due to EFO-based biosamples
- GWAS catalog study filtering based on curation + qc_sumstats
- hasSummaryStats field in the study index to be properly populated
- L2G evidence + any required adjustments
- Additional credible set validation (e.g. confirming PP sum 0.99-1)
- New GWAS Catalog SuSiE fine-mapping (pan-UKB LD reference, etc.)
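Two of the pending items (duplicated variants inside the locus column, and confirming that posterior probabilities sum to 0.99-1) lend themselves to a mechanical check. The following is a minimal sketch in plain Python, not the gentropy implementation. It assumes each credible set exposes a `locus` list whose entries carry `variantId` and a `posteriorProbability` value; the latter field name, the helper names, and the example data are assumptions made for illustration.

```python
from collections import Counter


def duplicated_variants(locus):
    """Return the variant IDs that appear more than once in a locus list."""
    counts = Counter(entry.get("variantId") for entry in locus)
    return sorted(v for v, n in counts.items() if v is not None and n > 1)


def pp_sum_in_range(locus, low=0.99, high=1.0, tol=1e-6):
    """Check that the posterior probabilities sum to a value within [low, high]."""
    total = sum(entry.get("posteriorProbability", 0.0) for entry in locus)
    return low - tol <= total <= high + tol


# Hypothetical credible set, for illustration only.
example_locus = [
    {"variantId": "1_100_A_G", "posteriorProbability": 0.60},
    {"variantId": "1_200_C_T", "posteriorProbability": 0.30},
    {"variantId": "1_100_A_G", "posteriorProbability": 0.095},
]

print(duplicated_variants(example_locus))
print(pp_sum_in_range(example_locus))
```

The small tolerance guards against floating-point rounding when the sum sits right at a boundary; the bounds themselves follow the 0.99-1 range mentioned in the list.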
@DSuveges will report some metrics to make sure data is within the expected ranges
@d0choa https://github.com/opentargets/orchestration/pull/41#event-14696106544 <- configuration update for the last run
24.10_freeze4
> gsutil ls gs://ot_orchestration/releases/24.10_freeze4/
gs://ot_orchestration/releases/24.10_freeze4/locus_to_gene_gold_standard.json
gs://ot_orchestration/releases/24.10_freeze4/biosample_index/
gs://ot_orchestration/releases/24.10_freeze4/colocalisation/
gs://ot_orchestration/releases/24.10_freeze4/credible_set/
gs://ot_orchestration/releases/24.10_freeze4/gene_index/
gs://ot_orchestration/releases/24.10_freeze4/invalid_credible_set/
gs://ot_orchestration/releases/24.10_freeze4/invalid_study_index/
gs://ot_orchestration/releases/24.10_freeze4/locus_to_gene_feature_matrix/
gs://ot_orchestration/releases/24.10_freeze4/locus_to_gene_model/
gs://ot_orchestration/releases/24.10_freeze4/locus_to_gene_predictions/
gs://ot_orchestration/releases/24.10_freeze4/manifests/
gs://ot_orchestration/releases/24.10_freeze4/study_index/
gs://ot_orchestration/releases/24.10_freeze4/variant_index/
gs://ot_orchestration/releases/24.10_freeze4/variants/
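A listing like the one above can be turned into a quick completeness check before loading a freeze. This is a rough sketch only: it parses text in the shape of `gsutil ls` output (abbreviated here to three lines from the listing), and the `required` set and the helper name `dataset_names` are assumptions, not something defined by the pipeline.

```python
def dataset_names(listing, release_prefix):
    """Extract top-level entry names from `gsutil ls`-style output."""
    names = set()
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith(release_prefix):
            # Drop the release prefix and any trailing slash.
            name = line[len(release_prefix):].strip("/")
            if name:
                names.add(name)
    return names


PREFIX = "gs://ot_orchestration/releases/24.10_freeze4/"

# Abbreviated copy of the listing, for illustration.
listing = """\
gs://ot_orchestration/releases/24.10_freeze4/credible_set/
gs://ot_orchestration/releases/24.10_freeze4/study_index/
gs://ot_orchestration/releases/24.10_freeze4/variant_index/
"""

# Datasets a downstream loader might require (an assumed list).
required = {"credible_set", "study_index", "variant_index", "colocalisation"}
missing = required - dataset_names(listing, PREFIX)
print(sorted(missing))
```

Run against the full listing, `missing` would be empty for this `required` set, since `colocalisation/` is present in the release.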
Changelog for 24.10_freeze4:
All the changes above should not structurally change the data, but they will fix some issues in the FE. I mentioned to @jdhayhurst that we have made a schema change in studyIndex regarding the sum stats QC columns. I can't remember if we picked up these changes for this data release, but if you experience issues with the study index, this is why.
Known issues as of 22nd of October:
- finemappingMethod is misspelt SuSie instead of SuSiE
@jdhayhurst The changes in the study index are referring to the sumstatQCValues key:value map we discussed.
Is it finished?
30/10: Will wait until Friday for 24.10_freeze6 before BE loads the data. If it is not ready by then, load freeze5.
Changelog for 24.10_freeze5
credible_set_qc step performed on them
30584 rows
263732 rows
gwas_catalog_pics
studyIndex for validation of gwas catalog studies
ABNORMAL_PIPS flag in the credible sets qc flags
gwas_catalog_top_hits credible sets now have variantId correctly mapped to studyId
This is a new iteration of the data, making #3567 obsolete:
A frozen description of all schemas can be found here.
The changes in the schemas, added to the ones described in #3567, are:
- log2h4h3 has been dropped in the colocalisation schema

The full changelog of this dataset compared to 24.10_freeze1:
- biosampleId
- log2h4h3 dropped from the coloc schema

A probably incomplete list of things that we know are pending in this dataset from a semantic perspective:
- locus column

This pipeline run was based on the gentropy szsz-nonmandatory-params-to-coloc-step (due to some adjustments required on the coloc priors) and do_etl_quick_iterations in the orchestration package. I will try to make the changes to dev in both repositories as it gave us an almost perfect run. (@ireneisdoomed, it crashed feature matrix because of the broadcasting issue. I disabled auto-broadcast through configuration similar to train/predict and it worked well)
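For reference, disabling automatic broadcast joins in Spark is normally done by setting `spark.sql.autoBroadcastJoinThreshold` to `-1`. The thread does not show how the orchestration configuration passes this through, so the sketch below only builds the setting as a plain dictionary; the helper name and the example base setting are assumptions.

```python
def spark_conf_without_auto_broadcast(base_conf=None):
    """Return a copy of base_conf with automatic broadcast joins disabled."""
    conf = dict(base_conf or {})
    # A threshold of -1 tells Spark SQL never to auto-broadcast a join side.
    conf["spark.sql.autoBroadcastJoinThreshold"] = "-1"
    return conf


# Hypothetical base setting, for illustration only.
base = {"spark.sql.shuffle.partitions": "800"}
conf = spark_conf_without_auto_broadcast(base)
print(conf)
```

With PySpark, each pair could then be applied through `SparkSession.builder.config(key, value)` before the session is created. Explicit broadcast hints in the code are not affected by this threshold.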