Prototyping release data folder structure

DSuveges commented 5 months ago

Currently the input/output files are a bit all over the place. We need to come up with a plan that sorts them out sensibly.

Expectation:

[ ] ETL output data.
[ ] ETL input files produced by upstream processes.
[ ] Input for upstream processes.
[ ] Location of datasets that are not expected to be updated for each release.

Also, config files need to be updated accordingly.

DSuveges commented 5 months ago

We have different data sources: some we generate ourselves, some are completely static, some are somewhat static, while others are highly dynamic. To represent this complexity consistently, I think we need to completely separate the data sources and the processes on them from the actual data release. I think we should organise data as follows for big sources:

|- dataset
    |- raw (optional, depends on source, any format)
    |- harmonised_summary_statistics
    |- manifests 
    |- study-index (parquet)
    |- study_locus (parquet)
    |- credible_sets (parquet)

Which would mean the following:

GWAS Catalog:

|- gwas_catalog
    |- raw-harmonised-summary-statistics (harmonised summary statistics in .tsv.gz)
    |- harmonised-summary-statistics (processed summary statistics in parquet)
    |- raw-curation
        |- gwas_catalog_curated_associations.tsv
        |- gwas_catalog_studies.tsv
        |- gwas_catalog_unpublished_studies.tsv
        |- gwas_catalog_ancestries.tsv
        |- gwas_catalog_unpublished_ancestries.tsv
        |- gwas_catalog_summary_statistics_look_up_table.txt
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- study-index (parquet)
    |- study-locus (parquet)
        |- gwas_catalog_curation
        |- gwas_catalog_summary_statistics (window based clumped)
    |- credible-sets (parquet)
        |- gwas_catalog_curation (ld-based clumped, pics)
        |- gwas_catalog_summary_statistics (ld-based clumped, pics)

The gwas_catalog_preprocess and the gwas_catalog_harmonisation DAGs would work on these folders.

Finngen:

|- finngen
    |- r10
        |- harmonised-summary-statistics (parquet)
        |- study-index (parquet)
        |- study-locus (parquet)
            |- finngen_summary_statistics (window based clumped)
        |- credible-sets (parquet)
            |- finngen_summary_finemapping (susie)
            |- finngen_summary_statistics (ld-based clumped, pics)

We don't need a folder for raw data, as it is already on GCP.
For each release the data can be considered static.

eQTL Catalog:

|- eqtl-catalog
    |- manifests
        |- tabix_ftp_paths_imported.tsv (file listing resources to ingest)
    |- study_index (parquet)
    |- study_locus (parquet)
        |- eqtl-catalog (window based clumped)
    |- credible-sets (parquet)
        |- eqtl-catalog (window based clumped)

Static assets

There are a number of datasets that are completely static:

|- static_assets
    |- variant_annotation (gnomad version should be noted)
    |- ld_index (gnomad version should be noted)
    |- interval_inputs
        |- anderson
        |- javierre
        |- jung
        |- thurman
    |- chain_file(s)
    |- consequence_map

Release

Upon starting a release, the process should collect necessary input files from datasource specific folders:

|- 24.01
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- inputs
        |- l2g_gold_standard
        |- gene_interaction (from platform)
        |- target_index (from platform)
        |- study_index
            |- gwas_catalog
            |- finngen
            |- eqtl_catalog
        |- credible_sets
            |- gwas_catalog_curation (ld-based clumped, pics)
            |- gwas_catalog_summary_statistics (ld-based clumped, pics)
            |- finngen (ld-based clumped, pics)
            |- eqtl_catalog (ld-based clumped, pics)
    |- outputs
        |- colocalisation
        |- locus2gene_model
        |- locus2gene_prediction
        |- variant2gene
        |- variant_index

The manifest folder would be a collection of manifest files from all sources
The input folder would contain datasets changing from release to release + ETL input apart from static assets.

DSuveges commented 5 months ago

Important to note that the output folder does not contain summary statistics. That piece of data can be shared with partners as links pointing to the right dataset.

DSuveges commented 5 months ago

Conventions we might decide to follow:

[ ] No underscores just hyphens in folder names (how about files?)
[ ] All abbreviations are expanded
[ ] All lowercase
[ ] The main config yaml should contain as little configuration as possible, e.g. only the datasource_bucket and release. The structure of the datasource bucket should be implicitly coded by the step specific configuration.
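
A minimal sketch of the last point, assuming the config is modelled as Python dataclasses. The class names `MainConfig` and `StudyIndexStepConfig`, the bucket `gs://example_datasource_bucket`, and the assumption that the release folder sits in the same bucket are all hypothetical; only the idea that the main config holds just `datasource_bucket` and `release` comes from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MainConfig:
    """Top-level config: only the two values named in the convention."""

    datasource_bucket: str  # hypothetical, e.g. "gs://example_datasource_bucket"
    release: str  # e.g. "24.01"


@dataclass(frozen=True)
class StudyIndexStepConfig:
    """Hypothetical step-specific config that encodes the folder layout."""

    main: MainConfig
    source: str  # e.g. "gwas_catalog"

    @property
    def input_path(self) -> str:
        # The layout of the datasource bucket is coded here, not in the main config.
        bucket = self.main.datasource_bucket.rstrip("/")
        return f"{bucket}/{self.source}/study_index"

    @property
    def output_path(self) -> str:
        # Assumption: the release folder lives in the same bucket.
        bucket = self.main.datasource_bucket.rstrip("/")
        return f"{bucket}/{self.main.release}/study_index/{self.source}"


main = MainConfig(datasource_bucket="gs://example_datasource_bucket", release="24.01")
step = StudyIndexStepConfig(main=main, source="gwas_catalog")
print(step.input_path)
print(step.output_path)
```

With this shape, starting a new release would only mean changing `release` in the main config; the step configs would pick up the new paths on their own.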

addramir commented 5 months ago

Looks good. Minor comment - maybe not really related to the structure - l2g_gold_standard and l2g_model are the data that is going to be available only for partners for a while - keep it in the place with unified sumstats?

ireneisdoomed commented 5 months ago

@addramir Isn't all data going to be open to partners for now? I understand that the release bucket won't be public for now.

addramir commented 5 months ago

We have to decide. If we want all data (including LDIndex and V2G) to be private - then yes, all data will be open to partners. If we want some of the data to be public and the rest private - we probably need to change the structure. But after formulating this question I realised that it is too difficult. Let us keep ALL the data private for partners only. And we will give access to the whole bucket. We need to tag Annalisa here.

ireneisdoomed commented 5 months ago

This is great @DSuveges!

Re: the suggested structure for the release bucket

Is there a major conceptual difference between manifests and inputs? Not clear to me. We could bring manifests inside inputs.
I'd also add gene_index. Pipeline-wise, it is only used for V2G but the metadata could be useful, e.g. to enrich the result of L2G predictions. I don't see any reason against it, it's a 3Mb file.
I understand why credible_set is part of inputs. However, for such a pivotal dataset, I think it is weird that it is not part of the outputs. Data consumers don't need to understand the separation between inputs/outputs. This dataset is currently 4Gb.

For the credible_set dataset, I'd maintain the structure inside the directory to separate between associations coming from summary statistics vs non-ss.

# nested
credible-set/
--- gwas-catalog-curation
--- from-summary-statistics
--- --- gwas-catalog
--- --- eqtl-catalogue
--- --- finngen
# or flattened
credible-set/
--- gwas-catalog-curation
--- gwas-catalog-summary-statistics
--- eqtl-catalogue-summary-statistics
--- finngen-summary-statistics

General comments

While it totally makes sense, the only thing I don't like about this source-derived structure is that we won't be able to provide harmonised summary statistics datasets under a common root. It depends on the application, but it seems interesting to me to read summary statistics all at once without browsing too much.
Agree with the lack of abbreviations. I'd suggest that we also avoid using 2 for to.
I'd use the same casing for folders and files. I have a preference for snake case (because that's what I'm used to), but no strong feelings.
It doesn't look like we need a "Genetics input support" as such to create a release (everything after preprocessing). All necessary files are in GCP; copying manifests and inputs into the release bucket could be a task in Airflow right after cluster creation. What do you think?
In summary, assuming we have a central bucket for summary statistics sharing, my understanding is that we'll share 3 buckets with partners:
```
static_assets (with variant_annotation, and ld_index)
harmonised_summary_statistics
release_bucket
```

DSuveges commented 5 months ago

@addramir, @ireneisdoomed I'm trying to answer all the questions:

l2g_gold_standard and l2g_model are the data that is going to be available only for partners for a while - keep it in the place with unified sumstats? ... If we want some of the data to be public and the rest private - we probably need to change the structure

In the current release, all of these folders are shared only with partners. Later we can decide how to increase the granularity of access. I see value in making the model available only to partners, while still allowing everyone to train their own. The same goes for summary statistics. At this point this is not a question: I think that until there is an actual publicly available product built on top of this dataset, we don't need to release it publicly.

Is there a major conceptual difference between manifests and inputs?

Manifests are not really inputs. They are rather an explanation of what was in the input. In general, I don't really like the idea of manifests, but in this case it is worth sharing why certain summary statistics were not ingested. But again, it is not 100% clear what should be shared and how. I think it would be great if the release folder contained all parameters that were used for the pipelines, e.g. r2 thresholds etc.

I'd also add gene_index. Pipeline-wise, it is only used for V2G but the metadata could be useful, e.g. to enrich the result of L2G predictions. I don't see any reason against it, it's a 3Mb file.

I was thinking about it, but "enriching" doesn't mean anything in our current setting, because there's no product built on top of the ETL output. Once there is an API, we'll need to provide the specific aggregations (which might include datasets enriched with target metadata) anyway. But we are not there yet.

However, for such a pivotal dataset, I think it is weird that it is not part of the outputs.

We had a chat about it with @d0choa and we concluded that there is no clear distinction between input and output in case of genetics. There would be no such separation.

|- 24.01
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- l2g_gold_standard
    |- gene_interaction (from platform)
    |- target_index (from platform)
    |- study_index
        |- gwas_catalog
        |- finngen
        |- eqtl_catalog
    |- credible_sets
        |- gwas_catalog_curation (ld-based clumped, pics)
        |- gwas_catalog_summary_statistics (ld-based clumped, pics)
        |- finngen (ld-based clumped, pics)
        |- eqtl_catalog (ld-based clumped, pics)
    |- colocalisation
    |- locus2gene_model
    |- locus2gene_prediction
    |- variant2gene
    |- variant_index

For the credible_set dataset, I'd maintain the structure inside the directory to separate between associations coming from summary statistics vs non-ss

Why? Is there a purpose to code this information in the folder structure? Any metadata that would affect the downstream applicability of the dataset should be in the data itself.

While it totally makes sense, the only thing I don't like about this source-derived structure is that we won't be able to provide harmonised summary statistics datasets under a common root. It depends on the application, but it seems interesting to me to read summary statistics all at once without browsing too much.

I disagree. Currently, we don't have a product that would use a shared summary statistics dataset. At this point, creating this view on the data for each release just wastes a lot of disk space. Knowing the location of all harmonised summary statistics, one can still read the entire dataset.
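
As a rough illustration of that last sentence: a small helper can hold the per-source locations and return them as one list for a reader to consume. The helper name and the `gs://example_bucket/...` roots are placeholders for illustration, not agreed locations.

```python
def harmonised_summary_statistics_paths(source_roots: dict[str, str]) -> list[str]:
    """Return the harmonised summary statistics folder of every source.

    `source_roots` maps a source name to its root folder. Both the names
    and the roots used below are placeholders.
    """
    return [
        f"{root.rstrip('/')}/harmonised_summary_statistics"
        for _, root in sorted(source_roots.items())  # sorted by source name
    ]


paths = harmonised_summary_statistics_paths(
    {
        "gwas_catalog": "gs://example_bucket/gwas_catalog",
        "finngen": "gs://example_bucket/finngen/r10",
    }
)
print(paths)
```

With PySpark, `spark.read.parquet(*paths)` accepts several paths, so the combined dataset could be loaded in one call without materialising a copy per release, provided the schemas are compatible.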

Agree with the lack of abbreviations. I'd suggest that we also avoid using 2 for to.

OK, makes sense. I just didn't want to be that radical. :D

I'd use the same casing for folders and files. I have a preference for snake case (because that's what I'm used to), but no strong feelings.

Same.

It doesn't look like we need a "Genetics input support"

No, totally not. I meant we only need to have a process in place that does what an input support would do. In this case the logic is extremely slim.

ireneisdoomed commented 5 months ago

Thank you for the clarifications @DSuveges

Why? Is there a purpose to code this information in the folder structure? Any metadata that would affect the downstream applicability of the dataset should be in the data itself.

To be more transparent about data provenance. I agree that the filename is not the best place for this to live, but until this is in the data, I don't see a reason against it.

I don't particularly love having L2G inputs in the root folder, but it's fine.

d0choa commented 5 months ago

In the future, this might change but I believe the proposed structure is the right one for now.

Inputs/outputs are artificial constructs that might have different meanings depending on the use or the process that creates them. Something might be an input of a step/process but simultaneously be an "output" dataset for the purpose of some applications (e.g. web). This mimics the flat structure we have in the platform which I would say had more benefits than disadvantages.

I also like the fact that throughout the last months, we didn't generate a lot of "unnecessary" intermediate datasets. The structure looks quite clean and relatively easy to understand as opposed to the production genetics portal.

DSuveges commented 5 months ago

Release output structure:

PR #425

As a gradual step towards restructuring the data inputs/outputs of the pipelines, I changed the following:

Removed remaining uk biobank configs.
Introduced "release_folder" based on a root + data release version.
Changed outputs and inputs of those datasets that are expected to show up in the release folder (would be part of the release)

Namely:

Study index - data copied.
Credible sets - data copied.
Locus to gene gold standard - data copied.
locus to gene model: expected to be generated by the etl.
locus to gene prediction: expected to be generated by the etl.
variant index: expected to be generated by the etl.
variant to gene: expected to be generated by the etl.
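
A minimal sketch of how the release folder and its contents could be derived from the list above. The names `release_folder`, `dataset_paths`, `RELEASE_DATASETS`, the root `gs://example_release_root` and the snake_case folder names are hypothetical; the actual configuration in PR #425 may look different.

```python
# Hypothetical mapping of release datasets to how they reach the release
# folder, following the list above: "copied" from a data source folder or
# "generated" by the ETL.
RELEASE_DATASETS = {
    "study_index": "copied",
    "credible_set": "copied",
    "locus_to_gene_gold_standard": "copied",
    "locus_to_gene_model": "generated",
    "locus_to_gene_prediction": "generated",
    "variant_index": "generated",
    "variant_to_gene": "generated",
}


def release_folder(root: str, version: str) -> str:
    """Build the release folder from a root and a data release version."""
    return f"{root.rstrip('/')}/{version}"


def dataset_paths(root: str, version: str, origin: str) -> list[str]:
    """List the release paths of all datasets with the given origin."""
    folder = release_folder(root, version)
    return [
        f"{folder}/{name}" for name, how in RELEASE_DATASETS.items() if how == origin
    ]


print(release_folder("gs://example_release_root", "24.01"))
print(dataset_paths("gs://example_release_root", "24.01", "copied"))
```

Keeping the copied/generated split in one place would make it easy for a release task to know which folders to populate before the ETL starts.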

GWAS Catalog structure

PR: #426

New location for all gwas catalog data: gs://gwas_catalog_data
All harmonised (harmonised_summary_statistics, ~5.7TB) and pre-harmonised (raw_summary_statistics, ~7.1TB) summary statistics are moved here.
Curated data is located under gs://gwas_catalog_data/curated_inputs/
These files are updated by calling the update_GWAS_Catalog_data.sh script in the utils folder.
The fetched files are no longer versioned or time-stamped. They have constant names that the main configuration can refer to (no update is required).
The version log generated upon data update is uploaded: manifests/GWAS_Catalog_curated_data_update.log. This file contains GWAS Catalog release date, version etc.
The update script also saves a snapshot from the study curation file from the curation repo into the manifest folder.
All files have underscores now.
The gwas preprocess dag runs.
The gwas harmonisation dag was also updated, but I'm not sure if it runs, as I could not update the raw sumstats folder (no scrum access)
As the business logic of the process did not change, I didn't do any deep QC on the results.

GWAS Catalog bucket structure:

gs://gwas_catalog_data/credible_set_datasets/
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_curated
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_summary_stats
gs://gwas_catalog_data/curated_inputs/
gs://gwas_catalog_data/harmonised_summary_statistics/
gs://gwas_catalog_data/manifests/
gs://gwas_catalog_data/raw_summary_statistics/
gs://gwas_catalog_data/study_index/
gs://gwas_catalog_data/study_locus_datasets/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_window_clumped/

The data in the study_index, credible_set_datasets, manifests and study_locus_datasets folders are regenerated by the gwas pre-process dag. The content of these folders can be propagated upon running a release.

Contents of the manifests folder

gwas_catalog_data_update.log - the log file generated upon refreshing curated GWAS Catalog data.
gwas_catalog_harmonised_sumstats_list.txt - list of studies with harmonised summary statistics we have ingested.
gwas_catalog_study_curation.tsv - curation table we generated in-house for studies with summary statistics
gwas_catalog_curated_included_studies - list of study ids that are eligible for ingestion in the curated path.
gwas_catalog_curation_excluded_studies - studies that were excluded from ingestion in the curated path + annotation on why the exclusion happened.
gwas_catalog_summary_statistics_excluded_studies - study ids that were excluded from summary statistics ingestion.
gwas_catalog_summary_statistics_included_studies - study ids that were eligible for summary statistics ingestion.

DSuveges commented 5 months ago

For finngen each data freeze can go into a separate folder:

gs://finngen_data/r10/
gs://finngen_data/r10/credible_set_datasets/
gs://finngen_data/r10/harmonised_summary_statistics/
gs://finngen_data/r10/study_index/
gs://finngen_data/r10/study_locus_datasets/

The content of the study locus folder indicates the general steps executed on the summary statistics:

gs://finngen_data/r10/study_locus_datasets/
gs://finngen_data/r10/study_locus_datasets/finngen_summary_statistics_ld_clumped/
gs://finngen_data/r10/study_locus_datasets/finngen_summary_statistics_window_clumped/

The credible sets folder contains the PICS-processed dataset, which may later be updated with the ingested fine-mapping dataset:

gs://finngen_data/r10/credible_set_datasets/finngen_summary_statistics_pics/
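
The per-freeze layout above can be expressed as a small path helper. This is only a sketch: the function name `finngen_paths` is hypothetical, while the bucket and folder names are the ones listed above.

```python
FINNGEN_BUCKET = "gs://finngen_data"

# Folder names taken from the listing above.
FINNGEN_DATASETS = (
    "credible_set_datasets",
    "harmonised_summary_statistics",
    "study_index",
    "study_locus_datasets",
)


def finngen_paths(data_freeze: str) -> dict[str, str]:
    """Map each dataset name to its folder for one FinnGen data freeze."""
    return {
        name: f"{FINNGEN_BUCKET}/{data_freeze}/{name}/" for name in FINNGEN_DATASETS
    }


paths = finngen_paths("r10")
print(paths["study_index"])
```

A later data freeze would then only need a different `data_freeze` argument, leaving the r10 folders untouched.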

DSuveges commented 2 months ago

Done.
