opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Prototyping release data folder structure #3193

Closed DSuveges closed 2 months ago

DSuveges commented 5 months ago

Currently the input/output files are a bit all over the place. We need to come up with a plan that sorts them out sensibly.

Expectation:

The config files also need to be updated accordingly.

DSuveges commented 5 months ago

We have different data sources: some we generate ourselves, some are completely static, some are somewhat static, and others are highly dynamic. To represent this complexity consistently, I think we need to completely separate the data sources and the processes run on them from the actual data release. For big sources, I think we should organise the data as follows:

|- dataset
    |- raw (optional, depends on source, any format)
    |- harmonised_summary_statistics
    |- manifests 
    |- study-index (parquet)
    |- study_locus (parquet)
    |- credible_sets (parquet)
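
A minimal sketch of how the per-source layout above could be expressed in code, assuming the folder names exactly as written in the tree. The `dataset_layout` helper and the `dataset` root are hypothetical and only illustrate the idea; they are not part of the pipeline.

```python
from pathlib import PurePosixPath
from typing import Dict

# Sub-folders copied from the proposed layout above; "raw" is optional.
DATASET_FOLDERS = (
    "harmonised_summary_statistics",
    "manifests",
    "study-index",
    "study_locus",
    "credible_sets",
)


def dataset_layout(root: str, include_raw: bool = False) -> Dict[str, str]:
    """Map each proposed sub-folder name to its path under a data source root."""
    folders = (("raw",) if include_raw else ()) + DATASET_FOLDERS
    base = PurePosixPath(root)
    return {name: str(base / name) for name in folders}


layout = dataset_layout("dataset", include_raw=True)
for name, path in layout.items():
    print(f"{name}: {path}")
```

Keeping the folder names in one place like this would make it easier to apply the same structure to every big source.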

Which would mean the following:

GWAS Catalog:

|- gwas_catalog
    |- raw-harmonised-summary-statistics (harmonised summary statistics in .tsv.gz)
    |- harmonised-summary-statistics (processed summary statistics in parquet)
    |- raw-curation
        |- gwas_catalog_curated_associations.tsv
        |- gwas_catalog_studies.tsv
        |- gwas_catalog_unpublished_studies.tsv
        |- gwas_catalog_ancestries.tsv
        |- gwas_catalog_unpublished_ancestries.tsv
        |- gwas_catalog_summary_statistics_look_up_table.txt
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- study-index (parquet)
    |- study-locus (parquet)
        |- gwas_catalog_curation
        |- gwas_catalog_summary_statistics (window based clumped)
    |- credible-sets (parquet)
        |- gwas_catalog_curation (ld-based clumped, pics)
        |- gwas_catalog_summary_statistics (ld-based clumped, pics) 

Finngen:

|- finngen
    |- r10
        |- harmonised-summary-statistics (parquet)
        |- study-index (parquet)
        |- study-locus (parquet)
            |- finngen_summary_statistics (window based clumped)
        |- credible-sets (parquet)
            |- finngen_summary_finemapping (susie)
            |- finngen_summary_statistics (ld-based clumped, pics)  

eQTL Catalog:

|- eqtl-catalog
    |- manifests
        |- tabix_ftp_paths_imported.tsv (file listing resources to ingest)
    |- study_index (parquet)
    |- study_locus (parquet)
        |- eqtl-catalog (window based clumped)
    |- credible-sets (parquet)
        |- eqtl-catalog (window based clumped)

Static assets

There are a number of datasets that are completely static:

|- static_assets
    |- variant_annotation (gnomad version should be noted)
    |- ld_index (gnomad version should be noted)
    |- interval_inputs
        |- anderson
        |- javierre
        |- jung
        |- thurman
    |- chain_file(s)
    |- consequence_map

Release

Upon starting a release, the process should collect necessary input files from datasource specific folders:

|- 24.01
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- inputs
        |- l2g_gold_standard
        |- gene_interaction (from platform)
        |- target_index (from platform)
        |- study_index
            |- gwas_catalog
            |- finngen
            |- eqtl_catalog
        |- credible_sets
            |- gwas_catalog_curation (ld-based clumped, pics)
            |- gwas_catalog_summary_statistics (ld-based clumped, pics)
            |- finngen (ld-based clumped, pics)
            |- eqtl_catalog (ld-based clumped, pics)
    |- outputs
        |- colocalisation
        |- locus2gene_model
        |- locus2gene_prediction
        |- variant2gene
        |- variant_index
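
The collection step described above could be sketched as a simple copy plan. This is only an illustration under stated assumptions: the `RELEASE_INPUTS` mapping lists a few entries taken from the trees above, the `release_copy_plan` helper is hypothetical, and the release folder is assumed to live under the same root as the data source folders.

```python
from pathlib import PurePosixPath
from typing import Dict

# Source folder (relative to the data root) -> folder under <release>/inputs.
# Only a few entries from the trees above are listed, for illustration.
RELEASE_INPUTS = {
    "gwas_catalog/study-index": "study_index/gwas_catalog",
    "finngen/r10/study-index": "study_index/finngen",
    "eqtl-catalog/study_index": "study_index/eqtl_catalog",
    "gwas_catalog/credible-sets/gwas_catalog_curation": "credible_sets/gwas_catalog_curation",
}


def release_copy_plan(data_root: str, release: str) -> Dict[str, str]:
    """Return {source_path: destination_path} for one release."""
    root = PurePosixPath(data_root)
    inputs = root / release / "inputs"
    return {str(root / src): str(inputs / dst) for src, dst in RELEASE_INPUTS.items()}


plan = release_copy_plan("data", "24.01")
for src, dst in plan.items():
    print(f"{src} -> {dst}")
```

The plan is only a mapping of paths; the actual copy would be done by whatever storage tooling the release process uses.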

DSuveges commented 5 months ago

Important to note that the output folder does not contain summary statistics. That piece of data can be shared with partners as links pointing to the right dataset.

DSuveges commented 5 months ago

Conventions we might decide to follow:

addramir commented 5 months ago

Looks good. Minor comment - maybe not really related to the structure - l2g_gold_standard and l2g_model are the data that is going to be available only for partners for a while - keep it in the place with unified sumstats?

ireneisdoomed commented 5 months ago

@addramir Isn't all data going to be open to partners for now? I understand that the release bucket won't be public for now.

addramir commented 5 months ago

We have to decide. If we want all data (including LDIndex and V2G) to be private - then yes, all data will be open to partners. If we want some of the data to be public and the rest private - we probably need to change the structure. But after formulating this question I realised that it is too difficult. Let us keep ALL the data private for partners only. And we will give access to the whole bucket. We need to tag Annalisa here.

ireneisdoomed commented 5 months ago

This is great @DSuveges!

Re: the suggested structure for the release bucket

General comments

DSuveges commented 5 months ago

@addramir, @ireneisdoomed I'm trying to answer all the questions:

l2g_gold_standard and l2g_model are the data that is going to be available only for partners for a while - keep it in the place with unified sumstats? ... If we want some of the data to be public and the rest private - we probably need to change the structure

In the current release, all of these folders are shared only with partners. Later we can decide how to make access more granular. I see value in making the model available only to partners, while still allowing everyone to train their own. The same goes for summary statistics. At this point this is not a question: I think that until there is an actual publicly available product built on top of this dataset, we don't need to release it publicly.

Is there a major conceptual difference between manifests and inputs?

Manifests are not really inputs. Rather, they explain what was in the input. In general, I don't really like the idea of manifests, but in this case it is worth sharing why a certain set of summary statistics was not ingested. But again, it is not 100% clear what should be shared and how. I think it would be great if the release folder contained all parameters that were used for the pipelines, e.g. r2 thresholds.
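
As a rough sketch of what such a parameter record might look like, the snippet below serialises a dictionary to JSON. Every parameter name and value here is a made-up placeholder (only the release name `24.01` comes from this thread), and `render_parameters` is a hypothetical helper.

```python
import json

# Hypothetical parameter names and values, for illustration only.
pipeline_parameters = {
    "release": "24.01",
    "ld_clumping": {"r2_threshold": 0.5},
    "window_clumping": {"window_size_bp": 500_000},
}


def render_parameters(parameters: dict) -> str:
    """Serialise parameters with sorted keys so files diff cleanly between releases."""
    return json.dumps(parameters, indent=2, sort_keys=True)


rendered = render_parameters(pipeline_parameters)
print(rendered)
```

A file like this could sit next to the manifests in the release folder, so the thresholds used for a release travel with the data.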

I'd also add gene_index. Pipeline-wise, it is only used for V2G but the metadata could be useful to enrich the result of L2G predictions, e.g. I don't see any reason against, it's a 3Mb file.

I was thinking about it, but "enriching" doesn't mean anything in our current setting, because there's no product built on top of the ETL output. Once there is an API, we'll need to provide the specific aggregations (which might include datasets enriched with target metadata) anyway. But we are not there yet.

However, for such a pivotal dataset, I think it is weird that it is not part of the outputs.

We had a chat about it with @d0choa and we concluded that there is no clear distinction between input and output in the case of genetics, so there would be no such separation:

|- 24.01
    |- manifests
        |- gwas_catalog_harmonised_summary_stats.txt
        |- gwas_catalog_curation_excluded_studies
        |- gwas_catalog_curation_ingested_studies
        |- gwas_catalog_sumstats_excluded_studies
        |- gwas_catalog_sumstats_ingested_studies
    |- l2g_gold_standard
    |- gene_interaction (from platform)
    |- target_index (from platform)
    |- study_index
        |- gwas_catalog
        |- finngen
        |- eqtl_catalog
    |- credible_sets
        |- gwas_catalog_curation (ld-based clumped, pics)
        |- gwas_catalog_summary_statistics (ld-based clumped, pics)
        |- finngen (ld-based clumped, pics)
        |- eqtl_catalog (ld-based clumped, pics)
    |- colocalisation
    |- locus2gene_model
    |- locus2gene_prediction
    |- variant2gene
    |- variant_index

For the credible_set dataset, I'd maintain the structure inside the directory to separate between associations coming from summary statistics vs non-ss

Why? Is there a purpose to code this information in the folder structure? Any metadata that would affect the downstream applicability of the dataset should be in the data itself.

While it totally makes sense, the only thing I don't like about this source-derived structure is that we won't be able to provide harmonised summary statistics datasets under a common root. It depends on the application, but it seems interesting to me to read summary statistics all at once without browsing too much.

I disagree. Currently, we don't have a product that would use a shared summary statistics dataset. At this point, creating this view of the data for each release would just waste a lot of disk space. Knowing the location of all harmonised summary statistics, one can still read the entire dataset.
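
To illustrate the point about reading everything from known locations, here is a minimal sketch that only builds the list of paths. The locations follow the source-specific folders proposed earlier in the thread, while the `data/` root and the `summary_statistics_paths` helper are assumptions.

```python
from typing import List

# Locations follow the source-specific folders proposed earlier in the thread.
SUMMARY_STATISTICS_LOCATIONS = [
    "gwas_catalog/harmonised-summary-statistics",
    "finngen/r10/harmonised-summary-statistics",
]


def summary_statistics_paths(data_root: str) -> List[str]:
    """Return the full path of every harmonised summary statistics folder."""
    root = data_root.rstrip("/")
    return [f"{root}/{location}" for location in SUMMARY_STATISTICS_LOCATIONS]


paths = summary_statistics_paths("data/")
print(paths)
# With PySpark, the whole list could then be read in one call, assuming the
# datasets share a compatible schema:
#   spark.read.parquet(*paths)
```

This gives a single logical view over all sources without materialising a combined copy for each release.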

Agree with the lack of abbreviations. I'd suggest that we also avoid using 2 for to.

OK, makes sense. I just didn't want to be that radical. :D

I'd use the same casing for folders and files. I have a preference for snake case (because that's what I'm used to), but no strong feelings.

Same.
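
The two mechanical conventions discussed above (snake case, and no "2" standing in for "to") could be checked with something like the following. The regular expressions and the `naming_problems` helper are illustrative assumptions, and abbreviations in general are not detected.

```python
import re

# Lower-case words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# A "2" squeezed between letters, as in "locus2gene", usually stands for "to".
TWO_FOR_TO = re.compile(r"[a-z]2[a-z]")


def naming_problems(name: str) -> list:
    """Return a list of convention violations for a folder or file name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake case")
    if TWO_FOR_TO.search(name):
        problems.append("uses 2 for to")
    return problems


for folder in ("study_index", "study-locus", "locus2gene_model", "variant2gene"):
    print(folder, naming_problems(folder))
```

Names such as `r10` pass both checks, because the digit pattern only matches a "2" with letters on both sides.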

It doesn't look like we need a "Genetics input support"

No, totally not. I meant we only need to have a process in place that does what an input support would do. In this case the logic is extremely slim.

ireneisdoomed commented 5 months ago

Thank you for the clarifications @DSuveges

Why? Is there a purpose to code this information in the folder structure? Any metadata that would affect the downstream applicability of the dataset should be in the data itself.

To be more transparent about data provenance. I agree that the filename is not the best place for this to live, but until this is in the data, I don't see a reason against it.

I don't particularly love having L2G inputs in the root folder, but it's fine.

d0choa commented 5 months ago

In the future, this might change but I believe the proposed structure is the right one for now.

Inputs/outputs are artificial constructs that might have different meanings depending on the use or the process that creates them. Something might be an input of a step/process but simultaneously be an "output" dataset for the purpose of some applications (e.g. web). This mimics the flat structure we have in the platform which I would say had more benefits than disadvantages.

I also like the fact that throughout the last months, we didn't generate a lot of "unnecessary" intermediate datasets. The structure looks quite clean and relatively easy to understand as opposed to the production genetics portal.

DSuveges commented 5 months ago

Release output structure:

PR #425

As a gradual step in restructuring the data inputs/outputs of the pipelines, I changed the following:

Namely:

GWAS Catalog structure

PR: #426

GWAS Catalog bucket structure:

gs://gwas_catalog_data/credible_set_datasets/
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_curated
gs://gwas_catalog_data/credible_set_datasets/gwas_catalog_summary_stats
gs://gwas_catalog_data/curated_inputs/
gs://gwas_catalog_data/harmonised_summary_statistics/
gs://gwas_catalog_data/manifests/
gs://gwas_catalog_data/raw_summary_statistics/
gs://gwas_catalog_data/study_index/
gs://gwas_catalog_data/study_locus_datasets/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_curated_associations_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_ld_clumped/
gs://gwas_catalog_data/study_locus_datasets/gwas_catalog_summary_stats_window_clumped/

The data in the study_index, credible_set_datasets, manifests and study_locus_datasets folders are regenerated by the GWAS pre-process DAG. The content of these folders can be propagated upon running a release.

Contents of the manifests folder

gwas_catalog_data_update.log - the log file generated upon refreshing curated GWAS Catalog data.
gwas_catalog_harmonised_sumstats_list.txt - list of studies with harmonised summary statistics we have ingested.
gwas_catalog_study_curation.tsv - curation table we generated in-house for studies with summary statistics
gwas_catalog_curated_included_studies - list of study ids that are eligible for ingestion in the curated path.
gwas_catalog_curation_excluded_studies - studies that were excluded from ingestion in the curated path + annotation on why the exclusion happened.
gwas_catalog_summary_statistics_excluded_studies - study ids that were excluded from summary statistics ingestion.
gwas_catalog_summary_statistics_included_studies - study ids that were eligible for summary statistics ingestion.
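
As a small illustration of how an exclusion manifest with annotations might be consumed, the sketch below counts exclusion reasons. The file format (one study per line, id and reason separated by a tab) is an assumption, the rows are placeholders rather than real studies, and `parse_exclusions` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict

# Assumed format: one study per line, "<study_id><TAB><reason>".
# The rows below are placeholders, not real studies or real reasons.
example_manifest = "STUDY_A\treason one\nSTUDY_B\treason two\nSTUDY_C\treason one\n"


def parse_exclusions(text: str) -> Dict[str, str]:
    """Parse an exclusion manifest into {study_id: reason}."""
    exclusions = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        study_id, _, reason = line.partition("\t")
        exclusions[study_id] = reason
    return exclusions


exclusions = parse_exclusions(example_manifest)
reason_counts = Counter(exclusions.values())
print(reason_counts.most_common())
```

A summary like this could be shared alongside a release to explain why certain studies were not ingested.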

DSuveges commented 5 months ago

For finngen each data freeze can go into a separate folder:

gs://finngen_data/r10/
gs://finngen_data/r10/credible_set_datasets/
gs://finngen_data/r10/harmonised_summary_statistics/
gs://finngen_data/r10/study_index/
gs://finngen_data/r10/study_locus_datasets/

The content of the study locus folder indicates the general steps executed on the summary statistics:

gs://finngen_data/r10/study_locus_datasets/
gs://finngen_data/r10/study_locus_datasets/finngen_summary_statistics_ld_clumped/
gs://finngen_data/r10/study_locus_datasets/finngen_summary_statistics_window_clumped/

The credible sets folder, meanwhile, contains the PICS fine-mapped dataset, which may later be updated with the ingested fine-mapping dataset:

gs://finngen_data/r10/credible_set_datasets/finngen_summary_statistics_pics/

DSuveges commented 2 months ago

Done.