Clarification input files V2D data pipeline

BjornWouters commented 5 years ago

A few of the input files needed to run the V2D pipeline do not seem to correspond to any publicly available file from the GWAS Catalog or the Neale UKBB dataset.

Some files that are available from Google Cloud Storage cannot be downloaded directly from the original source. I wasn't able to locate the following files, which are used as input for the V2D pipeline:

Files used for fine mapping:
- neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
- 50.nealeUKB_20170915.assoc.clean.tsv.gz

UK Biobank manifest files:
- phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
- phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
- neale_cateogries.v2.tsv (There's a typo in there)

Variant index file:
- variant-annotation.sitelist.tsv.gz

Could you clarify where these files come from and what transformations were applied to create them?

Thanks in advance,

Bjorn

forus commented 5 years ago

@edm1 Hi Ed, are there scripts or maybe some documentation on how to produce those input files?

edm1 commented 5 years ago

Hi @BjornWouters,

For completeness' sake, here are descriptions of the files you asked about:

neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
- deprecated in recent release!
- description: Output from the old fine mapping pipeline (tag: v1.0). Contains credible sets for all top loci.
- location: gs://genetics-portal-input/uk_biobank_analysis/em21/neale_summary_statistics_20170915/finemapping/results/neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
50.nealeUKB_20170915.assoc.clean.tsv.gz
- deprecated in recent release!
- description: an example summary stat input file from the old fine mapping pipeline. Required to map variant IDs to rsids in fine mapping output.
- location: copied to gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/cleaned_data/50.nealeUKB_20170915.assoc.clean.tsv.gz
phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
- deprecated in recent release!
- description: manual curation of EFOs for Neale lab ICD10 traits
- location: gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/efo_curation/v1_2018_07_05/phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
- deprecated in recent release!
- description: manual curation of EFOs for Neale lab self-reported traits
- location: gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/efo_curation/v1_2018_07_05/phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
neale_cateogries.v2.tsv
- deprecated in recent release!
- description: result of a manual curation of Neale study IDs with broad categories for the phewas plot
- location: gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/PheWAS_categories/neale_cateogries.v2.tsv
variant-annotation.sitelist.tsv.gz
- not deprecated
- description: a sitelist of variants in our variant index, required to map GWAS catalog variant IDs to IDs in our variant index. It is output by my variant annotation pipeline.
- location: gs://genetics-portal-data/variant-annotation/190129/variant-annotation.sitelist.tsv.gz

However, as I have just mentioned here, the scripts in this repo are highly bespoke. They normalise the data for input to genetics pipe. Different inputs would require processing suited to that specific dataset. The only reason to run this pipeline is to reproduce the files contained in gs://genetics-portal-data/v2d/.

edm1 commented 5 years ago

I have just realised you don't have access to gs://genetics-portal-input or gs://genetics-portal-data. FYI, Snakemake doesn't work with "Requester Pays" buckets, which is why we haven't made all buckets public.
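
For reference, a Requester Pays bucket can still be read outside Snakemake by naming a billing project explicitly, which gsutil accepts through its top-level `-u` option. The sketch below only builds and prints such a command for one of the locations above; it does not run it. The project ID `my-billing-project` and the destination directory are placeholders, and it is an assumption that the bucket would be exposed as Requester Pays at all.

```python
import shlex


def gsutil_copy_command(gs_uri, dest_dir, billing_project):
    """Build a gsutil argument list that copies one object and bills `billing_project`."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    # An argument list (rather than one string) keeps object names with spaces intact.
    return ["gsutil", "-u", billing_project, "cp", gs_uri, dest_dir]


# Placeholder values: the project ID and destination directory are hypothetical.
cmd = gsutil_copy_command(
    "gs://genetics-portal-data/variant-annotation/190129/variant-annotation.sitelist.tsv.gz",
    "./input/",
    "my-billing-project",
)

# shlex.join quotes each argument so the printed line can be pasted into a shell.
print(shlex.join(cmd))
```

Once gsutil is installed and authenticated, the same list could be passed to `subprocess.run(cmd, check=True)`. Building a list matters here because several of the manifest filenames contain spaces.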

It appears @mkarmona is copying data over to a single release bucket: gs://open-targets-genetics-releases. I will clarify with him and then copy over any required files.

forus commented 5 years ago

@edm1 Thanks a lot!

Clarification input files V2D data pipeline #2360