Closed BjornWouters closed 5 years ago
@edm1 Hi Ed, are there scripts or maybe some documentation on how to produce those input files?
Hi @BjornWouters,
For completeness' sake, here are the locations of the files you asked about:
neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
gs://genetics-portal-input/uk_biobank_analysis/em21/neale_summary_statistics_20170915/finemapping/results/neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
50.nealeUKB_20170915.assoc.clean.tsv.gz
gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/cleaned_data/50.nealeUKB_20170915.assoc.clean.tsv.gz
phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/efo_curation/v1_2018_07_05/phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/efo_curation/v1_2018_07_05/phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
neale_cateogries.v2.tsv
gs://genetics-portal-input/uk_biobank_data/em21/neale_summary_statistics_20170915/PheWAS_categories/neale_cateogries.v2.tsv
variant-annotation.sitelist.tsv.gz
gs://genetics-portal-data/variant-annotation/190129/variant-annotation.sitelist.tsv.gz
However, as I have just mentioned here, the scripts in this repo are highly bespoke. They normalise the data for input to genetics pipe. Different inputs would require processing suitable for that specific dataset. The only reason to run this pipeline is to reproduce the files contained in gs://genetics-portal-data/v2d/.
I have just realised you don't have access to gs://genetics-portal-input or gs://genetics-portal-data. FYI, Snakemake doesn't work with "Requester Pays" buckets, which is why we haven't made all buckets public.
It appears @mkarmona is copying over to a single release bucket: gs://open-targets-genetics-releases. I will clarify with him, then copy over any required files.
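For anyone scripting the copy once the files are in a readable bucket, here is a minimal sketch of splitting the gs:// paths listed above into bucket, object name and file name. The helper name `split_gs_uri` is hypothetical and not part of this repo; it only does string handling and does not contact Google Cloud Storage.

```python
def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/file' into (bucket, object_name, file_name).

    Hypothetical helper for illustration; pure string handling, no network access.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first '/' is the bucket, the rest is the object name.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI has no bucket or object name: {uri!r}")
    # The file name is the last path component; spaces are kept as-is.
    file_name = object_name.rsplit("/", 1)[-1]
    return bucket, object_name, file_name


bucket, object_name, file_name = split_gs_uri(
    "gs://genetics-portal-data/variant-annotation/190129/"
    "variant-annotation.sitelist.tsv.gz"
)
print(bucket)     # genetics-portal-data
print(file_name)  # variant-annotation.sitelist.tsv.gz
```

Note that the two curation manifest file names contain spaces, so their paths need quoting when passed on a command line.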
@edm1 Thanks a lot!
There are a couple of files needed for running the V2D pipeline that I cannot relate to any publicly available file published by the GWAS Catalog or the Neale UKBB dataset.
Some files that are available from the Google storage cannot be downloaded directly from the original source. I wasn't able to locate the following files that have been used as input for the V2D pipeline:
Files used for fine mapping:
neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
50.nealeUKB_20170915.assoc.clean.tsv.gz
UK Biobank manifest files:
phenosummary_final_11898_18597.fixed.curation_manifest - QC of ICD10 traits - Iteration 1.csv
phenosummary_final_11898_18597.fixed.curation_manifest - QC of self-reported traits - Iteration 1.csv
neale_cateogries.v2.tsv (There's a typo in there)
Variant index file:
variant-annotation.sitelist.tsv.gz
Would it be possible to clarify the origin of those files and any transformations that were made to create them?
Thanks in advance,
Bjorn