Closed eric-czech closed 4 years ago
I am also thinking that getting a very accurate reproduction (to floating point precision) is unlikely. If we want to describe how close our results end up being, we probably need some point of reference like showing that our results are much closer to the NealeLab v3 results than X. I'm not sure what that X should be yet.
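One way to put a number on "how close" is sketched below. It is only an illustration: the function name `compare_effects`, the choice of Pearson correlation plus maximum absolute difference, and the `chrom:pos:ref:alt` variant id format are all assumptions, not something agreed on in this thread.

```python
import math


def compare_effects(ours, reference):
    """Compare effect sizes for variants present in both result sets.

    Both arguments map a variant id (assumed here to look like
    "1:12345:A:G") to a beta estimate.
    """
    shared = sorted(set(ours) & set(reference))
    if len(shared) < 2:
        raise ValueError("need at least two shared variants")

    x = [ours[v] for v in shared]
    y = [reference[v] for v in shared]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Pearson correlation computed by hand to stay stdlib-only.
    # Note: this divides by zero if either set of betas is constant.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)

    return {
        "n_shared": len(shared),
        "pearson_r": cov / math.sqrt(var_x * var_y),
        "max_abs_diff": max(abs(a - b) for a, b in zip(x, y)),
    }
```

The same function could be run against whatever ends up being "X", so the two sets of numbers can be reported side by side.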
It looks like all the NealeLab Dropbox links are broken in both the GWAS v3 Results Spreadsheet and the Pan-Ancestry Results Spreadsheet. I could have sworn they were working just a week ago when I was downloading a few files to investigate. Maybe they moved that data?
I sent an email to nealelab.ukb@gmail.com per their FAQ on data questions, but it would be great to figure out what's going on sooner rather than later since the timing is unfortunate. I was just getting to a phase where I need it.
I don't think I can get the per-phenotype sample lists or the QC variant list from anywhere but their primary result set, though it looks like joining OTG v2d on the mappings in EFO-UKB-mappings could work for summary stats. For whatever reason, they don't put the actual UKB phenotype codes in the map and instead call them "bioentities" like "ukbiobank_1". It could be that the associations are with ICD10 codes that aren't necessarily 1:1 with data field ids. There also appears to be some link to trait codes in EFO-UKB-mappings#UK_Biobank_master_file.tsv but I'm not sure how that links back to the data field ids yet either.
Some of the UKB field ids are in there, as it turns out, but only for the traits that are already roughly equal to individual phenotypes. I was thrown off by the fields like 41202 (ICD10 codes) and 20002 which are broken out into separate phenotypes with codes that just don't look like UKB field ids.
Having access to the NealeLab results would still be better, but that's at least something to work off of in joining to OTG for summary stats.
One last wrinkle with OTG: the coordinates are GRCh38 while the UKB data is all GRCh37.
I'll stop pressing on this for now and wait until we figure out what happened to the NealeLab results.
Another option may be to contact the OTG team or Ed Mountjoy directly to ask if they have a copy of the Neale Lab results up on GCS some place. They likely also have some thoughts on how to prepare this data.
re: OTG
You can get sumstats for a certain phenotype like this:
gsutil -u $PROJECT_ID cp gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/981.neale2.gwas.imputed_v3.both_sexes.tsv.gz .
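Once a file like that is downloaded, something along these lines could pull the effect sizes out for comparison. This is a minimal sketch: the helper name `load_betas` is made up, and the `variant` and `beta` column names are an assumption based on the usual Neale lab v3 layout, so check the header of the actual file first.

```python
import csv
import gzip


def load_betas(path, variant_col="variant", beta_col="beta"):
    """Read a gzipped, tab-separated sumstats file into {variant id: beta}.

    The column names are assumptions; pass different ones if the header
    of the downloaded file does not match.
    """
    betas = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            value = row[beta_col]
            # Skip rows where the effect size is missing.
            if value in ("", "NA", "NaN"):
                continue
            betas[row[variant_col]] = float(value)
    return betas
```

Each file is several hundred MiB compressed, so holding every variant in a dict may be too much memory; filtering to a subset of variants inside the loop would be the obvious adjustment.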
re: UKB Pan-Ancestry
Summary stats are available (also requester pays) at gs://ukb-diverse-pops-public/sumstats_release/results_full.mt, which is apparently 12T! See https://pan.ukbb.broadinstitute.org/docs/hail-format/index.html.
I've got a note out to Konrad asking about the broken Dropbox links.
Also, any reason you're using "NealeLab" and not "Neale lab"?
Okay got some good info from Konrad:
Yep, we got banned by Dropbox. We're trying to get them restored but it may take a while, and we're exploring other hosting options.
You've already found the UKB Pan-Ancestry GCS data, which is where Konrad pointed me.
For the Neale lab results, he pointed me to gs://hail-datasets, specifically these files:
gs://hail-datasets/ukbb_imputed_v3_gwas_results_both_sexes.GRCh37.mt/
gs://hail-datasets/ukbb_imputed_v3_gwas_results_both_sexes.GRCh38.liftover.mt/
gs://hail-datasets/ukbb_imputed_v3_gwas_results_female.GRCh37.mt/
gs://hail-datasets/ukbb_imputed_v3_gwas_results_female.GRCh38.liftover.mt/
gs://hail-datasets/ukbb_imputed_v3_gwas_results_male.GRCh37.mt/
gs://hail-datasets/ukbb_imputed_v3_gwas_results_male.GRCh38.liftover.mt/
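Those six paths follow a regular pattern, so a small helper can build the one you want. The function name `neale_v3_mt_path` is hypothetical and the pattern is inferred only from the list above. The resulting path would presumably be passed to Hail's `hl.read_matrix_table`, though I haven't tried that against this bucket.

```python
def neale_v3_mt_path(sex="both_sexes", build="GRCh37"):
    """Build the gs://hail-datasets path for a Neale lab v3 MatrixTable.

    The naming pattern is inferred from the six paths listed above.
    """
    sexes = ("both_sexes", "female", "male")
    # The GRCh38 tables carry an extra ".liftover" marker in their name.
    build_suffix = {"GRCh37": "GRCh37", "GRCh38": "GRCh38.liftover"}

    if sex not in sexes:
        raise ValueError(f"sex must be one of {sexes}, got {sex!r}")
    if build not in build_suffix:
        raise ValueError(f"build must be GRCh37 or GRCh38, got {build!r}")

    return (
        "gs://hail-datasets/"
        f"ukbb_imputed_v3_gwas_results_{sex}.{build_suffix[build]}.mt/"
    )
```

Since the UKB data is GRCh37, the default build here is GRCh37; the liftover tables would only matter when joining against GRCh38 resources.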
It looks like you may be able to avoid the OTG issues w/ GRCh37 vs. GRCh38.
Going to close this out for now, since the OT results are broken out by phenotype and are already in a more convenient form (GRCh37 tsv). We'll have to pay to get them but it won't be much and I can download the separate files for comparison as needed. Exactly what phenotypes we want to ultimately compare results to is still up in the air -- I'll leave that to another issue.
For posterity:
gsutil -u $PROJECT_ID du -ch gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/
...
896.5 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/894.neale2.gwas.imputed_v3.both_sexes.tsv.gz
845.67 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/904.neale2.gwas.imputed_v3.both_sexes.tsv.gz
834.12 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/914_raw.neale2.gwas.imputed_v3.both_sexes.tsv.gz
890.97 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/924.neale2.gwas.imputed_v3.both_sexes.tsv.gz
831.42 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/93_raw.neale2.gwas.imputed_v3.both_sexes.tsv.gz
833.22 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/943.neale2.gwas.imputed_v3.both_sexes.tsv.gz
833.23 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/94_raw.neale2.gwas.imputed_v3.both_sexes.tsv.gz
842.15 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/971.neale2.gwas.imputed_v3.both_sexes.tsv.gz
835.85 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/981.neale2.gwas.imputed_v3.both_sexes.tsv.gz
836.44 MiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/991.neale2.gwas.imputed_v3.both_sexes.tsv.gz
3.86 TiB gs://genetics-portal-raw/uk_biobank_sumstats/neale_v2/output/neale_v2_sumstats/
3.86 TiB total
I would like to eventually compare summary stats to something. Options are:
Good contextual comparisons (but not our primary comparison) would be:
I would assume using the NealeLab results directly would be easiest because they're organized by UKB phenotype id and not EFO ID.