sgkit-dev / sgkit-publication

Sgkit publication repository

Use HAPNEST data for gwas demo? #43

Open jeromekelleher opened 10 months ago

jeromekelleher commented 10 months ago

The 1 million sample HAPNEST dataset (https://github.com/pystatgen/sgkit/discussions/1144#discussioncomment-7654640 ) seems ideal for our purposes.

Larger than UKB, and no messing around with data access problems. It also lets us showcase our plink format support.
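For illustration, a minimal sketch of what reading one HAPNEST chromosome with sgkit's plink reader might look like (the path and chunk size are placeholders, not the actual dataset layout, and `read_plink` needs the optional plink IO extra):

```python
# Illustrative sketch only: loading one HAPNEST chromosome from PLINK with
# sgkit. The path and chunk size are placeholders, not the real file layout.
import sgkit as sg
from sgkit.io.plink import read_plink

ds = read_plink(path="hapnest/example_chr21", chunks=10_000)

# Quick sanity checks on the Dask-backed dataset.
ds = sg.variant_stats(ds)
ds = sg.sample_stats(ds)
print(ds)
```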

Any thoughts @hammer ?

jeromekelleher commented 10 months ago

Also includes phenotypes, btw

jeromekelleher commented 9 months ago

The advantages of a fully reproducible analysis pipeline to go along with the paper seem compelling to me. Working with something like UKB inevitably introduces friction. This synthetic dataset has been carefully curated for realism, and I'm not sure what more we'd be showing by working with real data.

There's a neatness to demonstrating that we can work with two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

If we make it a requirement that all of the things that go into the paper are fully reproducible (which chimes well with the overall philosophy of openness), and we want to do something at the largest scale, then this seems like a great way to go.
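For reference, a rough sketch of the VCF side of that (file and store names are placeholders, and sgkit's VCF reader needs the optional cyvcf2-based extra):

```python
# Rough sketch of the VCF ingestion path; file and store names are placeholders.
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr  # optional dependency: cyvcf2

# One-off conversion of a simulated VCF into a chunked Zarr store...
vcf_to_zarr("simulated_chr21.vcf.gz", "simulated_chr21.zarr")

# ...which then opens lazily; everything downstream is the same as for a
# PLINK-backed dataset.
ds = sg.load_dataset("simulated_chr21.zarr")
print(ds)
```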

hammer commented 9 months ago

I will have a look this week! I've been using GitHub Codespaces so far for my explorations and will need to think about how to run the scaling experiments. We hit some scalability issues last time we tried to do a GWAS at the UKB scale (https://github.com/pystatgen/sgkit/issues/390), so I may also need to get some help resolving those issues.
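A minimal sketch of the kind of Dask setup those scaling experiments would presumably need (worker counts and memory limits are placeholders, not tuned values):

```python
# Minimal sketch of a Dask setup for the scaling runs; worker counts and
# memory limits are placeholders.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=2, memory_limit="16GB")
client = Client(cluster)
print(client.dashboard_link)  # useful for watching the GWAS tasks progress
```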

A quick look at the S-BSST936 listing shows the .bed files range from 27.64 GB (chr21) up to 141.37 GB (chr2). I wonder if anyone has already put this data on a cloud object store? I'll poke around a bit to save myself the download time.

> two different synthetic datasets at the 1 million sample scale, through both VCF and plink.

@jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

hammer commented 9 months ago

Some places to look for this data on cloud storage already:

jeromekelleher commented 9 months ago

> @jeromekelleher forgive my ignorance but do we have a VCF synthetic data set at this scale as well?

Yep - our data/basic compute task scaling figure goes up to a million samples, taken as subsets of the 1.4M samples in the simulations provided in this paper.

(Note: @benjeffery and I are planning to add another line for the SAV file format/C++ toolkit here. Fig is also quite drafty, obvs)

[fig1: data/basic compute task scaling figure, up to 1M samples]
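For context, the subsetting behind a figure like that presumably amounts to something along these lines (the Zarr store name, subset sizes, and the timed task below are placeholders):

```python
# Illustrative sketch of taking sample subsets for a scaling run; the store
# name, subset sizes, and timed task are placeholders.
import time
import sgkit as sg

ds = sg.load_dataset("simulated_1.4M_samples.zarr")

for n in [10_000, 100_000, 1_000_000]:
    subset = ds.isel(samples=slice(0, n))
    start = time.perf_counter()
    sg.variant_stats(subset).variant_allele_frequency.compute()
    print(n, f"{time.perf_counter() - start:.1f}s")
```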

hammer commented 9 months ago

Okay, I've figured out their FTP structure: everything is under ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files. I'll start moving it to a cloud store now.
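(Side note: the same listing can be walked programmatically — a sketch using fsspec's FTP filesystem against the directory above; the names and sizes printed are whatever the server reports:)

```python
# Sketch: walk the S-BSST936 FTP listing with fsspec to see file names and
# sizes before transferring anything.
import fsspec

fs = fsspec.filesystem("ftp", host="ftp.ebi.ac.uk")
base = "/biostudies/fire/S-BSST/936/S-BSST936/Files"
for entry in fs.ls(base, detail=True):
    print(entry["name"], entry.get("size"))
```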

For my reference, I'm using a command like:

```
curl ftp://ftp.ebi.ac.uk//biostudies/fire/S-BSST/936/S-BSST936/Files/example/<file> | gsutil cp - gs://<bucket>/<file>
```

Transfer speeds are not so bad: I'm seeing around 27 MiB/s, so chr21 should take about 17 minutes and chr1 probably 2 hours or so. I'll kick off the big transfer tomorrow.
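(Back-of-the-envelope check of those estimates, using the chr2 size from the listing above as a stand-in for chr1:)

```python
# Back-of-the-envelope transfer times at ~27 MiB/s, using sizes from the
# S-BSST936 listing (chr2 as a stand-in for chr1).
rate = 27 * 2**20  # bytes per second
for name, gigabytes in [("chr21 .bed", 27.64), ("chr2 .bed", 141.37)]:
    minutes = gigabytes * 1e9 / rate / 60
    print(f"{name}: ~{minutes:.0f} min")
# chr21 .bed: ~16 min
# chr2 .bed: ~83 min
```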

hammer commented 9 months ago

Okay, I've gotten our GWAS demo running using one chromosome and one phenotype from the example (600 subjects) dataset.

Notebook is at https://github.com/hammer/sgkitpub/blob/main/hapnest_gwas.ipynb

Some thoughts:

I will next try to scale to all chromosomes and all phenotypes on the example data, then go to the big dataset.
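For reference, the core of that presumably boils down to something along these lines — a hedged sketch, not the notebook's actual code: the PLINK path is a placeholder and the phenotype/covariate are random stand-ins rather than the real HAPNEST phenotype files:

```python
# Hedged sketch of the core GWAS step, not the notebook's actual code.
import numpy as np
import sgkit as sg
from sgkit.io.plink import read_plink

ds = read_plink(path="hapnest/example_chr21")  # placeholder path

# Alternate-allele dosage from the genotype calls (ignoring missingness here).
ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")

# Stand-in phenotype and covariate, attached along the samples dimension.
rng = np.random.default_rng(0)
n = ds.sizes["samples"]
ds["phenotype_1"] = ("samples", rng.normal(size=n))
ds["covariate_1"] = ("samples", rng.normal(size=n))

gwas = sg.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    covariates=["covariate_1"],
    traits=["phenotype_1"],
)
print(gwas)  # includes per-variant betas, t-values and p-values
```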

hammer commented 9 months ago

Just noting for myself that tools from other language ecosystems that might be fun to try out in this section would be https://github.com/privefl/bigsnpr (GWAS docs) and https://github.com/OpenMendel/MendelGWAS.jl