statgen / pheweb

A tool to build a website to browse hundreds or thousands of GWAS.
MIT License

Allow conditioning on a variant #45

Open · pjvandehaar opened this issue 7 years ago

pjvandehaar commented 7 years ago

Goncalo says that all you need is r between that variant and each other variant in your data.

Option 1:

Set up an LD server that references the raw data, probably by copying Daniel's HVCF. Having the raw data around means security gets complicated, and I don't think I want that.

Option 2: (Probably)

Pre-compute r for all variant pairs within 300kb from the raw data. That is, for each variant, store r against every variant in the next 300kb.
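A minimal sketch of what that pre-computation could look like, assuming a per-chromosome dosage matrix and sorted positions are already in memory; `positions`, `dosages`, and `WINDOW_BP` are made-up names for illustration, not anything that exists in pheweb:

```python
import numpy as np

WINDOW_BP = 300_000  # look-ahead window for each variant

def precompute_r(positions, dosages):
    """Yield (i, j, r) for every variant pair within WINDOW_BP.

    positions: sorted 1D array of bp positions for one chromosome.
    dosages:   2D array of shape (n_variants, n_samples), genotype dosages.
    """
    # Center each variant's dosages so that r is just a normalized dot product.
    centered = dosages - dosages.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1))
    for i, pos in enumerate(positions):
        # Every later variant whose position is within WINDOW_BP of this one.
        j_end = np.searchsorted(positions, pos + WINDOW_BP, side='right')
        for j in range(i + 1, j_end):
            denom = norms[i] * norms[j]
            if denom == 0:
                continue  # monomorphic variant: r is undefined
            yield (i, j, float(centered[i] @ centered[j]) / denom)
```

Each (i, j, r) row could then go into, e.g., a bgzipped, tabix-indexed file keyed by the first variant's position, which would be one candidate answer to the storage question below.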

Ways to store it:

dtaliun commented 7 years ago

Hi Peter,

I also have some Python code that uses raw tabix'ed VCFs + numpy linear algebra (which can be sped up by compiling against BLAS/LAPACK).
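Roughly, the idea is the sketch below (a guess at the shape of it, not the actual code; the GT-to-dosage handling and names are assumptions): fetch one region from the tabix'ed VCF with pysam, convert GT to 0/1/2 dosages, and let numpy's BLAS-backed `corrcoef` do the linear algebra.

```python
import numpy as np
import pysam

def region_ld(vcf_path, chrom, start, end):
    """Pairwise r matrix for all variants in one region of a tabix'ed VCF."""
    ids, rows = [], []
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf.fetch(chrom, start, end):
            # Convert GT to a 0/1/2 dosage; missing alleles are treated as ref here.
            rows.append([sum(a or 0 for a in s['GT']) for s in rec.samples.values()])
            alt = rec.alts[0] if rec.alts else '.'
            ids.append(f"{rec.chrom}:{rec.pos}:{rec.ref}:{alt}")
    geno = np.asarray(rows, dtype=float)  # shape: (n_variants, n_samples)
    r = np.corrcoef(geno)                 # full pairwise r; monomorphic variants come out as NaN
    return ids, r
```

Conditioning on one variant then only needs that variant's row of `r`.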

Daniel

On Feb 23, 2017, at 5:26 PM, Peter VandeHaar notifications@github.com wrote:

Goncalo says that all you need is r between that variant and each other variant in your data. (separately for cases and controls?)

Option 1: (I GUESS SO)

Set up an LD server that references the raw data, probably by copying Daniel's HVCF.

Option 2: (NAH)

Pre-compute r for all variant pairs within 300kb from the raw data.


pjvandehaar commented 7 years ago

If some 300kb regions have 10x the average variant density, that could somewhat increase the size of the pre-computed correlations. If some have 100x the average (i.e., 1% of all variants in 0.01% of the genome), we'll have a problem. Oh well, hopefully that doesn't happen.
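Rough arithmetic, with every number below assumed just to get an order of magnitude (the real TOPMed variant count is the open question further down):

```python
# Back-of-envelope for the size of the pre-computed r table.
# Every number here is an assumption for illustration only.
GENOME_BP = 3_000_000_000
N_VARIANTS = 500_000_000      # placeholder TOPMed-scale count, not a real figure
WINDOW_BP = 300_000
BYTES_PER_PAIR = 12           # e.g. two 4-byte variant indices + one 4-byte float r

avg_density = N_VARIANTS / GENOME_BP               # variants per bp
neighbors_per_variant = avg_density * WINDOW_BP    # variants in the next 300kb
n_pairs = N_VARIANTS * neighbors_per_variant
print(f"{n_pairs:.1e} pairs, ~{n_pairs * BYTES_PER_PAIR / 1e12:.0f} TB uncompressed")
# -> 2.5e+13 pairs, ~300 TB uncompressed
# A region with k times the average density contributes ~k^2 times as many
# pairs per bp, so a few 100x-dense regions could dominate the total.
```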

Maybe we should only allow conditioning on variants with pval < 1e-4. But if we want to support conditional meta-analysis, then we can't have a restriction like that.

How many variants will there probably be in TOPMed?

abecasis commented 7 years ago

Most variants and regions will hit p < 10^-4 for something.

The densest regions are around the HLA genes in the MHC on chromosome 6.

If we set this up right, we should only need one covariance table for many traits and variants in one PheWeb.

G


pjvandehaar commented 7 years ago

(While I'm doing this, remember to also use study-specific LD for showing LD in LocusZoom. I'm not sure how we'll handle LD for meta-analyses. Perhaps it'd be fun to let users toggle between 1000G and study-specific LD, etc.?)