szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
107 stars 33 forks source link

XP-EHH phased vs unphased #86

Open jaholliday opened 1 year ago

jaholliday commented 1 year ago

I have a dataset of ~10M sites per chromoromse and ~40 samples per 'population'. I ran the --unphased version, parallelized per chromosome, and it finished in a few days. The first (smallest) chromosome finished within about 36hrs. The same analysis with the phased option has not finished any chromosome after ~4days.

Our wall time limit is 6 days so I'm wondering what's likely to be the most powerful analysis: 1) Just go with unphased 2) Thin VCF and re-run phased analysis 3) Break up chromosomes into smaller chunks and further parallelize phased analysis

Thanks! Jason

szpiech commented 1 year ago

Hi Jason,

Hmm, I have noticed xp-ehh runs somewhat slowly, especially when using —pmap. Not sure why —pmap makes a difference, but the xp stats use all variants (filtering my MAF appears to hurt power in my tests) and the computation scales ~O(N^2) in variants. I assume you are also running on the cluster with multiple cores and —threads set to the appropriate number? I typically run on a node with 20 reserved cores, so I use —threads 20 to speed things up.

You might also consider xp-nsl, which seems to run faster as well.

If you still need to improve run time, you could split the chromosomes up, but I would recommend leaving some overlap of at least 1Mbp because otherwise scores near the breakpoints will be a little inaccurate (due to losing signal).

Let me know if this helps or if you have more questions.

Zachary

Le mar. 19 juil. 2022 à 11:29 AM, jaholliday @.***> a écrit :

I have a dataset of ~10M sites per chromoromse and ~40 samples per 'population'. I ran the --unphased version, parallelized per chromosome, and it finished in a few days. The first (smallest) chromosome finished within about 36hrs. The same analysis with the phased option has not finished any chromosome after ~4days.

Our wall time limit is 6 days so I'm wondering what's likely to be the most powerful analysis:

  1. Just go with unphased
  2. Thin VCF and re-run phased analysis
  3. Break up chromosomes into smaller chunks and further parallelize phased analysis

Thanks! Jason

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/86, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQS67YI3GRU2NZUEUYDVU3CVZANCNFSM54ANDF7Q . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jaholliday commented 1 year ago

Thanks for the reply. I've been giving each chromosome 20 threads. I'll test out giving it more and also try xp-nsl.

I get very clear signal from the --unphased xp-ehh run but am wondering if it's too good to be true and may reflect some bias I haven't accounted for unrelated to selscan.

szpiech commented 1 year ago

Hi Jason,

Did I see these results on twitter a few days ago? There were certainly some extraordinary peaks, if I recall correctly. I've never seen normalized xp-ehh scores so strong before, but I haven't worked on a huge diversity of species. Can you tell me a bit about your analyses?

Zachary

On Fri, Jul 22, 2022 at 8:46 PM jaholliday @.***> wrote:

Thanks for the reply. I've been giving each chromosome 20 threads. I'll test out giving it more and also try xp-nsl.

I get very clear signal from the --unphased xp-ehh run but am wondering if it's too good to be true and may reflect some bias I haven't accounted for unrelated to selscan.

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/86#issuecomment-1193024212, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQQ2GVBOATYW26T7AZLVVM6HXANCNFSM54ANDF7Q . You are receiving this because you commented.Message ID: @.***>

jaholliday commented 1 year ago

I got a message that maybe the figures were too big, so here they are as dropbox links

https://www.dropbox.com/s/a9ab5babifr78rq/Screen%20Shot%202022-10-11%20at%209.28.15%20AM.png?dl=0 https://www.dropbox.com/s/a9ab5babifr78rq/Screen%20Shot%202022-10-11%20at%209.28.15%20AM.png?dl=0

https://www.dropbox.com/s/vy5uc9077lfgvwl/Screen%20Shot%202022-10-11%20at%209.29.27%20AM.png?dl=0 https://www.dropbox.com/s/vy5uc9077lfgvwl/Screen%20Shot%202022-10-11%20at%209.29.27%20AM.png?dl=0

————————— Jason Holliday Professor Department of Forest Resources and Environmental Conservation Virginia Tech 451 Latham Hall Blacksburg VA 24061 540 231 7267 (Office) 540 553 4624 (Cell) https://virginiatech.zoom.us/j/4443630815

On Oct 11, 2022, at 9:50 AM, Jason Holliday @.***> wrote:

Hi Zachary,

Thanks for your message. Yah, I’m a little suspicious of this result and would be glad to get your opinion on it. We are using a variety of data (QTL, RNA-seq, comparative genomes) to try to find candidate genes for chestnut blight resistance. American chestnut was wiped out a century ago by a fungal pathogen native to Asia, where there are multiple resistant congeners. There is a second pathogen called Phytophthora that also is introduced and kills American chestnut, but to which Asian species are resistant. I thought one layer of evidence to find the genes would be regions under selection in one or more of the Asian species but not in American chestnut.

I ran XP-EHH in unphased mode for two comparisons: Chinese vs American chestnut and Japanese vs. American chestnut. All data were generated from short read WGS (~15X) aligned to the American chestnut reference genome. Below see a couple plots. The first has two tracks reflecting XP-EHH for those two tests above (Japanse-American and Chinese-American). Note that both give strong and partially overlapping signals (XP-EHH values were truncated at >3 for quicker plotting).

The second (double) plot compares the same (Chinese-American) XP-EHH scores (inner track) with QTL intervals/LOD scores (outer track) for blight and Phytopthora resistance in hybrid/backcross chestnut populations (Chinese x American followed by backcrossing to American). I haven’t quantified this but it sure looks like there is non-random overlap between XP-EHH signals and QTL intervals. These figures were for a talk so the arrows were just for illustration/examples. Note that the dotted lines were drawn wherever XP-EHH was above some threshold (I think >8) and are not a function of the QTL peaks. Also note that blight has complex genetics (lots of QTL), but Phytophthora is fairly simple (one big QTL on chr05, couple smaller ones elsewhere).

If you have suggestions to exclude possible artifacts let me know. I’m under no illusion that these are definitely real, and need to look more carefully at read mapping and possible SVs or repeats in these regions. But I’ve checked a few and they do not stand out in those aspects. These species are quite diverged (millions of years) but the genomes are highly syntenic and blight came on the scene in Asia much more recently than the North American-Asian chestnut species split, and likely has been an ongoing co-evolutionary selection pressure in Asia. Of course, there are surely plenty of other traits that differentiate the Asian and North American species, and so we’d expect signals unrrelated to disease resistance.

Cheers Jason <Screen Shot 2022-10-11 at 9.28.15 AM.png>

<Screen Shot 2022-10-11 at 9.29.27 AM.png> ————————— Jason Holliday Professor Department of Forest Resources and Environmental Conservation Virginia Tech 451 Latham Hall Blacksburg VA 24061 540 231 7267 (Office) 540 553 4624 (Cell) https://virginiatech.zoom.us/j/4443630815 https://virginiatech.zoom.us/j/4443630815

On Oct 11, 2022, at 3:08 AM, Zachary A Szpiech @. @.>> wrote:

Hi Jason,

Did I see these results on twitter a few days ago? There were certainly some extraordinary peaks, if I recall correctly. I've never seen normalized xp-ehh scores so strong before, but I haven't worked on a huge diversity of species. Can you tell me a bit about your analyses?

Zachary

On Fri, Jul 22, 2022 at 8:46 PM jaholliday @.***> wrote:

Thanks for the reply. I've been giving each chromosome 20 threads. I'll test out giving it more and also try xp-nsl.

I get very clear signal from the --unphased xp-ehh run but am wondering if it's too good to be true and may reflect some bias I haven't accounted for unrelated to selscan.

— Reply to this email directly, view it on GitHub <https://github.com/szpiech/selscan/issues/86#issuecomment-1193024212 https://github.com/szpiech/selscan/issues/86#issuecomment-1193024212>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABAKRQQ2GVBOATYW26T7AZLVVM6HXANCNFSM54ANDF7Q https://github.com/notifications/unsubscribe-auth/ABAKRQQ2GVBOATYW26T7AZLVVM6HXANCNFSM54ANDF7Q> . You are receiving this because you commented.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/86#issuecomment-1274187060, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADGAMPGIZ5TPEVWCIZPP4HDWCUG6DANCNFSM54ANDF7Q. You are receiving this because you authored the thread.