szpiech / lassip

LASSI-Plus: A program to calculate haplotype frequency spectrum statistics
GNU General Public License v3.0

A couple questions on running saltiLassi for multiple populations #7

Open James-S-Santangelo opened 4 weeks ago

James-S-Santangelo commented 4 weeks ago

Hey Zachary,

I'm hoping you can help me think through an analysis I'm working on. I have two populations, which we can assume are panmictic. I have run cross-population stats (e.g., XP-nSL, Fst) and single-population stats (e.g., nSL, iHH12, iHS, saltiLassi), and have found multiple regions with signatures of positive selection in each population based on these stats. Broadly, I'm interested in a first-pass characterization and comparison of the sweep architectures in these two populations. Here are my questions:

  1. Is there any way to compare the haplotype frequency spectra between the two populations? I have run saltiLassi in 201-SNP windows with a step of 50 and K set to 20. However, since lassip is run on each population independently, my sense is that haplotype 1 in pop 1 is not the same as haplotype 1 in pop 2, so raw comparisons of the haplotype frequency spectra in a given region between these two populations would be misleading.

  2. In each population, I have estimates of m (i.e., number of sweeping haplotypes) and A (i.e., width of sweeps) for putatively selected regions of the genome. I'm interested in comparing m and A between these populations to make broad statements about differences in hard (m = 1) vs. soft (m > 1) sweeps, and similarly in the width of sweeps (A). I was planning on binning m into m = 1 and m > 1 (as you suggest in the paper) and probably just running a simple chi-squared test on the frequency of hard vs. soft sweeps between the two populations. Again, this is just a first pass and is mostly meant to stimulate discussion and suggest avenues for future work. Does this seem reasonable, or is there any reason you can think of that such an approach would be misleading or downright wrong?

Thanks in advance!

James

szpiech commented 3 weeks ago

Hi James,

So, I think your best bet for (1) is to take the coordinates that define the windows you're interested in, intersect them so they align maximally, and compute the inter-group pairwise sequence distance. If it is "low", the same haplotypes might be sweeping in both populations; if it is "high", they may be different. I suppose you could try to define "low" and "high" based on some sort of genome-wide resampling procedure, but you'd have to design it carefully.
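As a rough illustration of the distance idea, here is a minimal sketch in plain Python. It assumes the haplotypes from both populations have already been extracted over the same intersected SNPs and coded as 0/1 sequences. The function names, the toy haplotypes, and the list of null distances are all hypothetical, and the empirical quantile is just one possible way to calibrate "low" vs. "high", not a procedure taken from lassip.

```python
from itertools import product


def mean_between_group_distance(haps_a, haps_b):
    """Mean per-site Hamming distance over all (pop A, pop B) haplotype pairs.

    haps_a, haps_b: lists of equal-length 0/1 sequences covering the same SNPs.
    """
    n_snps = len(haps_a[0])
    if any(len(h) != n_snps for h in list(haps_a) + list(haps_b)):
        raise ValueError("all haplotypes must cover the same SNPs")
    total = 0
    for ha, hb in product(haps_a, haps_b):
        # Count sites at which this pair of haplotypes differs.
        total += sum(x != y for x, y in zip(ha, hb))
    return total / (len(haps_a) * len(haps_b) * n_snps)


def empirical_quantile(observed, null_distances):
    """Fraction of null distances that are <= the observed distance."""
    return sum(d <= observed for d in null_distances) / len(null_distances)


# Toy data only (made up for illustration): 2 haplotypes per population, 4 SNPs.
pop1 = [[0, 0, 1, 1], [0, 0, 1, 1]]
pop2 = [[0, 0, 1, 1], [1, 1, 0, 0]]
d_obs = mean_between_group_distance(pop1, pop2)

# Hypothetical distances from windows resampled elsewhere in the genome.
null = [0.05, 0.20, 0.30, 0.40]
print(d_obs, empirical_quantile(d_obs, null))
```

In practice one might restrict the comparison to the high-frequency haplotypes in each window rather than all sampled haplotypes, and the null windows would need to be matched on things like SNP count and recombination rate, which is where the careful design comes in.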

For (2), your chi-squared approach seems reasonable to me.
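For concreteness, a minimal sketch of that test on a 2x2 table of hard (m = 1) vs. soft (m > 1) sweep counts might look like the following. The counts are invented purely for illustration, `chi2_2x2` is a hypothetical helper, and this is the plain Pearson statistic without a continuity correction; with small expected counts, an exact test would likely be the safer choice.

```python
import math


def chi2_2x2(hard_1, soft_1, hard_2, soft_2):
    """Pearson chi-squared test (1 d.f., no continuity correction) for a 2x2 table.

    Rows are populations; columns are hard (m = 1) and soft (m > 1) sweep counts.
    Returns (chi2_statistic, p_value).
    """
    n = hard_1 + soft_1 + hard_2 + soft_2
    row_1, row_2 = hard_1 + soft_1, hard_2 + soft_2
    col_hard, col_soft = hard_1 + hard_2, soft_1 + soft_2
    denom = row_1 * row_2 * col_hard * col_soft
    if denom == 0:
        raise ValueError("every row and column must contain at least one sweep")
    chi2 = n * (hard_1 * soft_2 - soft_1 * hard_2) ** 2 / denom
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: population 1 has 30 hard / 10 soft, population 2 has 15 hard / 25 soft.
chi2, p = chi2_2x2(30, 10, 15, 25)
print(f"chi2 = {chi2:.3f}, p = {p:.4g}")
```

One caveat worth keeping in mind is that the test treats each sweep region as an independent observation, so adjacent windows belonging to the same sweep should probably be merged before counting.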

-Zachary
