szpiech / lassip

LASSI-Plus: A program to calculate haplotype frequency spectrum statistics
GNU General Public License v3.0
6 stars 2 forks source link

curious results with m value #2

Closed cistarsa closed 3 years ago

cistarsa commented 3 years ago

hello and thanks for running this nifty program! I'm trying to compare these results with RAiSD (hard selective sweep) on two divergent but related populations (n=25 & 28) but for one of these I'm getting a huge amount of "1" value haplotypes.

##running: 
for j in `ls ./LI_vcfs/edit*vcf`; do ./lassip --vcf "$j" --unphased --calc-spec --hapstats --lassi --winsize 201 --winstep 21 --out "$j"_edit --pop LI_bam_25.list; done

./lassip --spectra ./LI_vcfs/edit*LI*spectra.gz --lassi --out LI_spectra_out

## resulting:
pop 1 (n=25) at 583114 windows
1m = 876
2m = 312
3m = 68
4m = 1

## compared to
pop 2 (n = 28) at 560481 windows
1m = 450489
2m = 219
3m = 40
4m = 1
szpiech commented 3 years ago

Wow, interesting. I haven't had a chance to explore the behavior of m generally (i.e., not just in regions with high support for a distorted HFS, very large T). I might guess that is it picking up something going on with the demographic history in pop 2. Does it differ substantially from pop 1 post-dvergence?

cistarsa commented 3 years ago

Indeed it does! These populations have pretty different Ne, heterozygosity, and diverged ~300 generations ago. Perhaps these are artifacts from a bottleneck or founder? Although it could be biologically consistent I guess I'm still surprised to see so so many more in pop 2. Maybe ill run saltilassi

cistarsa commented 3 years ago

Also wondering how it handles missing data?

szpiech commented 3 years ago

At the moment, the software will drop any haplotype that has missing data for that window's HFS computation. With a large sample size, this shouldn't matter much, but if you have a smaller sample size with lots of missing data it may be worth filtering those loci (or imputing) until I come up with a better way to handle it.

One thing to keep in mind is that as the window size grows larger (when there is missing data present) the more likely you are to encounter at least one missing allele on each haplotype in that window, and then the computation will fail in that window for lack of data.

cistarsa commented 3 years ago

good to keep in mind, thank you for the quick and helpful feedback!