lz398 / distAngsd

8 stars 1 forks source link

interpretation of estimated t > 1 and t = 9.99999, and relationship between effective sites and t #8

Open TeresaPegan opened 1 year ago

TeresaPegan commented 1 year ago

Hello!

I am hoping to compare genetic distance and geographic distance with pairs of individuals within a population. In this sense I would like to use distAngsd in the same way that people sometimes use ngsDist, but hopefully without the biases ngsDist produces (as demonstrated in your paper!) However, using distAngsd in this manner I have a couple of unexpected patterns in my results that I am not sure how to interpret, but that seem to reflect something wrong happening. Thanks very much for any thoughts about interpreting these patterns! I am excited about using this program on my data, once I have made sure I am using it and interpreting it properly.

As a brief note about my data, these are pairs of individual birds (812 pairs total) from a single species that I expect to be fairly closely related. They are sequenced at about 3x on average (although there is a distribution of variation from something like 1x to 10x). Sequences were aligned to a reference genome (not the same species, but a relative) and I am using data from one fairly large chromosome (62 mb) for distAngsd. I use ANGSD to create separate glf files consisting of only SNPs in each pair of individuals. You will see my range of effective sites numbers in a plot below.

Here are my unexpected observations:

1) a bunch of my pairs of individuals end up with t = 9.99999. I suspect this reflects some kind of error or other problem, because the last time I got this result was when I mistakenly tried to run the program on a txt file with the setting -inbin 1. In this case, do you have any idea what might be causing the problem? Under what conditions does the program return t = 9.99999?

2) a large number of my pairs of individuals end up with t > 1, even those that aren't at t = 9.99999. I was surprised to see this as I had expected genetic distances to be much less than 1 in general. I am attaching three histograms here: one of all genetic distance values I obtained, one showing all values other than those at 9.99999, and one showing all values less than 1. These demonstrate that a large proportion of my pairs show t > 1, and that among pairs with t < 1 I have many pairs along the entire distribution of possible values. image

3) there is a relationship between the "Effective number of sites" and the resulting t value. This is shown in the plot below. It is also clear that pairs with relatively low effective sites are more likely to end up with t = 9.99999. What is the meaning of "Effective number of sites?" I assume this number comes from the number of SNPs provided in the GLF file? This value ranges from about 800 thousand to about 2 million among my pairs. image

Here is code I used to create these results. It is run in a SLURM array job with one job for each pair of individuals. The $BAMS file contains the paths to bams of the two individuals in the pair. I am using ANGSD 0.937.

$ANGSDIR/angsd -b $BAMS -out $OUT -doMaf 1 -doGlf 4 -doMajorMinor 1 -anc $ANC \
-minMapQ 30 -minQ 30 -doPost 1 -GL 2 -dosnpstat 1 -nThreads 4 \
-doHWE 1 -SNP_pval 0.05 $CHRCMD 

GLF=$OUT".glf.gz"
INGLF=$OUT".clean.glf.gz"

zcat $GLF | cut --complement -f1,2 | gzip > $INGLF

$DISTDIR/distAngsd -inglf $INGLF -o $DISTLOG -inbin 0 -outbin 0 > $DISTOUT

This produces one output file for each of my 812 pairs, which I parsed to pull out the effective sites and t values.