Estimates of t depend on genotype likelihood input format

t-vane commented 1 year ago

Hi,

I would like to use 'distAngsd-geno' for a large isolation-by-distance analysis in lemurs. I am primarily working with genotype likelihoods estimated with ANGSD. Because I was not sure what input file distAngsd supports, I ran it with all four possible outputs that ANGSD can provide: (1) binary all 10 log genotype likelihood, (2) beagle genotype likelihood format, (3) beagle binary, and (4) textoutput of all 10 log genotype likelihoods. Here is my problem: distAngsd runs on all four formats without error but outputs different distances t for each. What input format should I use, and where do these differences originate?

Cheers and thank you for your help, Tobi

TeresaPegan commented 1 year ago

Hi Tobi, I am a distAngsd user rather than a developer, but hopefully this will help! In my experience, distAngsd will always produce an estimate of t, even if the input format does not match what it expects. For example, if you tell distAngsd that your input is not binary and it actually is in binary format, you will get a weird value of t as a result, but no error or warning. Similarly, I recently accidentally tried to run distAngsd on an input file with 3 columns of genotype likelihoods per individual instead of the program's expected 10 columns per individual, and it once again produced a nonsensical estimate of t instead of a warning or error.

Surprisingly (given the program's name), I don't think there is any raw output from ANGSD that will actually work with distAngsd. Instead, you have to make sure that your input matches the format of the test files that come with distAngsd. This means, at the very least, that: -- there can be no columns with chromosome and position -- you need all 10 genotype likelihoods for each individual -- the values produced by ANGSD are log likelihoods and distAngsd expects un-logged likelihoods, so you have to exponentiate the values ANGSD gives you

Here is code I have used to convert some output from ANGSD into something that (I think) distAngsd can use properly. The piped section of the code removes the chr and pos columns, exponentiates (un-logs) the likelihoods, converts them into non-scientific notation, and makes sure the output is tab delimited. I always get a lot of warnings from the exponentiation code for values being "out of range," which comes from values where the resulting number is extremely small -- I believe those end up just having a genotype likelihood of 0.

Good luck! -Teresa


$ANGSDIR/angsd -b $BAMS -out $OUT -doMaf 1 -doGlf 4 -doMajorMinor 1 -anc $ANC \
-minMapQ 30 -minQ 30 -doPost 1 -GL 2 -dosnpstat 1 -nThreads 4 -doCounts 1 \
-doHWE 1 -sites $SITES

GLF=$OUT".glf.gz"
INGLF=$OUT".clean.glf.gz"

zcat $GLF | cut --complement -f1,2 | awk ' OFS = "\t", OFMT="%f" {

                for(i = 1; i <= NF; i++)
                        s = s OFS exp($i)
                print s
                s = ""
}' | awk '{for (i=1; i<=NF; i++) printf "%.6f%s", $i, (i==NF?"\n":" ")}' | awk 'OFS="\t" {$1=$1}1 {print $0}'  > $INGLF

TeresaPegan commented 1 year ago

p.s. I am also trying to use distAngsd for isolation by distance, and it's a bit challenging to deploy it at a large scale for many pairs of individuals! I recommend perusing the other issues on this github page to see some discussion about this (#4, #6)

lz398 / distAngsd

Estimates of t depend on genotype likelihood input format #11