jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

Calculating FST with Dosage and Missing Data #54

Closed alexkrohn closed 2 years ago

alexkrohn commented 2 years ago

Hi @jgx65,

I've been using hierfstat to calculate basic.stats and genetic distances, including WC Fst from VCFs for a while now. I'm using RADseq datasets. Given that the datasets often have a lot of missing data, I calculate the statistics at a variety of missing data thresholds to see how missing data affects the statistics.

Some of my datasets contain related individuals, so I would like to try the dosage based Fst calculation from your 2017 paper, especially pairwise individual or population-level Fsts. However, I run into problems with loci that have missing data. Is there any way around this?

In general my workflow looks like this:

my.vcf <- vcfR::read.vcfR("path/to/vcf.vcf")
genind <- vcfR2genind(my.vcf)
pop.names <- vector.of.populations.that.correspond.to.individuals

df <- genind2hierfstat(genind, pop = pop.names)

basic.stats(df)
genet.dist(df)

This works even with up to 70% missing data (although that is obviously not ideal, and the inferences are not very believable).

However, this does not work:

dos <- biall2dos(df, diploid = TRUE)

# Error in FUN(X[[i]], ...) : 
#  only defined on a data frame with all numeric-alike variables

Is there an easy way to convert genind/vcf or hierstat dfs to dos? I'm happy to send over smaller datasets for you to examine more closely.

jgx65 commented 2 years ago

Hi @alexkrohn

You can read VCF directly in hierfstat using the read.VCF function. The bed object created can be used directly with beta.dosage to obtain kinship and inbreeding coefficients, or fs.dosage to obtain FST, pairwise FSTs, population specific FSTs etc...

dat<-read.VCF("my.vcf.file.vcf[.gz]")
kin.dat<-beta.dosage(dat)
pop=vector.of.populations.that.correspond.to.individuals
fs.dat<-fs.dosage(dat,pop=pop)
plot(fs.dat)
jgx65 commented 2 years ago

Still, biall2dos should work, try dos<-biall2dos(df[,-1]) (to exclude the first column containing the population identifier in df)

alexkrohn commented 2 years ago

Importing with read.VCF worked! Thank you! Something must've gotten messed up in all the conversions.

biall2dos(df[,-1]) still doesn't work, giving the same error about non-numeric values.

You've resolved my issue, so I'll close this comment. Thanks for your help.

....While I have you here, though (poor Github form, I'm sure), I was wondering what you think about the best way to get Ho and Hs for a population. Do you think it's sufficient to take the mean Ho and Hs across all loci at a given population?

E.g.:

stats <- basic.stats(df)
ho <- apply(stats$Ho, 2, function(x){mean(x,na.rm=TRUE)})
hs <- apply(stats$Hs, 2, function(x){mean(x,na.rm=TRUE)})

Or is there a better way? I've found that doing this for Fis doesn't work well.

For example:

fis1 <- apply(stats$Fis, 2, function(x){mean(x,na.rm=TRUE)})
fis2 <- 1 - (ho/hs)

fs.dat <- fs.dosage(dat, pop = pop
fis3 <- fs.dat$Fs[1,]

Yields three slightly different values for Fis.

Any suggestions?

jgx65 commented 2 years ago

There are functions Hs and Ho in hierfstat, they do just that:

 library(hierfstat)
data(gtrunchier)
Hs(gtrunchier[,-2])
         1          2          3          4          5 
0.56571667 0.03776667 0.38731667 0.26976667 0.43050000 
         6 
0.36266667 
Ho(gtrunchier[,-2])
        1         2         3         4         5         6 
0.0859000 0.0000000 0.0243000 0.0463000 0.0669500 0.1958333 

This should be identical/ almost identical to your ho and hs. As for Fis, fis1 is an average of ratios, with poor properties. fis2 and fis3 should be very close, fis3 is the average of the individuals inbreeding coefficients in each population.

Hope this helps.

alexkrohn commented 2 years ago

This is perfect. Thanks very much for your advice.