jgx65 / hierfstat

the hierfstat package
24 stars 14 forks source link

Treatment of populations of 1 #68

Open sbreitbart opened 7 months ago

sbreitbart commented 7 months ago

Hello,

I've computed Fst among groups with both hierfstat's varcomp.glob() and vcftools --weir-fst-pop. Each give different answers even though both use the WC formula. I'm wondering if the reason behind the discrepancy is the way each software treats populations with 1 individual (no within-population variance). Does hierfstat remove them from the dataset?

Thank you, Sophie

jgx65 commented 7 months ago

Hi Sophie, could you provide a worked-out example?

Also, if you are using a one-level hierarchy as it seems, I suggest you use the wc function rather than varcomp.glob.

Last, we are now advocating using the Weir-Goudet estimator of FST, using the fs.dosage function, see Weir & Goudet 2017 and Goudet & Weir 2023

Best wishes

sbreitbart commented 7 months ago

Hi there,

Thank you kindly for your prompt response! Here's my sampling scheme: individuals within populations within urban/rural groups within the entire group of all sampled individuals.

I start off creating a genind from my vcf with vcfR2genind.

my_genind

/// GENIND OBJECT /////////

 // 256 individuals; 972 loci; 1,944 alleles; size: 2.5 Mb

 // Basic content
   @tab:  256 x 1944 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-2)
   @loc.fac: locus factor for the 1944 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: adegenet::df2genind(X = t(x), sep = sep)

 // Optional content
   @pop: population of each individual (group size range: 1-5)
   @strata: a data frame with 3 columns ( sample, population, urban_rural )

The strata (note that some populations contain only one individual)

head(my_genind$strata)
           sample population urban_rural
1 1.2_1A10_sorted         67       Urban
2 1.2_1A11_sorted         79       Urban
3  1.2_1A1_sorted         40       Urban
4  1.2_1A2_sorted         47       Rural
5  1.2_1A3_sorted          2       Urban
6  1.2_1A4_sorted         12       Urban

# create hierfstat-friendly genind
formatted_genind <- genind2hierfstat(my_genind)

head(formatted_genind[, 1:3])
                pop X102.87.. X107.21..
1.2_1A10_sorted  67         0         0
1.2_1A11_sorted  79         0         0
1.2_1A1_sorted   40         0         0
1.2_1A2_sorted   47         0         0
1.2_1A3_sorted    2         0         0
1.2_1A4_sorted   12         0         0

# estimate var comps and hierarchical fsts
fsts <- varcomp.glob(levels = data.frame(pops$urban_rural,
                                 pops$population),
             loci = formatted_genind[,-1],
             diploid = TRUE)

The fsts$F table looks like this when exported:

Level Urban/Rural Group Sampling Site Individual
Total 0.0005 0.0798 0.1637
Urban/Rural Group - 0.0794 0.1633
Sampling Site - - 0.0912

I then perform this process with vcfs containing individuals from just urban and rural groups. For example, this is for the urban group:

fsts_urb <- varcomp.glob(levels = data.frame(pops_urb$population),
             loci = formatted_genind_urb[,-1],
             diploid = TRUE)

Then, I compare them to my results from vcftools' --weir-fst-comp and see they're pretty different, but I don't know why. If anything, I'd think the hierfstat results make more sense than the vcftools' results. Urban: F(sampling site/total): vcftools = 0.079; hierfstats = -0.003. Rural: F(sampling site/total): vcftools = 0.049; hierfstats = 0.005.

jgx65 commented 7 months ago

Thanks Sophie,

How many rural and urban sites do you have? From your output, F[sampling-site-UR group] is 0.0794, very close to what vcftools is giving for sampling-site-Total Urban, and not far from sampling site-total Rural, whereas hierfstat gives very different results.Your syntax looks correct, but I'm not sure what formatted_genind_urb contains, nor pops_urb$population. Could the indices have been mixed up?

perhaps you could try out the following?

formatted_genind_urb<-formatted_genind[my_genind$strata$urban_rural=="Urban",]
wc(formatted_genind_urb)
formatted_genind_rur<-formatted_genind[my_genind$strata$urban_rural=="Rural",]
wc(formatted_genind_rur)
sbreitbart commented 7 months ago

Hi Jerome,

Hmm- perhaps I did make an error as you suggested. Regardless, since my last message, my team and I have decided to seek a way to calculate Fst using Bhatia's/Hudson's estimator. Do you have any advice for doing this with hierfstat or another R package? Preferably using a vcf as input? Also, I started troubleshooting fs.dosage() but was unsure how to convert my vcf into the right input matrix format.

Thanks again for your continued help!

jgx65 commented 7 months ago

to read a vcf in hierfstat, use the function read.VCF. The object you obtained is a bed object from package gaston, and you can use it directly with fs.dosage , you need to pass the population identifier of the different individuals as a second argument to the function

jgx65 commented 7 months ago

to read a vcf in hierfstat, use the function read.VCF. The object you obtained is a bed object from package gaston, and you can use it directly with fs.dosage , you need to pass the population identifier of the different individuals as a second argument to the function

sbreitbart commented 7 months ago

Hi Jerome- I was able to use the fs.dosage() command successfully for my individual sampling sites- thank you so much!

However, to confirm, I cannot calculate genetic differentiation among the two groups at the highest level (urban and rural) while accounting for the substructure, correct? Again, my data contains individuals within sampling sites within urban/rural groups, like this, and I'm searching for a way to calculate hierarchical FST with a method that accounts for differences in sampling size among groups.

> head(pops)
           sample sampling_site urban_rural_group
1 1.2_1A10_sorted         67       Urban
2 1.2_1A11_sorted         79       Urban
3  1.2_1A1_sorted         40       Urban
4  1.2_1A2_sorted         47       Rural
5  1.2_1A3_sorted          2       Urban
6  1.2_1A4_sorted         12       Urban
jgx65 commented 7 months ago

A hierarchical version of fs.dosage has not been implemented yet, (I am not sure what extra information it would bring, but am willing to reconsider my position). From the analysis with varcomp.glob you provide above it looks like there is almost no differences between urban and rural sites?

sbreitbart commented 7 months ago

Yes, it does look like there is almost no difference between urban/rural groups with the varcomp.glob results; however, my concern is that this function still uses Weir-Cockerham estimator, correct? I need to use an estimator that accounts for differences in sampling sizes. Do you know how I can calculate hierarchical f-statistics using such an estimator with hierfstat?

jgx65 commented 7 months ago

No other ways than varcomp.glob to do hierarchical for the time being. But if you are worried about sample size effect (I don't think it will matter with such a low FST), you could try downsampling your data set to equalize sample sizes. You could do this several time to see if it affects the values of the between habitat FST