Closed DuttaAnik closed 2 years ago
Hi Anik, These warnings that many of your windows have only missing data for some individuals. This is not usually common with 10kb windows, unless you have reduced representation data such as RAD-seq data, in which case you should try increasing your window size to something like 100kb. Let me know if that doesn't solve the problem. Simon
Hi Simon,
Thanks for the reply. I have increased the -m
to 3000
which was actually -m 10
previously. Now, I have got all the cells filled with "NA" for each parameter. I have whole-genome sequencing data comprising ~70 million SNP with biallelic and monomorphic SNPs. I filtered the dataset at a 50% genotyping call rate in VCFtools and then replaced the missing SNPs coded as .
to N
.
If I use -m 500
then I get values for most of the windows. If I check the output file, then I see that there are very few SNPs in that window for estimating the statistics. If I increase the window, then would it solve the problem or affect the estimates?
I have uploaded a snapshot of my output file.
Can you please suggest something here?
The issue is not -m
it's -w
which is the size of the window relative to scaffold contig coordiantes. For example if you set -w 10000
then your first window will be from positions 1 to 10,000 on the scaffold. In your case, there are only 12 sites with genotype data within those 10,000 positions on the scaffold. If you increased to say -w 50000
, you would have 153 sites of data to work with in the first window (12+31+27+32+51), so this would give you pi values, but only if you set the minimum low enough, like -m 100
.
An alternative approach for when you have very patchy data is to use --windType sites
. This way when you set -w 10000
it will use the first 10,000 sites with data for the first window and so on, but note that this will mean your windows will have different lengths on the scaffold. I hope this clarifies things?
Hi Simon, yes it is clear now. Then, I will proceed with a longer window size. Just a query: as I see it is advisable to use a VCF file with monomorphic and biallelic SNPs without any MAF filter for estimating Pi, does the same apply when estimating Fst and Dxy? Or these two parameters are affected by the presence of low frequency, monomorphic variants?
For dxy, you need the invariants to be present in the vcf, otherwise the values will be far too large. The denominator of dxy is the number of variant and invariant sites present in the window, and my scripts do not include invariant sites that are not included as lines in the vcf, because my scripts cannot "see" these invariants. The scripts do not blindly assume that sites not present in the window are invariant (see a paper on why this is a sensible approach here). Looking back at your data, perhaps you only included variants, and this is why you have so few sites per window - as I see you pi values are very large? By contrast, Fst estimates should not be significantly impacted by excluding invariant sites, since this is a relative measure. Also, dxy measures should still be comparable to one-another, but their absolute value is meaningless without the invariants included.
The MAF filter is much less important. You could choose to exclude variants with very low MAF if you think these are likely errors, but such sites should also have minimal influence on the numerator of pi and dxy, so you may as well include them in my opinion.
Thanks a lot, Simon, for your elaborate answer. It makes sense now.
Great. You're welcome!
Hi Simon,
I am running the
popgenWindows.py
on my.geno
file with the following code. But I am getting some errors as followings:python /gxfs_home/cau/suaph275/genomics_general/popgenWindows.py -g Chilense.geno -o Chilense.Fst.Dxy.pi.csv.gz -f phased -w 10000 -m 100 -s 10000 -T 16 -p P_2931 -p P_2747 -p P_2932 -p P_1963 -p P_1958 -p P_4330 -p P_3111 -p P_4107 -p P_4117 --popsFile Pop.info.txt --writeFailedWindows
I first installed python 3 as there was some problem with the previous python version and then installed numpy as per the requirement. Previously, I was not getting any output as
-m
was too high (-m 5000
). Now I have reduced it to 100 but still getting these messages. Can you please suggest how can I avoid these messages?Thank you. Regards Anik