Open imengyuan opened 3 years ago
Hi, These issues are probably related to missing genotypes in your input file.
Missing data can take two forms. First, you could have missing genotypes in certain individuals, encoded as N/N
. Second, you could have entire sites absent from the .geno file.
Missing data in itself is not a bad thing. In fact it is often good to use stringent filters and only keep only high confidence genotypes. But this comes at a cost in the amount of information available to do the calculations.
The main way to control how to deal with missing data is with the -m
(--minSites
) option. Any window with less than this number of sites present in the geno file will simply be ignored. You can see the number of sites that were present for each window in column 5 of the output file. If you add the --writeFailedWindows
option, these windows with too few sites will be included in the output with all result values set to nan
. In your example, the two failed windows had zero sites, so there was simply no information available for those genes. It could be that your initial filters on your vcf file were too stringent, or that read mapping was very poor in these regions. I recommend having a look at these regions in the bam files (e.g. using IGV)
Sometimes the geno file will contain sites where many (or all) individuals have missing genotypes. In other words, the site is present in the input file, but most or all of the genotype information at that site are absent (N/N
). Sites like this are not a problem if they are rare, because the script will use any information that is present, even if only present in two individuals. However, if sites like this are common in the input geno file, then the --minSites
cutoff is not very useful. For example, a site that is present in the input file but has missing data in all individuals will be counted as contributing to the window, but will actually not provide any information. It is therefore recommended to at least filter out sites without any good genotypes before running popgenWindows.py.
Finally, the warnings you are getting probably relate to cases in which one or more of the individuals being analysed has no data at all for the given window, such that values like its genetic distance to other individuals cannot be computed, and numpy returns a warning. The script will then simply ignore this individual when computing statistics, so this is not a major cause for concern. If all individuals in the file (or in a population) have missing data at all sites in the window, then the script will return nan
for all statistics for that window. Getting many warnings like this may indicate that you should reconsider your filtering of the input file and your --minSites
value.
Hi, imengyuan. The same problem (nan) appeared in my result file. How did you solve it?
Hi Simon, thanks for the useful scripts! I'm having some issues running
popgenWindows.py
with predefined windows, out of the ~43k genes in the.bed
file, the output only writes 19 results:My command is
python3 popgenWindows.py --windType predefined --windCoords genes.bed -g all.geno.gz -o pi_genes.csv.gz -f phased -T 10 -p TX -p NC --popsFile popsample.txt
The warning messages in the output are:
Some of the results also have empty values:
Adding
--writeFailedWindows
to the command, I have lots of this:If I run the same command using a different annotation file with more conserved genes (.bed), the output gives the same warning messages as above but generates more results. I didn't see any obvious differences between the two
.bed
files other than containing different genes.I wonder if you have suggestions to work around this issue of having too few results? Or it's my
.bed
file causing the issue? Thanks.