simonhmartin / genomics_general

General tools for genomic analyses.
339 stars 93 forks source link

popgenWindows.py most predefined windows don't have results RuntimeWarning #54

Open imengyuan opened 3 years ago

imengyuan commented 3 years ago

Hi Simon, thanks for the useful scripts! I'm having some issues running popgenWindows.py with predefined windows, out of the ~43k genes in the .bed file, the output only writes 19 results:

42983 windows queued, 42983 results received, 16 results written.
Waiting for all threads to finish
42994 windows were tested.
19 results were written.
Done.

My command is python3 popgenWindows.py --windType predefined --windCoords genes.bed -g all.geno.gz -o pi_genes.csv.gz -f phased -T 10 -p TX -p NC --popsFile popsample.txt

The warning messages in the output are:

Starting 10 worker threads
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/ohta1/meng.yuan/apps/genomics_general/genomics.py:973: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/lib64/python3.6/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/usr/local/lib64/python3.6/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Some of the results also have empty values:

scaffold,start,end,mid,sites,pi_TX,pi_NC,dxy_TX_NC,Fst_TX_NC
10044,42974,45965,44487,2860,0.0035,0.0037,0.0056,0.2104
10045,33515,35415,34475,1805,nan,nan,nan,nan

Adding --writeFailedWindows to the command, I have lots of this:

scaffold,start,end,mid,sites,pi_TX,pi_NC,dxy_TX_NC,Fst_TX_NC
SCNBKXS_15939-HRSCAF-18845,5965,7726,nan,0,nan,nan,nan,nan
SCNBKXS_15939-HRSCAF-18845,7749,10067,nan,0,nan,nan,nan,nan

If I run the same command using a different annotation file with more conserved genes (.bed), the output gives the same warning messages as above but generates more results. I didn't see any obvious differences between the two .bed files other than containing different genes.

2032 windows were tested.
1467 results were written.

I wonder if you have suggestions to work around this issue of having too few results? Or it's my .bed file causing the issue? Thanks.

simonhmartin commented 3 years ago

Hi, These issues are probably related to missing genotypes in your input file.

Missing data can take two forms. First, you could have missing genotypes in certain individuals, encoded as N/N. Second, you could have entire sites absent from the .geno file.

Missing data in itself is not a bad thing. In fact it is often good to use stringent filters and only keep only high confidence genotypes. But this comes at a cost in the amount of information available to do the calculations.

The main way to control how to deal with missing data is with the -m (--minSites) option. Any window with less than this number of sites present in the geno file will simply be ignored. You can see the number of sites that were present for each window in column 5 of the output file. If you add the --writeFailedWindows option, these windows with too few sites will be included in the output with all result values set to nan. In your example, the two failed windows had zero sites, so there was simply no information available for those genes. It could be that your initial filters on your vcf file were too stringent, or that read mapping was very poor in these regions. I recommend having a look at these regions in the bam files (e.g. using IGV)

Sometimes the geno file will contain sites where many (or all) individuals have missing genotypes. In other words, the site is present in the input file, but most or all of the genotype information at that site are absent (N/N). Sites like this are not a problem if they are rare, because the script will use any information that is present, even if only present in two individuals. However, if sites like this are common in the input geno file, then the --minSites cutoff is not very useful. For example, a site that is present in the input file but has missing data in all individuals will be counted as contributing to the window, but will actually not provide any information. It is therefore recommended to at least filter out sites without any good genotypes before running popgenWindows.py.

Finally, the warnings you are getting probably relate to cases in which one or more of the individuals being analysed has no data at all for the given window, such that values like its genetic distance to other individuals cannot be computed, and numpy returns a warning. The script will then simply ignore this individual when computing statistics, so this is not a major cause for concern. If all individuals in the file (or in a population) have missing data at all sites in the window, then the script will return nan for all statistics for that window. Getting many warnings like this may indicate that you should reconsider your filtering of the input file and your --minSites value.

yangwukaidi commented 1 year ago

Hi, imengyuan. The same problem (nan) appeared in my result file. How did you solve it?