simonhmartin / genomics_general

General tools for genomic analyses.
343 stars 93 forks source link

Error message in popgenWindows.py #75

Closed DuttaAnik closed 2 years ago

DuttaAnik commented 2 years ago

Hi Simon,

I am running the popgenWindows.py on my .geno file with the following code. But I am getting some errors as followings: python /gxfs_home/cau/suaph275/genomics_general/popgenWindows.py -g Chilense.geno -o Chilense.Fst.Dxy.pi.csv.gz -f phased -w 10000 -m 100 -s 10000 -T 16 -p P_2931 -p P_2747 -p P_2932 -p P_1963 -p P_1958 -p P_4330 -p P_3111 -p P_4107 -p P_4117 --popsFile Pop.info.txt --writeFailedWindows

Starting 16 worker threads

/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

272 windows queued, 259 results received, 259 results written.
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

540 windows queued, 524 results received, 524 results written.
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

1243 windows queued, 1233 results received, 1233 results written.
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/gxfs_home/cau/suaph275/anaconda3/envs/py3/lib/python3.10/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t

1522 windows queued, 1507 results received, 1498 results written.
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t
/gxfs_home/cau/suaph275/genomics_general/genomics.py:979: RuntimeWarning: invalid value encountered in double_scalars
  output["Fst_" + pops[x] + "_" + pops[y]] = output["Fst_" + pops[y] + "_" + pops[x]] = 1 - pi_s/pi_t

I first installed python 3 as there was some problem with the previous python version and then installed numpy as per the requirement. Previously, I was not getting any output as -m was too high (-m 5000). Now I have reduced it to 100 but still getting these messages. Can you please suggest how can I avoid these messages?

Thank you. Regards Anik

simonhmartin commented 2 years ago

Hi Anik, These warnings that many of your windows have only missing data for some individuals. This is not usually common with 10kb windows, unless you have reduced representation data such as RAD-seq data, in which case you should try increasing your window size to something like 100kb. Let me know if that doesn't solve the problem. Simon

DuttaAnik commented 2 years ago

Hi Simon, Thanks for the reply. I have increased the -m to 3000 which was actually -m 10 previously. Now, I have got all the cells filled with "NA" for each parameter. I have whole-genome sequencing data comprising ~70 million SNP with biallelic and monomorphic SNPs. I filtered the dataset at a 50% genotyping call rate in VCFtools and then replaced the missing SNPs coded as . to N. If I use -m 500 then I get values for most of the windows. If I check the output file, then I see that there are very few SNPs in that window for estimating the statistics. If I increase the window, then would it solve the problem or affect the estimates? I have uploaded a snapshot of my output file. Can you please suggest something here?

Screenshot 2022-05-10 at 21 49 51
simonhmartin commented 2 years ago

The issue is not -m it's -w which is the size of the window relative to scaffold contig coordiantes. For example if you set -w 10000 then your first window will be from positions 1 to 10,000 on the scaffold. In your case, there are only 12 sites with genotype data within those 10,000 positions on the scaffold. If you increased to say -w 50000, you would have 153 sites of data to work with in the first window (12+31+27+32+51), so this would give you pi values, but only if you set the minimum low enough, like -m 100.

An alternative approach for when you have very patchy data is to use --windType sites. This way when you set -w 10000 it will use the first 10,000 sites with data for the first window and so on, but note that this will mean your windows will have different lengths on the scaffold. I hope this clarifies things?

DuttaAnik commented 2 years ago

Hi Simon, yes it is clear now. Then, I will proceed with a longer window size. Just a query: as I see it is advisable to use a VCF file with monomorphic and biallelic SNPs without any MAF filter for estimating Pi, does the same apply when estimating Fst and Dxy? Or these two parameters are affected by the presence of low frequency, monomorphic variants?

simonhmartin commented 2 years ago

For dxy, you need the invariants to be present in the vcf, otherwise the values will be far too large. The denominator of dxy is the number of variant and invariant sites present in the window, and my scripts do not include invariant sites that are not included as lines in the vcf, because my scripts cannot "see" these invariants. The scripts do not blindly assume that sites not present in the window are invariant (see a paper on why this is a sensible approach here). Looking back at your data, perhaps you only included variants, and this is why you have so few sites per window - as I see you pi values are very large? By contrast, Fst estimates should not be significantly impacted by excluding invariant sites, since this is a relative measure. Also, dxy measures should still be comparable to one-another, but their absolute value is meaningless without the invariants included.

The MAF filter is much less important. You could choose to exclude variants with very low MAF if you think these are likely errors, but such sites should also have minimal influence on the numerator of pi and dxy, so you may as well include them in my opinion.

DuttaAnik commented 2 years ago

Thanks a lot, Simon, for your elaborate answer. It makes sense now.

simonhmartin commented 2 years ago

Great. You're welcome!