lemieuxl / pyGenClean

Automated genetic data clean up procedure in Python.
GNU General Public License v3.0

MemoryError when reading TPED #27

Closed. jgrundstad closed this issue 5 years ago.

jgrundstad commented 5 years ago

I'm getting the following error when running the duplicated_samples step, and I'm having difficulty tracking down the cause. There are 1,448 samples and 870,358 markers, and I'm running on a machine with 250 GB of RAM:

[2018-12-03 16:10:27 duplicated_samples INFO] Options used:
[2018-12-03 16:10:27 duplicated_samples INFO]   --sample-concordance-threshold 0.97
[2018-12-03 16:10:27 duplicated_samples INFO]   --tfile plink
[2018-12-03 16:10:27 duplicated_samples INFO]   --sample-completion-threshold 0.9
[2018-12-03 16:10:27 duplicated_samples INFO]   --out data_clean_up.2018-12-03_15.48.04/1_duplicated_samples/dup_samples
[2018-12-03 16:10:27 duplicated_samples INFO] Reading TFAM
[2018-12-03 16:10:27 duplicated_samples INFO] Finding duplicated samples
[2018-12-03 16:10:27 duplicated_samples INFO] Creating TFAM for unique samples
[2018-12-03 16:10:27 duplicated_samples INFO] Reading TPED (and creating TPED for unique samples)
Traceback (most recent call last):
  File "/home/grundaj/envs/pyGenClean/bin/run_pyGenClean", line 11, in <module>
    sys.exit(safe_main())
  File "/home/grundaj/envs/pyGenClean/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 3581, in safe_main
    main()
  File "/home/grundaj/envs/pyGenClean/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 196, in main
    options=options,
  File "/home/grundaj/envs/pyGenClean/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 310, in run_duplicated_samples
    duplicated_samples.main(options)
  File "/home/grundaj/envs/pyGenClean/lib/python2.7/site-packages/pyGenClean/DupSamples/duplicated_samples.py", line 99, in main
    args.tfile + ".tped", args.out)
  File "/home/grundaj/envs/pyGenClean/lib/python2.7/site-packages/pyGenClean/DupSamples/duplicated_samples.py", line 983, in processTPED
    tped = np.array(tped)
MemoryError
lemieuxl commented 5 years ago

We've never had a memory issue with this analysis before, but we usually have a few hundred duplicated samples.

How many duplicated samples do you have in the dataset?

Also, which version of numpy are you using? We're running the tool on a legacy installation with numpy 1.11.2. Maybe newer versions of numpy are more restrictive about memory usage...
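
For reference, the installed version can be checked from the shell with:

$ python -c "import numpy; print(numpy.__version__)"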

jgrundstad commented 5 years ago

I'm seeing only 130 duplicated samples. The current numpy version is 1.15.4; I'll try rolling that back and let you know. Thanks for the quick response! Jason

jgrundstad commented 5 years ago

Still no go. Here's my current Python 2.7.13 environment:

cycler==0.10.0
decorator==4.0.9
functools32==3.2.3.post2
ipython==4.1.1
ipython-genutils==0.1.0
Jinja2==2.9.5
MarkupSafe==1.0
matplotlib==2.0.0
numpy==1.11.2
path.py==8.1.2
pexpect==4.0.1
pickleshare==0.6
ptyprocess==0.5.1
pyGenClean==1.8.3
pyparsing==2.2.0
python-dateutil==2.6.0
pytz==2016.10
scikit-learn==0.18.1
scipy==0.19.0
simplegeneric==0.8.1
six==1.10.0
subprocess32==3.2.7
traitlets==4.1.0
lemieuxl commented 5 years ago

I just ran a simulation using 870,358 markers and 130 pairs of duplicated samples (for a total of 260 samples), and it went through just fine on a machine with 47 GB of RAM.

How many times was each sample duplicated?

The tests were made on a Miniconda environment with the following packages.

$ conda list --export
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
cairo=1.12.18=6
cycler=0.10.0=py27_0
dbus=1.10.10=0
drmaa=0.7.6=py27_0
expat=2.1.0=0
fontconfig=2.11.1=6
freetype=2.5.5=1
funcsigs=1.0.2=py27_0
glib=2.43.0=1
gst-plugins-base=1.8.0=0
gstreamer=1.8.0=0
harfbuzz=0.9.39=1
icu=54.1=0
jinja2=2.8=py27_1
jpeg=8d=2
libffi=3.2.1=0
libgcc=5.2.0=0
libgfortran=3.0.0=1
libpng=1.6.22=0
libxcb=1.12=1
libxml2=2.9.2=0
markupsafe=0.23=py27_2
matplotlib=1.5.3=np111py27_1
mock=2.0.0=py27_0
nomkl=1.0=0
nose=1.3.7=py27_1
numpy=1.11.2=py27_nomkl_0
openblas=0.2.14=4
openssl=1.0.2j=0
pandas=0.19.1=np111py27_0
pango=1.39.0=1
pbr=1.10.0=py27_0
pip=9.0.1=py27_0
pixman=0.32.6=0
pycairo=1.10.0=py27_0
pygenclean=1.8.3=py27_0
pyparsing=2.1.4=py27_0
pyplink=1.3.3=py27_0
pyqt=5.6.0=py27_0
python=2.7.12=1
python-dateutil=2.6.0=py27_0
pytz=2016.7=py27_0
qt=5.6.0=1
readline=6.2=2
scikit-learn=0.18.1=np111py27_nomkl_0
scipy=0.18.1=np111py27_nomkl_0
setuptools=27.2.0=py27_0
sip=4.18=py27_0
six=1.10.0=py27_0
sqlite=3.13.0=0
system=5.8=2
tk=8.5.18=0
wheel=0.29.0=py27_0
zlib=1.2.8=3
jgrundstad commented 5 years ago

I've scanned through the allele lengths of the markers and found some upwards of 2,500 characters (the array manifest confirms the longest at 2,515 bp). I've removed all markers with an allele length > 16 observed in the first sample's column. I'm re-running now, and it appears to have processed beyond the previous failure point. Could those long alleles be tripping the MemoryError?
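
For anyone hitting the same wall, a rough sketch of that kind of TPED filter (file names are placeholders, and this variant checks every allele column rather than only the first sample's):

MAX_ALLELE_LEN = 16

# TPED rows are one marker each (chrom, marker, genetic distance, position,
# then one allele per column), so the per-sample TFAM file stays untouched.
with open("plink.tped") as src, open("short.tped", "w") as dst:
    for line in src:
        alleles = line.split()[4:]
        if all(len(allele) <= MAX_ALLELE_LEN for allele in alleles):
            dst.write(line)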

lemieuxl commented 5 years ago

Oh, great catch. Indeed, the length of the alleles impacts the RAM usage of the numpy array. When creating an array of strings, numpy uses a fixed-width dtype, allocating the same number of characters for every value. Hence, a minimum of 2,515 characters is allocated for each value in the array (times two if there is a homozygous call of this allele somewhere).
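
A minimal illustration of the padding (the values here are made up):

import numpy as np

# One long allele forces the fixed-width string dtype onto every cell.
genotypes = [["A C", "A C"], ["G T", "G " + "ACGT" * 625]]
arr = np.array(genotypes)
print(arr.dtype)   # width equals the longest string (2,502 characters here)
print(arr.nbytes)  # all four cells pay that full width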

We're aiming to update pyGenClean so that it is more efficient, and this kind of error should then no longer happen, since we'll be using an additive representation instead of the ACGT calls. I just haven't had time to start the refactor...
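
For a sense of scale, a back-of-envelope sketch of what an additive encoding would cost for the simulated dimensions above (not pyGenClean's actual code):

import numpy as np

n_markers, n_samples = 869831, 260
# One signed byte per genotype: 0, 1 or 2 copies of the minor allele, -1 = missing.
additive = np.full((n_markers, n_samples), -1, dtype=np.int8)
print(additive.nbytes / 2.0**20)  # ~215 MiB, versus ~1 TiB of padded strings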

jgrundstad commented 5 years ago

Ah, I think I see, so:

1,448 samples x 869,831 markers x 2,515 bp x 2 alleles = 6.335 trillion characters

does that jibe? Or should samples not be included as a dimension?

In any case, by removing the longer alleles we've made it through the concordance calculations, but now I've got a TypeError during the "Printing the concordance file" step. Is this likely related? Let me know if you'd like me to start a new issue:

[2018-12-04 09:34:57 pyGenClean INFO] pyGenClean version 1.8.3
[2018-12-04 09:34:57 pyGenClean INFO] Using Plink version v1.07
[2018-12-04 09:34:57 pyGenClean INFO] Reading configuration file [ /home/grundaj/src/pg_pyGenClean/phase1.ini ]
[2018-12-04 09:34:57 pyGenClean INFO] Counting initial number of samples and markers
[2018-12-04 09:34:59 pyGenClean INFO]   - 1,448 samples
[2018-12-04 09:34:59 pyGenClean INFO]   - 869,831 markers
[2018-12-04 09:34:59 pyGenClean INFO] Running 1 duplicated_samples
[2018-12-04 09:34:59 pyGenClean INFO]   - Using short as prefix for input files
[2018-12-04 09:34:59 pyGenClean INFO]   - Results will be in [ data_clean_up.2018-12-04_09.34.57/1_duplicated_samples ]
[2018-12-04 09:34:59 duplicated_samples INFO] Options used:
[2018-12-04 09:34:59 duplicated_samples INFO]   --sample-concordance-threshold 0.97
[2018-12-04 09:34:59 duplicated_samples INFO]   --tfile short
[2018-12-04 09:34:59 duplicated_samples INFO]   --sample-completion-threshold 0.9
[2018-12-04 09:34:59 duplicated_samples INFO]   --out data_clean_up.2018-12-04_09.34.57/1_duplicated_samples/dup_samples
[2018-12-04 09:34:59 duplicated_samples INFO] Reading TFAM
[2018-12-04 09:34:59 duplicated_samples INFO] Finding duplicated samples
[2018-12-04 09:34:59 duplicated_samples INFO] Creating TFAM for unique samples
[2018-12-04 09:34:59 duplicated_samples INFO] Reading TPED (and creating TPED for unique samples)
[2018-12-04 09:38:45 duplicated_samples INFO] Computing the completion and concordance of duplicated samples
[2018-12-04 10:13:14 duplicated_samples INFO] Printing the statistics
[2018-12-04 10:13:14 duplicated_samples INFO] Printing the concordance file
Traceback (most recent call last):
  File "/home/grundaj/envs/pyGenClean2/bin/run_pyGenClean", line 11, in <module>
    sys.exit(safe_main())
  File "/home/grundaj/envs/pyGenClean2/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 3581, in safe_main
    main()
  File "/home/grundaj/envs/pyGenClean2/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 196, in main
    options=options,
  File "/home/grundaj/envs/pyGenClean2/lib/python2.7/site-packages/pyGenClean/run_data_clean_up.py", line 310, in run_duplicated_samples
    duplicated_samples.main(options)
  File "/home/grundaj/envs/pyGenClean2/lib/python2.7/site-packages/pyGenClean/DupSamples/duplicated_samples.py", line 136, in main
    concordance_percentage = printConcordance(concordance, args.out)
  File "/home/grundaj/envs/pyGenClean2/lib/python2.7/site-packages/pyGenClean/DupSamples/duplicated_samples.py", line 654, in printConcordance
    concordance[key][1][none_zero],
TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions
lemieuxl commented 5 years ago

You should multiply by the number of duplicated samples rather than the total number of samples, but yeah, the number still gets big (around a TiB of RAM, I think).

260 samples x 869,831 markers x 2,515 characters x 2 = 1.03 TiB
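
A quick sanity check of that figure in a Python shell:

>>> 260 * 869831 * 2515 * 2  # total characters to allocate
1137564981800
>>> round(_ / 2.0**40, 2)    # expressed in TiB
1.03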

As for the TypeError you're having: I get the exact same issue with newer versions of numpy (e.g. 1.15.4). When downgrading to 1.11.2, the error no longer occurs.
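
A minimal reproduction of that behavior change (a hypothetical snippet, not pyGenClean's actual code):

import numpy as np

target = np.zeros(4)
mask = np.array([True, False, True, True])

# numpy 1.15.4 raises:
#   TypeError: NumPy boolean array indexing assignment requires a 0 or
#   1-dimensional input, input has 2 dimensions
# numpy 1.11.2 accepted a 2-D right-hand side, flattening it to fit the mask.
target[mask] = np.ones((1, 3))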

jgrundstad commented 5 years ago

I was referencing an older virtualenv that still had the updated numpy; reverting to 1.11.2 fixed the issue. Thanks again for taking the time to help!

lemieuxl commented 5 years ago

No problem. Glad it worked!