pblischak / HyDe

Hybridization detection using phylogenetic invariants
http://hybridization-detection.readthedocs.io
GNU General Public License v3.0

OverflowError: value too large to convert to int #8

Open ethering opened 5 years ago

ethering commented 5 years ago

Hi, I've successfully installed HyDe and run it on both the bundled test data and some test data of my own (5 kb sequences taken from the whole-genome data below).

Now I'm trying to input whole-genome sequences of 2.4 Gb each for 24 species over 5 taxa, but I'm getting the following error:

Command line: run_hyde_mp.py -i hyde_samples.phy -m map.txt -o OUT -j 16 -n 24 -t 5 -s 2410758013

Error:

Traceback (most recent call last):
  File "run_hyde_mp.py", line 141, in <module>
    data = hd.HydeData(infile, mapfile, outgroup, nind, ntaxa, nsites, quiet)
  File "phyde/core/data.pyx", line 108, in phyde.core.data.HydeData.__init__
OverflowError: value too large to convert to int

I'm providing 2TB RAM, so presumably it's not a resources problem. I'm wondering if there's a maximum sequence length that I can run here and if you might know what it is (I can then cut my genome sequences into chunks and re-run HyDe on each chunk). Many thanks, Graham

pblischak commented 5 years ago

Hi Graham,

The biggest sequences that we have run through HyDe were about 250 Mb, so maybe that could be a good place to start for splitting things up. If you have chromosome-level information you could also potentially run things chromosome-by-chromosome.
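As a rough illustration of the splitting suggested above, here is a minimal Python sketch that computes site ranges of at most 250 Mb for an alignment of the length given in the original command (`-s 2410758013`). The helper name `chunk_ranges` is hypothetical and not part of HyDe; the sketch only computes coordinates and does not read or write alignment files.

```python
def chunk_ranges(nsites, chunk_size):
    """Return half-open (start, end) site ranges that cover 0..nsites.

    Hypothetical helper for illustration; not part of HyDe.
    """
    if nsites < 0 or chunk_size <= 0:
        raise ValueError("nsites must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, nsites))
        for start in range(0, nsites, chunk_size)
    ]


# Site count taken from the -s argument in the command above.
ranges = chunk_ranges(2_410_758_013, 250_000_000)
for start, end in ranges:
    # Each chunk would be run separately, with -s set to (end - start).
    print(start, end, end - start)
```

Each chunk would then need its own alignment file, with the `-s` value set to that chunk's length.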

One other thing -- in my experience, using too many threads with the run_hyde_mp.py script when you have a big data set can actually cause analyses to run slower. I'm pretty sure the reason for this is that the built-in multiprocessing library for Python creates a copy of the data for each thread. If you're giving the script a lot of threads, then each one needs a copy and things are really slow. I don't remember the exact numbers off the top of my head, but our analysis of the Heliconius data from our paper was only faster when we used 2 threads, and was slower when we used 4.

ethering commented 5 years ago

Hi, just to give some feedback: I tried a few different split sizes (starting with 10 × 250 Mb and working down). In the end, I found that I only needed to split my genome into two pieces to get it to work with HyDe. The first sequence was 1 Gb long and the second 1.4 Gb long, and both ran fine. I'm wondering if HyDe is limited by a 32-bit signed integer (maximum 2,147,483,647), since my original 2,410,758,013 sites exceed that. I also found this bug in bedtools when I was trying to use BED files to split my genomes up. If I have time, I'll create sequences that are 2,147,483,647 and 2,147,483,648 sites long and see if they both run.
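The arithmetic behind the 32-bit hypothesis can be checked with a short Python sketch. It only compares site counts against the signed 32-bit range; it is consistent with the observed behaviour but does not confirm how HyDe stores the value internally. The helper name `fits_in_int32` is hypothetical.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1  # 2,147,483,647


def fits_in_int32(n):
    """Return True if n is representable as a signed 32-bit integer."""
    return INT32_MIN <= n <= INT32_MAX


# Site counts mentioned in this thread: the full alignment, the two
# pieces that ran, and the two boundary values proposed for testing.
for nsites in (2_410_758_013, 1_000_000_000, 1_400_000_000,
               2_147_483_647, 2_147_483_648):
    print(nsites, fits_in_int32(nsites))
```

Under this assumption, the full 2,410,758,013-site alignment falls outside the range while the 1 Gb and 1.4 Gb pieces fall inside it, which matches what was observed.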