kdm9 / kWIP

De novo estimates of genetic relatedness from next-gen sequencing data
https://kwip.readthedocs.org
GNU General Public License v3.0

Runtime error, memory allocation, segmentation fault #16

Closed jregalad closed 8 years ago

jregalad commented 8 years ago

Hi Kevin, I'm trying to run kwip on metagenomic datasets. These are Illumina 2 × 150 bp paired-end reads. I have a total of 90 samples, with an average of 1.5 million read pairs per sample.

Unfortunately, I've only managed to run kwip successfully once. I keep stumbling over two errors:

kwip version 0.2.0-rc1-6-gd109a13 Calculating entropy weighting vector:

By setting -t to only 2, I can get kwip to start logging the loading of hashes, but eventually the run dies with a segfault.

kwip version 0.2.0-rc1-6-gd109a13 Calculating entropy weighting vector:

I'm running kwip on an SGE cluster (GE 6.2u5), on nodes with 1 TB of RAM and 64 cores.

If I run kwip directly on the server, I get the same errors.

kdm9 commented 8 years ago

Hi Julian,

Sorry you've encountered this error. I'm still traveling at the moment, so I may be a little slow in debugging this, but let's have a go anyway.

The first bit of information I need is the set of parameters to load-into-counting.py you used to hash the reads, as these determine how much memory each hash uses. I've only seen the std::bad_alloc issue when there is insufficient memory, which there may well be here: 1024 GB / 64 threads = 16 GB per thread isn't especially generous.
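For a rough sense of scale (this is my understanding of how khmer's counting tables use memory, roughly one byte per counter; the parameter values below are purely illustrative, not necessarily what you used):

```
# approximate memory per sample hash ≈ N_TABLES (-N) × TABLESIZE (-x) bytes
#   e.g. -N 4 -x 4e9  ->  ~16 GB per hash (right at a 16 GB/thread budget)
#   e.g. -N 1 -x 5e9  ->  ~5 GB per hash
```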

The segfault is a bug. If you're comfortable doing so, could you please post a gdb backtrace? Otherwise, I'll do my best to replicate it. Others have reported similar issues, and I've never been able to track them down. I believe it is an issue with handling an edge case where a bad_alloc happens while reading in the khmer files (a part of khmer's code that we use, not within our code).
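In case it helps, here is roughly how I'd capture that backtrace; the hash file names and kwip arguments below are placeholders, so substitute the exact command line that crashes for you:

```
# run the failing kwip command under gdb (arguments here are placeholders)
gdb --args kwip -t 2 -k kernel.txt -d dist.txt sample_*.ct
(gdb) run
# ... wait for the segfault ...
(gdb) bt full    # paste this backtrace into the issue
```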

One possible test would be to re-run load-into-counting.py with the parameters -N 1 -x 5e9 -k 20 and see whether kwip then runs successfully; there should be plenty of RAM to deal with a dataset of this size. If you get a warning or error from khmer about a high false positive rate, ignore it for now by adding the -f flag to force writing.
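Something along these lines is what I have in mind (a rough sketch only; the sample file naming, output names, and thread count are placeholders, so adjust them to your data):

```
# hash each sample into a single large counting table (~5 GB each)
for r1 in sample_*_R1.fastq.gz; do
    r2=${r1/_R1/_R2}
    load-into-counting.py -N 1 -x 5e9 -k 20 -f "${r1%_R1.fastq.gz}.ct" "$r1" "$r2"
done

# compute the kernel and distance matrices across all sample hashes
kwip -t 16 -k samples.kern -d samples.dist sample_*.ct
```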

Thanks, and sorry you've not had better luck with the software.

Kevin

kdm9 commented 8 years ago

Hi Julian,

Sorry to bug you, but I'm back in Canberra now and was wondering if you resolved this issue? We plan on making the publication release of the software fairly soon and hope to fix any remaining bugs this coming fortnight.

Cheers, Kevin

jregalad commented 8 years ago

Hi Kevin,

Sorry for taking so long to answer. After your last set of instructions, kwip ran perfectly, and I didn't get any errors again. I can still recreate the memory issues if that would help you debug the program. Also, I ran kwip on metagenomic data and it returned some very nice results. Let me know if you want to know more about kwip on metagenomes.

Best, Julian

kdm9 commented 8 years ago

Good to hear, closing the issue