snewhouse / glu-genetics

Automatically exported from code.google.com/p/glu-genetics

Memory consumption #2

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Recently I built a binary file for the HapMap CEU data build 23a using
glu version 1.06alpha. When I extract a set of samples (90) and loci (1536)
from this data set, glu can consume 4.7 GB of memory. I have NOT seen this
when I perform the same procedure with the HapMap r22 binary file (it only
takes about 2 GB). Clearly there is a difference between the new and old
HapMap build binary files, but I can't recall which version of glu was used
to build the previous HapMap binary file.

Original issue reported on code.google.com by jlu...@gmail.com on 15 Oct 2008 at 3:48

GoogleCodeExporter commented 9 years ago
The major issue here is that HapMap build 23a has about 125k monomorphic loci
included that are not in build 22. GLU is extremely inefficient in how it
deals with incomplete genotype models, which results in huge memory growth in
datasets like build 23a.

I have several strategies in mind to combat this extreme memory growth, but
they involve some fairly large changes to the encoding and recoding routines.
I'll keep this issue updated with my progress.

Original comment by bioinformed@gmail.com on 17 Oct 2008 at 2:37

GoogleCodeExporter commented 9 years ago
This has long since been fixed in r852:

>>> MAJOR REWRITE OF GENOTYPE MODEL ENCODING <<<

The old way of managing models required that each model object stay the same
object, though alleles and genotypes could be added to it after creation.
The new invariant is that models are fixed after creation, but may be
replaced with new models that expose additional alleles and genotypes.  In
both cases, model updates preserve genotype indexes, such that existing
binary encoded genotype arrays produce the same results before and after
model updates.
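
The new invariant can be illustrated with a minimal sketch. This is not GLU's
actual code: the class name GenotypeModel, the method with_genotype, and the
choice of index 0 for the missing genotype are all assumptions made for
illustration. It only shows the idea of a model that is never mutated, is
replaced by an append-only successor, and therefore keeps existing indexes
valid.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Genotype = Tuple[Optional[str], Optional[str]]


@dataclass(frozen=True)
class GenotypeModel:
    """Hypothetical immutable model: genotype i is always encoded as index i."""

    # Assumption for this sketch: index 0 is reserved for the missing genotype.
    genotypes: Tuple[Genotype, ...] = ((None, None),)

    def index(self, genotype: Genotype) -> int:
        return self.genotypes.index(genotype)

    def with_genotype(self, genotype: Genotype) -> "GenotypeModel":
        """Return a replacement model exposing `genotype`; never mutates self."""
        if genotype in self.genotypes:
            return self
        # Append only, so every existing index keeps its meaning.
        return GenotypeModel(self.genotypes + (genotype,))


# A monomorphic locus starts out with a single observed genotype.
old = GenotypeModel().with_genotype(("A", "A"))

# An array encoded against the old model.
encoded = [old.index(("A", "A")), old.index((None, None))]

# A second allele shows up later: the model is replaced, not modified.
new = old.with_genotype(("A", "G"))

# The same encoded array decodes identically under either model.
decoded_old = [old.genotypes[i] for i in encoded]
decoded_new = [new.genotypes[i] for i in encoded]
```

Because the successor only appends, arrays encoded before the update never
need to be recoded, which is the index-preservation property described above.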

This change is primarily aimed at reducing the memory footprint of GLU when
dealing with datasets with many incomplete (i.e., non-full) models. This was
first reported when Jun used HapMap build 23a, since it included 125k
monomorphic SNPs (incomplete models).  Over 4.7 GB of RAM (2m22s) was needed
to subset the data using GLU 1.0a5 with the old model management strategy,
but now only 1.4 GB of RAM (1m11s) is needed to perform the same
operations.  A pleasant side-effect is that runtime performance is greatly
improved for this and many other operations.  This 3.35x reduction in the
amount of memory required is a substantive start on optimizing GLU for
operation on more modest desktop hardware, though clearly more work is
needed.

Original comment by bioinformed@gmail.com on 14 Oct 2009 at 6:43