The major issue here is that HapMap build 23a has about 125k monomorphic loci included that are not in build 22. GLU is extremely inefficient in how it deals with incomplete genotype models, which results in huge memory growth in datasets like build 23a.
I have several strategies in mind to combat this extreme memory growth, but they involve some fairly large changes to the encoding and recoding routines. I'll keep this issue updated with my progress.
Original comment by bioinformed@gmail.com on 17 Oct 2008 at 2:37
This has long since been fixed in r852:
>>> MAJOR REWRITE OF GENOTYPE MODEL ENCODING <<<
The old way of managing models enforced that model objects were constant, though alleles and genotypes could be added after creation. The new invariant is that models are fixed after creation, but may be replaced with new models that expose additional alleles and genotypes. In both cases, model updates preserve genotype indexes, so that existing binary-encoded genotype arrays produce the same results before and after a model update.
This change is primarily aimed at reducing the memory footprint of GLU when dealing with datasets containing many incomplete (i.e., non-full) models. This was first reported when Jun used HapMap build 23, since it included 125k monomorphic SNPs (incomplete models). Over 4.7 GB of RAM (2m22s) was needed to subset the data using GLU 1.0a5 with the old model management strategy, but now only 1.4 GB of RAM (1m11s) is needed to perform the same operations. A pleasant side effect is that runtime performance is greatly improved for this and many other operations. This 3.35x reduction in the amount of memory required is a substantive start on optimizing GLU for operation on more modest desktop hardware, though clearly more work is needed.
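A minimal sketch of the model-replacement invariant described above, assuming a hypothetical GenotypeModel class and replace_model() helper (illustrative names only, not GLU's actual API): models are immutable once built, and growth happens by constructing a replacement model whose genotype table is a strict superset of the old one's, so binary-encoded arrays never need to be re-encoded.

def _genotype_order(alleles):
    # Enumerate genotypes so that appending an allele only appends new
    # genotypes: for each allele a_j, emit (a_0,a_j) .. (a_j,a_j) after all
    # genotypes over earlier alleles.  This keeps existing indexes stable.
    for j, b in enumerate(alleles):
        for a in alleles[:j + 1]:
            yield (a, b)

class GenotypeModel(object):
    '''Immutable mapping from genotypes (allele pairs) to small integer indexes.'''
    def __init__(self, alleles):
        self.alleles   = tuple(alleles)
        # Index 0 is reserved for the missing genotype.
        self.genotypes = [None] + list(_genotype_order(self.alleles))
        self.index     = dict((g, i) for i, g in enumerate(self.genotypes))

    def encode(self, genotype):
        return self.index[genotype]

    def decode(self, idx):
        return self.genotypes[idx]

def replace_model(old, new_alleles):
    '''Build a replacement model exposing additional alleles while preserving
    the index of every genotype already representable by the old model.'''
    extra = tuple(a for a in new_alleles if a not in old.alleles)
    new   = GenotypeModel(old.alleles + extra)
    # The invariant: every old index maps to the same genotype in the new
    # model, so previously encoded genotype arrays decode identically.
    assert all(new.genotypes[i] == g for i, g in enumerate(old.genotypes))
    return new

For example, a monomorphic SNP can start with a single-allele (incomplete) model and later be replaced by a two-allele model without touching data that was already encoded:

snp  = GenotypeModel(['A'])                 # incomplete model: one allele
data = [snp.encode(('A', 'A')), snp.encode(None)]
snp  = replace_model(snp, ['G'])            # second allele observed later
assert [snp.decode(i) for i in data] == [('A', 'A'), None]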
Original comment by bioinformed@gmail.com on 14 Oct 2009 at 6:43
Original issue reported on code.google.com by jlu...@gmail.com on 15 Oct 2008 at 3:48