Re-work how we handle our metric data model to reduce memory consumption

GoogleCodeExporter commented 9 years ago

Background:
Currently we create the relational metric data model and hold it in
memory, persisting the entire model to CSV only once it has been fully
built. This, however, does not scale to large systems because of its high
memory consumption.

Issue:
Two changes need to be made to keep memory requirements lower.

1) As versions are processed, they should be persisted to file (CSV) 
incrementally, and removed from memory.
2) Because we cannot keep the entire model in memory, we need a way of
streaming or loading versions into memory on demand. One approach would be
to have HistoryMetricData.getVersion(rsn) transparently load a Version
from its CSV file if it is not already in memory. Alternatively, use weak
references to the metric model.
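
A minimal sketch of how the two options in (2) could be combined: versions are loaded on demand and held only through weak references, so the garbage collector may reclaim any version that no caller still uses. The names LazyHistory, Version and csvLoader are hypothetical stand-ins, not the actual JSeat classes, and the CSV parsing itself is assumed to live behind csvLoader.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical stand-in for the real version model.
final class Version {
    final int rsn;
    Version(int rsn) { this.rsn = rsn; }
}

// Hypothetical history that loads versions on demand and holds them weakly.
final class LazyHistory {
    private final Map<Integer, WeakReference<Version>> loaded = new HashMap<>();
    private final IntFunction<Version> csvLoader; // assumed: parses one version from its CSV file
    int loadCount = 0;                            // counts loads, for illustration only

    LazyHistory(IntFunction<Version> csvLoader) {
        this.csvLoader = csvLoader;
    }

    synchronized Version getVersion(int rsn) {
        WeakReference<Version> ref = loaded.get(rsn);
        Version version = (ref == null) ? null : ref.get();
        if (version == null) {
            // Never loaded, or already reclaimed by the garbage collector: load it again.
            version = csvLoader.apply(rsn);
            loadCount++;
            loaded.put(rsn, new WeakReference<>(version));
        }
        return version;
    }
}
```

A version stays available for as long as some caller holds a strong reference to it. If weak references turn out to be cleared too eagerly, java.lang.ref.SoftReference is the usual alternative, since soft references are normally only cleared under memory pressure.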

Original issue reported on code.google.com by jtha...@gmail.com on 19 Aug 2007 at 1:13

GoogleCodeExporter commented 9 years ago

Because of the number of collection copies the current multi-threaded
approach to processing versions has to perform, I have dropped the
ThreadPool and its Executor. I also wanted a simple way to incrementally
persist versions to file as they are processed.

I've changed the multi-threading approach to use shared channels (blocking
queues) and active objects.

VersionFileChannel <----- VersionLoader    ------> VersionChannel
VersionChannel     <----- VersionPersister ------> CSV

The VersionLoader takes version names from one channel and puts processed
versions onto another channel. The VersionPersister takes processed
versions from the version channel and persists them to file.

This approach is both efficient (it can spread the work across multiple
threads for increased performance) and scalable (to a potentially unbounded
number of versions, since only the versions in flight are held in memory).
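
A minimal sketch of the channel arrangement described above, using java.util.concurrent blocking queues. It is not the JSeat implementation: versions are plain strings, "processing" just appends a marker, a StringBuilder stands in for the CSV file, and the DONE sentinel used for shutdown is an assumption of this sketch.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ChannelPipelineSketch {
    // Sentinel compared by identity; tells a consumer that its channel is finished.
    private static final String DONE = new String("DONE");

    static String run(List<String> versionFiles) {
        // Bounded queues: producers block when the consumer falls behind,
        // so only a handful of versions are ever in memory at once.
        BlockingQueue<String> versionFileChannel = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> versionChannel = new ArrayBlockingQueue<>(4);
        StringBuilder csv = new StringBuilder(); // stands in for the CSV file

        // "VersionLoader": takes version names, puts processed versions.
        Thread loader = new Thread(() -> {
            try {
                String name;
                while ((name = versionFileChannel.take()) != DONE) {
                    versionChannel.put(name + ",processed");
                }
                versionChannel.put(DONE); // pass the shutdown signal downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // "VersionPersister": takes processed versions, appends them to the CSV.
        Thread persister = new Thread(() -> {
            try {
                String row;
                while ((row = versionChannel.take()) != DONE) {
                    csv.append(row).append('\n');
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        loader.start();
        persister.start();
        try {
            for (String file : versionFiles) {
                versionFileChannel.put(file);
            }
            versionFileChannel.put(DONE);
            loader.join();
            persister.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing versions", e);
        }
        return csv.toString();
    }
}
```

With a single loader thread the rows come out in input order. Running several loaders against the same channel would need one sentinel per loader and would no longer guarantee ordering.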

The above has already been implemented, but two problems remain.

Problem 1: Post processing
I'm still debating where best to perform post processing now that versions
need to be persisted as they are processed.

Problem 2: Use and access of VersionMetricData
As outlined in the background, there are at least two ways to implement
this. One problem I had not initially thought about was retrieving a
collection of versions from a history object. This is something we do in
many different places. But because we cannot hold an entire model in
memory, we cannot return a collection of versions.

I've taken a brief look into implementing weak references, but don't know
enough about them to know how much work is involved.

Original comment by jtha...@gmail.com on 19 Aug 2007 at 1:45

GoogleCodeExporter commented 9 years ago

I should also add...

I have already implemented the scenario below in light of our persistence
changes and it works fine. You can now retrieve any version from a history
(as you normally would), and the history takes care of loading it from
file.

HistoryMetricData hmd = //
VersionMetricData vmd = hmd.getVersion(3);

Currently there is no in-memory caching of versions. Whenever a version is
requested it is parsed from file transparently by the HistoryMetricData. A
simple cache of the last x versions loaded could be kept in memory and
checked first before loading from file to improve performance, similar to
an LRU algorithm. This can quite easily be added at a later date. (If we
stick with this approach I will add an issue for it so I remember.) For
now, it is noted here.
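
A minimal sketch of such a cache, built on java.util.LinkedHashMap in access order with removeEldestEntry overridden, which is a standard way to get LRU eviction in Java. VersionCache, capacity and loader are hypothetical names; loader stands in for whatever parses a version from its CSV file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical LRU cache of the most recently used versions.
final class VersionCache<V> {
    private final IntFunction<V> loader; // assumed: parses one version from file
    private final Map<Integer, V> recent;
    int fileLoads = 0;                   // counts cache misses, for illustration only

    VersionCache(final int capacity, IntFunction<V> loader) {
        this.loader = loader;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.recent = new LinkedHashMap<Integer, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    synchronized V getVersion(int rsn) {
        V version = recent.get(rsn); // a hit also marks the entry as recently used
        if (version == null) {
            version = loader.apply(rsn);
            fileLoads++;
            recent.put(rsn, version);
        }
        return version;
    }
}
```

The capacity bounds how many versions the cache itself keeps alive, so memory use stays predictable regardless of how many versions the history contains.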

The real problem lies in our use of hmd.getVersions(), which has
proliferated throughout the system. The whole reason we are persisting is
that we cannot return the entire model at once.

One approach may be to implement an iterator that steps through the
versions using the above approach, loading one at a time. However, it
would then be up to the caller to unload versions when they are finished
with them. Alternatively, whenever next() is called, the previous version
is unloaded automatically by the iterator. This behaviour would be
documented so that the user knows that once they call next(), they no
longer have access to the previous version, as it has been freed.
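
A minimal sketch of the iterator idea. It assumes versions are numbered 1..lastRsn and that a loader function parses one version from file; VersionIterator and both parameter names are hypothetical. Note that in Java the iterator cannot forcibly free an object: "unloading" here means the iterator keeps no reference of its own, so the previous version becomes garbage as soon as the caller drops it.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical iterator that loads one version per call to next().
final class VersionIterator<V> implements Iterator<V> {
    private final int lastRsn;           // assumed: versions are numbered 1..lastRsn
    private final IntFunction<V> loader; // assumed: parses one version from file
    private int nextRsn = 1;

    VersionIterator(int lastRsn, IntFunction<V> loader) {
        this.lastRsn = lastRsn;
        this.loader = loader;
    }

    @Override
    public boolean hasNext() {
        return nextRsn <= lastRsn;
    }

    @Override
    public V next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no version with rsn " + nextRsn);
        }
        // The iterator holds no reference to the returned version, so the
        // previous one is collectable once the caller lets go of it.
        return loader.apply(nextRsn++);
    }
}
```

A caller that stores every returned version in a list would defeat the purpose, so the documented contract matters as much as the code.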

I haven't really got a solution for this at the moment, so I'm open to
suggestions.

Original comment by jtha...@gmail.com on 19 Aug 2007 at 1:58

GoogleCodeExporter commented 9 years ago

Testing
---------------
I updated my Hibernate versions file (which normally lists about 50
versions) and copied and pasted its contents a few times to simulate
having more jars (versions), as we don't have anything bigger to test with
(I don't want to touch Eclipse at the moment :). I'm currently testing
with about 200 versions of Hibernate.

JSeat now chews through this quite happily without a bump in heap size,
which is nice. However, I suspect it might be slightly faster with a
slightly larger minimum and maximum heap size, so that the garbage
collector doesn't have to be quite so aggressive. Constant garbage
collection = less time processing.

All of our post processing can be done on a single version at a time,
which is good, with the exception of 'scanAndMarkSurvivors', where we need
to look ahead. I need to re-write this to keep two versions in memory.
This requirement, however, means you really need to wait until persistence
has finished, then re-open two files, post process, and re-serialize with
the updated changes. I suspect the opening and closing of files is what
slows this phase down, along with the fact that it cannot begin until all
versions have been serialized.
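
A minimal sketch of the two-versions-in-memory walk. What scanAndMarkSurvivors actually does to each pair is not shown in this thread, so the pairwise step is left as a caller-supplied callback; PairwiseWalker, loader and step are hypothetical names, versions are assumed to be numbered 1..count, and re-serializing the updated versions is omitted.

```java
import java.util.function.BiConsumer;
import java.util.function.IntFunction;

// Hypothetical helper that visits adjacent versions two at a time.
final class PairwiseWalker {
    static <V> void forEachAdjacentPair(int count,
                                        IntFunction<V> loader,
                                        BiConsumer<V, V> step) {
        if (count < 2) {
            return; // nothing to look ahead to
        }
        V current = loader.apply(1);
        for (int rsn = 2; rsn <= count; rsn++) {
            V next = loader.apply(rsn); // the look-ahead version
            step.accept(current, next); // e.g. mark what survives into next
            current = next;             // the older version is now unreferenced here
        }
    }
}
```

Because the look-ahead version of one step becomes the current version of the next, each version is read from file only once rather than twice.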

I'm changing the post processing (which we used to do each time the data
was loaded, to save space) to be a one-off step when the data is first
serialized. This way, once you have created your dataset, no more
extraction or post processing needs to be done on future runs. Versions
are just loaded straight from file as needed.

Original comment by jtha...@gmail.com on 21 Aug 2007 at 1:35

GoogleCodeExporter commented 9 years ago

Fixed:
See Issue 15 (Persistence framework changes) and issue 16 (Moving to a 
smarter/lighter data loading framework).

Of note is that we currently only support requesting versions
individually; we do not return an entire collection (this usage had to be
removed). At the moment the caller has to wrap getVersion calls in a for
loop to access all versions.

I haven't ruled out the Iterator idea. I'll note it down as future work,
as an addition to the data loading strategy.

Original comment by jtha...@gmail.com on 28 Aug 2007 at 5:28

Changed state: Fixed
Added labels: Type-Defect
Removed labels: Type-Enhancement

rvasa / jseat

Re-work how we handle our metric data model to reduce memory consumption #14