I describe how the current prefix merge policy works based on observations from
ingestion experiments.
A similar behavior was also observed by Sattam.
The observed behavior seems a bit unexpected, so I post the observation here to
help us consider a better merge policy and/or a better LSM index design regarding
merge operations.
The AQL statements used for the experiment are shown at the end of this writing.
The prefix merge policy decides to merge disk components based on the following
conditions:
1. Look at the candidate components for merging in oldest-first order. If one
exists, identify the prefix of the sequence of all such components for which
the sum of their sizes exceeds MaxMergableComponentSize. Schedule a merge of
those components into a new component.
2. If a merge from 1 doesn't happen, see if the set of candidate components
for merging exceeds MaxToleranceComponentCnt. If so, schedule a merge of all of
the current candidates into a new single component.
Also, the prefix merge policy doesn't allow concurrent merge operations for a
single index partition.
In other words, if there is already a scheduled or ongoing merge operation, a new
merge is not scheduled even if the above conditions are met.
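To make the decision logic concrete, here is a minimal Java sketch of the policy as
described above (conditions 1 and 2 plus the single-merge-per-partition restriction).
This is not the actual AsterixDB PrefixMergePolicy code; the class name, method
names, and the way component sizes are passed in are hypothetical, and condition 1
is implemented under one reading of the description (the shortest oldest-first
prefix whose total size exceeds the threshold).

import java.util.List;

// Hypothetical sketch of the prefix merge policy decision described above;
// not the actual AsterixDB implementation.
public class PrefixMergePolicySketch {
    private final long maxMergableComponentSize;
    private final int maxToleranceComponentCnt;
    private boolean mergeScheduledOrOngoing; // at most one merge per index partition

    public PrefixMergePolicySketch(long maxMergableComponentSize, int maxToleranceComponentCnt) {
        this.maxMergableComponentSize = maxMergableComponentSize;
        this.maxToleranceComponentCnt = maxToleranceComponentCnt;
    }

    // Returns how many of the oldest candidate components to merge,
    // or 0 if no merge is scheduled. Sizes are given oldest first.
    public int decideMerge(List<Long> candidateSizesOldestFirst) {
        if (mergeScheduledOrOngoing) {
            return 0; // concurrent merges are not allowed for this partition
        }
        // Condition 1: merge the shortest oldest-first prefix whose total size
        // exceeds MaxMergableComponentSize into a new component.
        long sum = 0;
        for (int i = 0; i < candidateSizesOldestFirst.size(); i++) {
            sum += candidateSizesOldestFirst.get(i);
            if (sum > maxMergableComponentSize) {
                mergeScheduledOrOngoing = true;
                return i + 1;
            }
        }
        // Condition 2: too many candidate components -> merge all of them.
        if (candidateSizesOldestFirst.size() >= maxToleranceComponentCnt) {
            mergeScheduledOrOngoing = true;
            return candidateSizesOldestFirst.size();
        }
        return 0;
    }

    // Called when the scheduled merge finishes, allowing the next one.
    public void mergeFinished() {
        mergeScheduledOrOngoing = false;
    }
}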
Based on this merge policy, the following situation can occur.
Suppose MaxToleranceComponentCnt = 5 and 5 disk components have been flushed to disk.
When the 5th disk component is flushed, the prefix merge policy schedules a merge
operation to merge the 5 components.
While the merge operation is scheduled and running, concurrently ingested records
generate more disk components.
As long as merge operations are not fast enough to keep up with the rate at which
incoming ingested records generate 5 new disk components,
the number of disk components increases over time.
So, the slower merge operations are, the more disk components there will be over
time.
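To see why the count grows, here is a rough, back-of-the-envelope Java simulation.
All numbers are illustrative assumptions (one flush every 10 seconds, merge time
proportional to the number of components merged); only condition 2 of the policy
and the single-merge-per-partition rule are modeled.

// Rough, hypothetical simulation of component accumulation when merging is
// slower than flushing. The flush interval and per-component merge cost
// below are illustrative assumptions, not measured values.
public class MergeLagSimulation {
    public static void main(String[] args) {
        final int flushIntervalSec = 10;      // assumed: one new disk component every 10s
        final int mergeSecPerComponent = 12;  // assumed: merge cost grows with input size
        final int maxToleranceComponentCnt = 5;

        int components = 0;   // disk components currently on disk
        int mergingCnt = 0;   // components consumed by the running merge
        int mergeEndsAt = -1; // time the running merge finishes (-1 = no merge running)

        for (int t = 0; t <= 300; t += flushIntervalSec) {
            components++; // a flush adds one disk component
            if (mergeEndsAt >= 0 && t >= mergeEndsAt) {
                components -= mergingCnt - 1; // merged inputs replaced by one component
                mergeEndsAt = -1;
            }
            // Single merge per partition: schedule a new merge only if none is
            // running and the tolerance count is reached (condition 2).
            if (mergeEndsAt < 0 && components >= maxToleranceComponentCnt) {
                mergingCnt = components;
                mergeEndsAt = t + mergingCnt * mergeSecPerComponent;
            }
            System.out.printf("t=%3ds components=%d%n", t, components);
        }
    }
}

With these assumed numbers, each merge takes longer than the time in which the next
batch of components is flushed, so the printed component count keeps climbing.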
I also attached the output of the command "ls -alR <directory of the asterixdb
instance for an ingestion experiment>", which was executed after the ingestion
was over.
The attached file shows that for the primary index (whose directory is
FsqCheckinTweet_idx_FsqCheckinTweet), ingestion generated 20 disk components,
where each disk component consists of a btree (the filename has the suffix _b) and
a bloom filter (the filename has the suffix _f); MaxMergableComponentSize was set to
1GB.
It also shows that for the secondary index (whose directory is
FsqCheckinTweet_idx_sifCheckinCoordinate), ingestion generated more than 1400
disk components, where each disk component consists of a dictionary btree (suffix:
_b), an inverted list (suffix: _i), a deleted-key btree (suffix: _d), and a
bloom filter for the deleted-key btree (suffix: _f).
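As a side note, the per-suffix file counts can be tallied from such an index
directory with a small sketch like the following (the directory path is passed as
an argument; the suffix conventions are the ones listed above, and the class itself
is hypothetical):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that counts disk-component files per suffix in an
// index directory, using the suffix conventions described above.
public class ComponentCounter {
    public static void main(String[] args) throws IOException {
        // e.g. <instance dir>/FsqCheckinTweet_idx_sifCheckinCoordinate (assumed path)
        Path indexDir = Paths.get(args[0]);
        int btrees = 0, invLists = 0, deletedKeyBTrees = 0, bloomFilters = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                if (name.endsWith("_b")) btrees++;
                else if (name.endsWith("_i")) invLists++;
                else if (name.endsWith("_d")) deletedKeyBTrees++;
                else if (name.endsWith("_f")) bloomFilters++;
            }
        }
        System.out.printf("btrees=%d inverted-lists=%d deleted-key-btrees=%d bloom-filters=%d%n",
                btrees, invLists, deletedKeyBTrees, bloomFilters);
    }
}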
Even after the ingestion is over, since our merge operations happen
asynchronously, merging continues and eventually merges all mergable disk
components according to the described merge policy.
------------------------------------------
AQLs for the ingestion experiment
------------------------------------------
drop dataverse STBench if exists;
create dataverse STBench;
use dataverse STBench;
create type FsqCheckinTweetType as closed {
id: int64,
user_id: int64,
user_followers_count: int64,
text: string,
datetime: datetime,
coordinates: point,
url: string?
};
create dataset FsqCheckinTweet (FsqCheckinTweetType) primary key id;
/* this index type is only available in the kisskys/hilbertbtree branch. however,
you can easily replace the sif index with an inverted keyword index on the text
field and you will see similar behavior */
create index sifCoordinate on FsqCheckinTweet(coordinates) type sif(-180.0,
-90.0, 180.0, 90.0);
/* create feed */
create feed TweetFeed
using file_feed
(("fs"="localfs"),
("path"="127.0.0.1:////Users/kisskys/Data/SynFsqCheckinTweet.adm"),("format"="ad
m"),("type-name"="FsqCheckinTweetType"),("tuple-interval"="0"));
/* connect feed */
use dataverse STBench;
set wait-for-completion-feed "true";
connect feed TweetFeed to dataset FsqCheckinTweet;
Original issue reported on code.google.com by kiss...@gmail.com on 15 Apr 2015 at 9:33