shamim8888 / asterixdb

Automatically exported from code.google.com/p/asterixdb

About correlated-prefix merge policy behavior #869

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I ran the same experiment as described in issue 868, except that the dataset 
uses the correlated prefix merge policy. (The experiment in issue 868 uses the 
default merge policy, i.e., the prefix merge policy.)

The correlated prefix merge policy looks only at primary indexes to decide 
whether a merge operation is needed. If it decides that a merge is needed, it 
merges *all* the indexes that belong to the dataset. The criteria for deciding 
whether a merge is needed are the same as those used in the prefix merge 
policy:
1. Look at the candidate components for merging in oldest-first order. If any 
exist, identify the prefix of the sequence of all such components for which 
the sum of their sizes exceeds MaxMrgCompSz. Schedule a merge of those 
components into a new component.
2. If a merge from 1 doesn't happen, see if the number of candidate components 
for merging exceeds MaxTolCompCnt. If so, schedule a merge of all of the 
current candidates into a single new component.
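
The two rules above can be sketched roughly as follows. This is only an 
illustration of my reading of the policy, not the actual AsterixDB code: the 
function name and parameters are made up, and rule 1 is read literally as "the 
shortest oldest-first prefix whose total size exceeds MaxMrgCompSz". For the 
correlated policy, the sizes passed in would be those of the primary index 
components.

```python
from itertools import accumulate


def components_to_merge(sizes, max_merge_size, max_tolerance_count):
    """Return how many of the oldest candidate components to merge (0 = no merge).

    sizes: sizes of the candidate components, oldest first (hypothetical input).
    max_merge_size / max_tolerance_count stand in for MaxMrgCompSz / MaxTolCompCnt.
    """
    # Rule 1: find the shortest oldest-first prefix whose total size
    # exceeds the size threshold, and merge that prefix.
    for count, total in enumerate(accumulate(sizes), start=1):
        if total > max_merge_size:
            return count
    # Rule 2: no size-based merge was found, so merge all candidates
    # if there are more of them than the tolerated count.
    if len(sizes) > max_tolerance_count:
        return len(sizes)
    return 0
```

For example, with sizes `[10, 20, 30, 40]` and a size threshold of 50, the 
running totals are 10, 30, 60, so the first three components would be merged.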

Given this policy, behavior similar to that of the prefix merge policy 
explained in issue 868 may occur for the correlated prefix merge policy as 
well. That is, as time goes on, the number of secondary index components will 
increase.
Also, one important difference between the prefix policy and the correlated 
prefix policy is that the current implementation of the correlated prefix 
merge policy allows concurrent merge operations in secondary indexes (but not 
in the primary index). In addition, no ordering is enforced across concurrent 
merge operations. This may cause the problem described below.

Suppose that 5 disk components, sdc1 through sdc5, are being merged into 
sdc5-1 and, concurrently, sdc6 through sdc10 are being merged into sdc10-6. If 
the merge into sdc10-6 completes first while the merge into sdc5-1 is still in 
progress, then when the next merge is scheduled after more disk components are 
flushed, say sdc11 through sdc14, sdc10-6 will be included in that merge 
together with sdc11 through sdc14. This causes a problem because, so far, our 
merge operation must merge only consecutive disk components without leaving 
any holes. The above situation will leave a hole for the merging component 
sdc10-6. (Please correct me if this explanation is wrong.)

Also, the current implementation of the correlated prefix merge policy decides 
the number of components to merge by taking the minimum number of disk 
components across all indexes in the dataset. Because of this, at the end of 
the ingestion, many disk components in secondary indexes end up unmerged. This 
situation was observed for an RTree secondary index as well.
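
To illustrate the arithmetic behind this, here is a small sketch. It assumes 
that the merge count is simply the minimum component count over all indexes; 
the function name, index names, and numbers are hypothetical and are not taken 
from the experiment.

```python
def unmerged_after_merge(component_counts):
    """Return, per index, how many disk components are left out of the merge.

    component_counts: index name -> number of disk components (hypothetical).
    Assumes every index merges the same number of components, namely the
    minimum count found across all indexes in the dataset.
    """
    merge_count = min(component_counts.values())
    return {name: count - merge_count for name, count in component_counts.items()}


# Hypothetical counts: the primary index has the fewest components.
leftover = unmerged_after_merge({"primary": 5, "btree_sec": 12, "rtree_sec": 9})
```

With these made-up counts, 5 components are merged in each index, so 7 
components of the B-tree secondary index and 4 of the R-tree secondary index 
stay unmerged.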

Original issue reported on code.google.com by kiss...@gmail.com on 15 Apr 2015 at 10:24