shamim8888 / asterixdb

Automatically exported from code.google.com/p/asterixdb

Unexpected exceptions when the FrameSize and the PageSize are different #884


GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Set FrameSize=4096 and PageSize=327680.
2. Run the dml/using-prefix-merge-policy test case under asterix-app/src/test/resource/runtimets/queries.

What is the expected output? What do you see instead?
Expected: the test should pass.

Instead, the system hung, and after a while the following exception was thrown:

java.lang.IllegalStateException: Trying to flush when writerCount != 0
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.AbstractMemoryLSMComponent.threadEnter(AbstractMemoryLSMComponent.java:97)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.enterComponents(LSMHarness.java:105)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.getAndEnterComponents(LSMHarness.java:80)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.scheduleFlush(LSMHarness.java:288)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.scheduleFlush(LSMTreeIndexAccessor.java:118)
        at edu.uci.ics.asterix.common.context.PrimaryIndexOperationTracker.triggerScheduleFlush(PrimaryIndexOperationTracker.java:137)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogPage.notifyFlushTerminator(LogPage.java:267)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogPage.batchUnlock(LogPage.java:230)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogPage.internalFlush(LogPage.java:198)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogPage.flush(LogPage.java:173)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogFlusher.call(LogManager.java:496)
        at edu.uci.ics.asterix.transaction.management.service.logging.LogFlusher.call(LogManager.java:432)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


This is on the master branch, and FrameSize=4096 is sufficient for that test case. On my local machine I sometimes see "too many open files" IO exceptions as well.

Original issue reported on code.google.com by jianfeng...@gmail.com on 16 May 2015 at 6:50

GoogleCodeExporter commented 8 years ago
It looks very similar to issue 878. It seems the LSM state machine still has some problems.

Original comment by buyingyi@gmail.com on 16 May 2015 at 6:57

GoogleCodeExporter commented 8 years ago

Original comment by buyingyi@gmail.com on 16 May 2015 at 6:58

GoogleCodeExporter commented 8 years ago
I investigated this issue and found why this situation occurs.
There are two problems behind it.

1. When an in-memory component is about to be scheduled for flushing, the component's writer count can be greater than 0 under the current code base/design. So this should be considered a valid situation (rather than an exception) and dealt with appropriately, i.e., such a component must not be scheduled to be flushed.

2. The number of buffer pages is not enough if the frame size is set to 4096 and the buffer cache page size is set to 327680 in the default asterix-build-configuration.xml. With that configuration there are at most 11 pages in the buffer cache, so the buffer cache cannot find victim pages once all pages are in use, and the test hangs. So the buffer cache size (not the buffer cache page size) should be increased appropriately to avoid this situation.
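For illustration, the page-count arithmetic behind the hang can be sketched as follows; the cache size used here is a hypothetical value chosen to match the reported "at most 11 pages", not the actual default in asterix-build-configuration.xml:

```java
// Back-of-envelope arithmetic for the buffer cache situation above.
// cacheSize is a hypothetical total budget (~3.6 MB) consistent with
// the reported 11 pages; the real value comes from the configuration.
public class BufferCachePages {
    public static void main(String[] args) {
        long pageSize = 327680;               // reported buffer cache page size
        long cacheSize = 11L * 327680 + 1000; // hypothetical total cache budget
        long numPages = cacheSize / pageSize; // pages available for pinning
        System.out.println("pages in buffer cache: " + numPages); // prints 11
        // Once all 11 pages are pinned, the replacement policy has no
        // victim to evict, so further page requests block and the test hangs.
    }
}
```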

I have a fix for problem 1 and will send a code review after I add comments to the fixed code explaining why the situation can occur, with a detailed example scenario.

Lastly, this issue is different from the situation described in issue 878, even though both issues come from the LSMHarness layer.

Original comment by kiss...@gmail.com on 18 May 2015 at 5:02

GoogleCodeExporter commented 8 years ago
There are two partitions in an NC: one is called p0 and the other p1. The number of in-memory components in each partition is two (just like double buffering).

Another important point: when there are insert/delete/search operations, each partition in the NC has a corresponding operation job pipeline, and each job pipeline is executed by a separate thread. With the aforementioned setting, there will be 2 job pipelines when there is, for example, a delete AQL statement, and 2 threads (one for each pipeline).

For each dataset in an NC, there is only one PrimaryIndexOperationTracker instance, regardless of the number of partitions in the NC. I believe this design follows from the design decision for memory resource management, where all indexes of a dataset in an NC share a common memory budget: when any index in the dataset recognizes that the memory budget is exceeded, the current mutable in-memory component of every index in the dataset should be flushed, and the next in-memory component should be ready to receive the incoming updates.
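A minimal sketch of that shared-budget idea (the class and method names here are illustrative, not the real AsterixDB classes):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the per-dataset shared memory budget described
// above; names and structure are hypothetical, not AsterixDB's code.
class DatasetMemoryBudget {
    private final long budgetBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    DatasetMemoryBudget(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    // Called by any index of the dataset when it consumes memory.
    // Returns true once the shared budget is exceeded, which should
    // cause the current mutable component of EVERY index in the
    // dataset to be flushed, not just the caller's.
    boolean consume(long bytes) {
        return usedBytes.addAndGet(bytes) > budgetBytes;
    }
}
```

Because the budget is shared, two threads working on different partitions can each observe the "exceeded" condition independently, which is exactly what happens in the scenario below.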

LSMHarness is a layer that orchestrates all incoming operations so that they conform to the LSM index framework protocol, which correctly changes the state of LSM index components; the state indicates whether a component is readable or writable and/or whether it is being flushed or merged, etc. LSMHarness is the gateway of each component: every operation must enter and exit through it before and after the operation, respectively. The enter and exit methods of each component are protected by a block synchronized on the PrimaryIndexOperationTracker.
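The gateway pattern described above can be sketched roughly like this (the method names loosely follow LSMHarness, but the bodies and fields are hypothetical):

```java
// Illustrative sketch of the enter/exit gateway pattern described above.
class HarnessSketch {
    // stands in for the PrimaryIndexOperationTracker used as the lock
    private final Object opTracker = new Object();
    int writerCount; // per-component writer count, exposed here for illustration

    void enterComponents() {
        synchronized (opTracker) {
            // check the component's state and, for a modify operation,
            // bump its writer count
            writerCount++;
        }
    }

    void exitComponents() {
        synchronized (opTracker) {
            // undo the count and trigger any pending state transition
            writerCount--;
        }
    }

    void modify() {
        enterComponents();
        try {
            // perform the actual insert/delete against the mutable component
        } finally {
            exitComponents();
        }
    }
}
```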

With this setup and the delete example, thread T1 and thread T2 of the delete job pipelines can each independently see that the memory budget of the dataset is exceeded. Once this happens, T1 and T2 may each generate a FLUSH log independently. When a FLUSH log is generated, the current mutable (or active) component's state is changed to READABLE_UNWRITABLE, and then the FLUSH log is written to the log tail.
When a FLUSH log is written to disk, scheduleFlush() is triggered. scheduleFlush() changes the component's state from READABLE_UNWRITABLE to READABLE_UNWRITABLE_FLUSHING if the component has been modified. In other words, if the component wasn't modified, it is not scheduled to be flushed; in addition, the unmodified component's state is changed back from READABLE_UNWRITABLE to READABLE_WRITABLE.

Now, let's look at the following scenario, which shows how an in-memory component can have a writer count greater than 0 when scheduleFlush() is triggered. Based on the current code base/design, the writer count of an in-memory component can legitimately be greater than 0 during scheduleFlush(). Thus, this situation should be handled properly instead of throwing an exception: a component whose writer count is greater than 0 must not be scheduled to be flushed in the scheduleFlush() method.
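The proposed handling could look roughly like this sketch (illustrative only, not the actual patch under review; the state names mirror the real component states, but the class and fields are hypothetical):

```java
// Illustrative sketch of the scheduleFlush() decision described above,
// including the proposed writer-count guard.
class MemoryComponentSketch {
    enum State { READABLE_WRITABLE, READABLE_UNWRITABLE, READABLE_UNWRITABLE_FLUSHING }

    State state = State.READABLE_UNWRITABLE;
    boolean modified;
    int writerCount;

    // Returns true iff the component was actually scheduled for flushing.
    boolean tryScheduleFlush() {
        if (!modified) {
            // Nothing to flush; reopen the component for writers.
            state = State.READABLE_WRITABLE;
            return false;
        }
        if (writerCount != 0) {
            // Writers are still inside: a valid situation under the current
            // design, so skip the flush instead of throwing
            // IllegalStateException.
            return false;
        }
        state = State.READABLE_UNWRITABLE_FLUSHING;
        return true;
    }
}
```

With such a guard, a component that still has a writer inside would simply be skipped by scheduleFlush() rather than triggering the IllegalStateException from the stack trace.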

------------
A scenario
: time flows from top to bottom
------------

TS1: (timestamp 1)

p0 mc state - RW< | IA
(p0's in-memory component states: the first in-memory component mc0 is readable_writable and the second component mc1 is inactive; "<" marks the current mutable/active component)
p1 mc state - RW< | IA
(p1's in-memory component states: the first in-memory component mc0 is readable_writable and the second component mc1 is inactive)

TS2:
thread T1 modifies p0-mc0, sees that the memory budget is exceeded, and changes the mc state.
p0 mc state - RU<* | IA (RU: readable_unwritable; "*" means the component has been modified)

TS3:
T1 calls flushIfRequested(), which changes p1's mc state
p1 mc state - RU< | IA

TS4:
T1 forms a FLUSH log and puts it into the log tail

TS5:
FLUSH log is flushed to disk and triggers scheduleFlush() by LogFlusher thread. 

TS6:
scheduleFlush() changes p0 and p1 mc state
p0 mc state - RUF* | IA< (RUF: readable_unwritable_flushing)
p1 mc state - RW< | IA (scheduleFlush() puts the state back to readable_writable since the component was not modified)

TS7:
thread T2 modifies p1-mc0, sees that the memory budget is exceeded, and changes the mc state.
p1 mc state - RU<* | IA

TS8:
flushIfRequested() by T2

TS9:
T2 forms a FLUSH log and puts it into the log tail (not flushed yet)

TS10:
T1 tries to modify p0-mc1, which changes p0's mc state
p0 mc state - RUF* | RW<

TS11
T1 modifies p0-mc1, sees that the memory budget is exceeded, and changes the mc state.
p0 mc state - RUF* | RU<*  <- mc1's writer count can be 1

TS12:
the FLUSH log (created in TS9) is flushed to disk and triggers scheduleFlush() by the LogFlusher thread.

TS13:
When the LogFlusher thread's scheduleFlush() calls AbstractMemoryLSMComponent.threadEnter(), p0's current mutable component (mc1) is in state RU<* and its writer count is greater than 0. Therefore the exception (java.lang.IllegalStateException: Trying to flush when writerCount != 0) is thrown!

Original comment by kiss...@gmail.com on 18 May 2015 at 7:05

GoogleCodeExporter commented 8 years ago
I also attached a jpg file which shows the same scenario. 

Original comment by kiss...@gmail.com on 18 May 2015 at 7:07
