lwhay / asterixdb

Automatically exported from code.google.com/p/asterixdb
0 stars 0 forks source link

Abort by lock manager #781

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I'm using two feeds to populate a dataset and hit a lock manager issue. Here is 
the exception that I'm getting:

May 29, 2014 3:20:48 PM edu.uci.ics.hyracks.control.nc.NCDriver main
SEVERE: Setting uncaught exception handler 
edu.uci.ics.hyracks.api.lifecycle.LifeCycleComponentManager@61834bbe
May 29, 2014 3:20:48 PM edu.uci.ics.hyracks.control.nc.NodeControllerService 
start
INFO: Starting NodeControllerService
May 29, 2014 3:20:48 PM 
edu.uci.ics.asterix.hyracks.bootstrap.NCApplicationEntryPoint start
INFO: Starting Asterix node controller  TAKE NOTE: a1_node1
push Job 844424930131968
push Resource 562949953424439
push Request 562949953424439
push Job 562949953421312
push Resource 1970324836974727
push Request 1970324836974719
push Job 844424930131968
edu.uci.ics.hyracks.api.exceptions.HyracksDataException: 
edu.uci.ics.asterix.common.exceptions.ACIDException: Transaction JID:14 should 
abort (requested by the Lock Manager):
Job 3:0:0
Resource 2:0:c37
Request 2:0:c37
Job 2:0:0
Resource 7:0:87
Request 7:0:7f
Job 3:0:0

        at edu.uci.ics.asterix.transaction.management.opcallbacks.PrimaryIndexModificationOperationCallback.before(PrimaryIndexModificationOperationCallback.java:47)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree.upsert(BTree.java:331)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree.access$500(BTree.java:69)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree$BTreeAccessor.upsertIfConditionElseInsert(BTree.java:896)
        at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree.insert(LSMBTree.java:400)
        at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree.modify(LSMBTree.java:337)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:226)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:210)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceInsert(LSMTreeIndexAccessor.java:145)
        at edu.uci.ics.asterix.common.dataflow.AsterixLSMInsertDeleteOperatorNodePushable.nextFrame(AsterixLSMInsertDeleteOperatorNodePushable.java:60)
        at edu.uci.ics.asterix.metadata.feeds.FeedMetaOperatorDescriptor$FeedMetaNodePushable.nextFrame(FeedMetaOperatorDescriptor.java:204)
        at edu.uci.ics.hyracks.control.nc.Task.pushFrames(Task.java:316)
        at edu.uci.ics.hyracks.control.nc.Task.run(Task.java:269)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: edu.uci.ics.asterix.common.exceptions.ACIDException: Transaction 
JID:14 should abort (requested by the Lock Manager):
Job 3:0:0
Resource 2:0:c37
Request 2:0:c37
Job 2:0:0
Resource 7:0:87
Request 7:0:7f
Job 3:0:0

        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.requestAbort(ConcurrentLockManager.java:896)
        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.enqueueWaiter(ConcurrentLockManager.java:175)
        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.lock(ConcurrentLockManager.java:151)
        at edu.uci.ics.asterix.transaction.management.opcallbacks.PrimaryIndexModificationOperationCallback.before(PrimaryIndexModificationOperationCallback.java:45)
        ... 15 more
edu.uci.ics.hyracks.api.exceptions.HyracksDataException: 
edu.uci.ics.asterix.common.exceptions.ACIDException: Transaction JID:14 should 
abort (requested by the Lock Manager):
timeout
        at edu.uci.ics.asterix.transaction.management.opcallbacks.PrimaryIndexModificationOperationCallback.before(PrimaryIndexModificationOperationCallback.java:47)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree.upsert(BTree.java:331)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree.access$500(BTree.java:69)
        at edu.uci.ics.hyracks.storage.am.btree.impls.BTree$BTreeAccessor.upsertIfConditionElseInsert(BTree.java:896)
        at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree.insert(LSMBTree.java:400)
        at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree.modify(LSMBTree.java:337)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:226)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:217)
        at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.insert(LSMTreeIndexAccessor.java:45)
        at edu.uci.ics.asterix.common.dataflow.AsterixLSMInsertDeleteOperatorNodePushable.nextFrame(AsterixLSMInsertDeleteOperatorNodePushable.java:57)
        at edu.uci.ics.asterix.metadata.feeds.FeedMetaOperatorDescriptor$FeedMetaNodePushable.nextFrame(FeedMetaOperatorDescriptor.java:204)
        at edu.uci.ics.hyracks.control.nc.Task.pushFrames(Task.java:316)
        at edu.uci.ics.hyracks.control.nc.Task.run(Task.java:269)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: edu.uci.ics.asterix.common.exceptions.ACIDException: Transaction 
JID:14 should abort (requested by the Lock Manager):
timeout
        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.requestAbort(ConcurrentLockManager.java:896)
        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.validateJob(ConcurrentLockManager.java:890)
        at edu.uci.ics.asterix.transaction.management.service.locking.ConcurrentLockManager.lock(ConcurrentLockManager.java:134)
        at edu.uci.ics.asterix.transaction.management.opcallbacks.PrimaryIndexModificationOperationCallback.before(PrimaryIndexModificationOperationCallback.java:45)
        ... 15 more

The issue seems sporadic. Not sure if it is the same issue that Zack/Young-Seok 
reported earlier. 
I'm using the lsm_storage_experiments branch for both hyracks and asterix. You 
could try to replicate it by using the following DDL:

drop dataverse experiments if exists;
create dataverse experiments;
use dataverse experiments;

create type TwitterUserType as closed {
    screen-name: string,
    lang: string,
    friends_count: int32,
    statuses_count: int32,
    name: string,
    followers_count: int32
}

create type TweetMessageType as closed {
    tweetid: int64,
    user: TwitterUserType,
    sender-location: point,
    send-time: datetime,
    referred-topics: {{ string }},
    message-text: string,

    btree-extra-field1: int32,
    btree-extra-field2: int32,
    btree-extra-field3: int32,
    btree-extra-field4: int32,
    btree-extra-field5: int32,
    btree-extra-field6: int32,
    btree-extra-field7: int32,
    btree-extra-field8: int32,
    dummy-size-adjuster: string
}

create dataset Tweets(TweetMessageType) primary key tweetid;

create feed TweetFeed1 using socket_adaptor
(
    ("sockets"="128.195.9.22:10001"),
    ("address-type"="IP"),
    ("type-name"="TweetMessageType"),
    ("format"="adm"),
    ("duration"="10")
);

create feed TweetFeed2 using socket_adaptor
(
    ("sockets"="128.195.9.22:10002"),
    ("address-type"="IP"),
    ("type-name"="TweetMessageType"),
    ("format"="adm"),
    ("duration"="10")
);

set wait-for-completion-feed "false";

connect feed TweetFeed1 to dataset Tweets;
connect feed TweetFeed2 to dataset Tweets;

I had a CC and one NC with 4 io devices. 3 of them are dedicated for data and 
one for logging.

Original issue reported on code.google.com by salsuba...@gmail.com on 29 May 2014 at 10:45

GoogleCodeExporter commented 9 years ago
I assume that the CC and the NC are on the same box, is that right?

Original comment by westm...@gmail.com on 30 May 2014 at 1:59

GoogleCodeExporter commented 9 years ago
mvn package on that branch hangs for me (after a number of NPEs in the log).
Is that expected?

Original comment by westm...@gmail.com on 30 May 2014 at 2:29

GoogleCodeExporter commented 9 years ago
I was using a separate machine for the CC.

For your second question, I actually never tried to run the tests on this 
branch. It is maintained mainly by Zack and Young-Seok. So I will let one of 
them answer this question.

Original comment by salsuba...@gmail.com on 30 May 2014 at 2:35

GoogleCodeExporter commented 9 years ago
There was some hacking that was required in order to get numbers so don't 
expect all of the tests to pass. If you give me the NPE, I can tell you whether 
or not that particular error is expected or not. My guess is that it is related 
to feeds and/or parsing.

Original comment by zheilb...@gmail.com on 30 May 2014 at 3:35

GoogleCodeExporter commented 9 years ago
NPEs look like this - they are available with different stacks on top of the 
IOManager

edu.uci.ics.hyracks.api.exceptions.HyracksDataException: 
java.lang.NullPointerException
    at edu.uci.ics.hyracks.storage.am.common.dataflow.IndexBulkLoadOperatorNodePushable.close(IndexBulkLoadOperatorNodePushable.java:92)
    at edu.uci.ics.hyracks.dataflow.std.sort.ExternalSortRunMerger.process(ExternalSortRunMerger.java:164)
    at edu.uci.ics.hyracks.dataflow.std.sort.ExternalSortOperatorDescriptor$MergeActivity$1.initialize(ExternalSortOperatorDescriptor.java:189)
    at edu.uci.ics.hyracks.api.rewriter.runtime.SuperActivityOperatorNodePushable.initialize(SuperActivityOperatorNodePushable.java:81)
    at edu.uci.ics.hyracks.control.nc.Task.run(Task.java:234)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at edu.uci.ics.hyracks.control.nc.io.IOManager.sync(IOManager.java:241)
    at edu.uci.ics.hyracks.storage.common.buffercache.BufferCache.force(BufferCache.java:787)
    at edu.uci.ics.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.forceFlushDirtyPages(AbstractLSMIndex.java:114)
    at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree.markAsValid(LSMBTree.java:533)
    at edu.uci.ics.hyracks.storage.am.lsm.common.impls.LSMHarness.addBulkLoadedComponent(LSMHarness.java:345)
    at edu.uci.ics.hyracks.storage.am.lsm.btree.impls.LSMBTree$LSMBTreeBulkLoader.end(LSMBTree.java:606)
    at edu.uci.ics.hyracks.storage.am.common.dataflow.IndexBulkLoadOperatorNodePushable.close(IndexBulkLoadOperatorNodePushable.java:90)
    ... 7 more

Original comment by westm...@gmail.com on 30 May 2014 at 5:32

GoogleCodeExporter commented 9 years ago
Okay, this looks foreign to me. It doesn't look related to a change that we 
made for the experiments.

Original comment by zheilb...@gmail.com on 30 May 2014 at 6:14

GoogleCodeExporter commented 9 years ago
I think that the NPE is related to the IO profiling code that counts the number 
of buffer evictions.
As long as there is only one running NC in a machine, the NPE doesn't occur. 
However, the ExecutionTest uses two NCs, so it causes the NPE. 

Sattam may know more than me regarding the buffer eviction count code since he 
added the code there. :)

The following shows the corresponding code change links.
1. Hyracks:
https://code.google.com/p/hyracks/source/detail?r=0bd44d82c7a2bdaf1e36a58b53ca93
c896f5a402&name=lsm_storage_experiments
2. Asterix:
https://code.google.com/p/asterixdb/source/detail?r=9dee7f0d1b2b2c5afc70928562b9
d811e6b1687c&name=lsm_storage_experiments
--> Among the changed sources in Asterix, JobEventListenerFactory.java is the 
only source code directly related to the buffer eviction count.

Original comment by kiss...@gmail.com on 30 May 2014 at 4:03

GoogleCodeExporter commented 9 years ago
Ah, now I remember :-)
Yes, I added this code to collect performance numbers a while ago, and it is 
quite possible to cause errors when running the tests.

So please ignore the failing test cases.

Original comment by salsuba...@gmail.com on 30 May 2014 at 4:07

GoogleCodeExporter commented 9 years ago
Or if you prefer, you could locally revert the changes that Young-Seok shared 
for both hyracks and asterix.

Original comment by salsuba...@gmail.com on 30 May 2014 at 4:08

GoogleCodeExporter commented 9 years ago
Status: There has been some investigation, but I was not able to identify or 
reliably reproduce the issue.

Original comment by westm...@gmail.com on 30 Jul 2014 at 4:34