prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.05k stars 5.38k forks source link

Presto nodes crashed when we run some complicated sqls #4650

Closed knowledgehacker closed 8 years ago

knowledgehacker commented 8 years ago

Some Presto nodes crashed when we run some complicated sqls, and the crash causes the server to reboot. I switched CMS gc collector to g1 collector, but it did not solve the problem. And I tuned survivor/eden ratio, it mitigated the problem, but I am not sure it solved the problem.

--- Here is the information a. Environment Presto 0.83 and running on HDP 2.2.4, JDK1.7.0_75-b13, on Linux 2.6.32-504.el6.x86_64

b. Crash log ...

--------------- T H R E A D ---------------

Current thread (0x00007ffe4401f000): GCTaskThread [stack: 0x00007ffe3a6a7000,0x00007ffe3a7a8000] [id=12796]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000000e6a59

...

Stack: [0x00007ffe3a6a7000,0x00007ffe3a7a8000], sp=0x00007ffe3a7a6270, free space=1020k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x83d02f] ParRootScanWithoutBarrierClosure::dooop(oopDesc*)+0x4f V [libjvm.so+0x51e9bf] frame::oops_compiled_argumentsdo(Symbol, bool, RegisterMap const, OopClosure)+0x27f V [libjvm.so+0x7edd83] nmethod::preserve_callee_argumentoops(frame, RegisterMap const, OopClosure_)+0x1c3 V [libjvm.so+0x51f1e3] frame::oops_dointernal(OopClosure, CodeBlobClosure, RegisterMap, bool)+0x143 V [libjvm.so+0x95507a] JavaThread::oopsdo(OopClosure, CodeBlobClosure_)+0x15a V [libjvm.so+0x956242] Threads::possibly_parallel_oopsdo(OopClosure, CodeBlobClosure_)+0x112 V [libjvm.so+0x8b1e59] SharedHeap::process_strongroots(bool, bool, SharedHeap::ScanningOption, OopClosure, CodeBlobClosure, OopsInGenClosure, bool)+0xa9 V [libjvm.so+0x56811c] GenCollectedHeap::gen_process_strongroots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure, bool, OopsInGenClosure_)+0x7c V [libjvm.so+0x83a5df] ParNewGenTask::work(unsigned int)+0xef V [libjvm.so+0x9b76ef] GangWorker::loop()+0xcf V [libjvm.so+0x820b88] java_start(Thread*)+0x108

JavaThread 0x00007ffe4537a000 (nid = 13202) was being processed Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::resolve_static_call J 18233 C2 com.facebook.presto.$gen.PageProcessor_1967.project_0(Lcom/facebook/presto/spi/ConnectorSession;Lcom/facebook/presto/spi/block/Block;ILcom/facebook/presto/spi/block/BlockBuilder;)V (908 bytes) @ 0x00007ffe3d76899c [0x00007ffe3d7686e0+0x2bc] j com.facebook.presto.$gen.PageProcessor_1967.process(Lcom/facebook/presto/spi/ConnectorSession;Lcom/facebook/presto/spi/Page;IILcom/facebook/presto/spi/PageBuilder;)I+222 J 7607 C2 com.facebook.presto.operator.FilterAndProjectOperator.getOutput()Lcom/facebook/presto/spi/Page; (129 bytes) @ 0x00007ffe3bf14b68 [0x00007ffe3bf14640+0x528] J 6659 C2 com.facebook.presto.operator.Driver.processInternal()Lcom/google/common/util/concurrent/ListenableFuture; (272 bytes) @ 0x00007ffe3bb96f84 [0x00007ffe3bb967c0+0x7c4] J 13660 C2 com.facebook.presto.operator.Driver.processFor(Lio/airlift/units/Duration;)Lcom/google/common/util/concurrent/ListenableFuture; (251 bytes) @ 0x00007ffe3cd795d8 [0x00007ffe3cd79160+0x478] J 16225 C2 com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(Lio/airlift/units/Duration;)Lcom/google/common/util/concurrent/ListenableFuture; (67 bytes) @ 0x00007ffe3d2b4b6c [0x00007ffe3d2b4ac0+0xac] J 16241 C2 com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process()Lcom/google/common/util/concurrent/ListenableFuture; (136 bytes) @ 0x00007ffe3b39fb94 [0x00007ffe3b39f980+0x214] J 15556 C2 com.facebook.presto.execution.TaskExecutor$Runner.run()V (583 bytes) @ 0x00007ffe3d22735c [0x00007ffe3d226d20+0x63c] j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub

...

Heap par new generation total 1380160K, used 1298596K [0x00000001f6a00000, 0x0000000254390000, 0x0000000254390000) eden space 1226816K, 100% used [0x00000001f6a00000, 0x0000000241810000, 0x0000000241810000) from space 153344K, 46% used [0x0000000241810000, 0x0000000245e29370, 0x000000024add0000) to space 153344K, 79% used [0x000000024add0000, 0x000000025249ccb0, 0x0000000254390000) concurrent mark-sweep generation total 23632320K, used 13056651K [0x0000000254390000, 0x00000007f6a00000, 0x00000007f6a00000) concurrent-mark-sweep perm gen total 153600K, used 128447K [0x00000007f6a00000, 0x0000000800000000, 0x0000000800000000)

Card table byte_map: [0x00007ffdf0fb4000,0x00007ffdf4000000] byte_map_base: 0x00007ffdeffff000

nezihyigitbasi commented 8 years ago

Did you try upgrading to Java 8? As a side note you should also consider upgrading to a more recent presto version. On Thu, Feb 25, 2016 at 10:14 PM ming lin notifications@github.com wrote:

Some Presto nodes crashed when we run some complicated sqls. I switched CMS gc collector to g1 collector, but it did not solve the problem. And I tuned survivor/eden ratio, it mitigated the problem, but I am not sure it solved the problem.

--- Here is the information a. Environment Presto 0.83 and running on HDP 2.2.4, JDK1.7u75, on Linux 2.6.32-504.el6.x86_64

b. Crash log # A fatal error has been detected by the Java Runtime Environment:

# SIGSEGV (0xb) at pc=0x00007ffe4921602f, pid=12794, tid=140729879525120

# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13) Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops) Problematic frame: V [libjvm.so+0x83d02f] ParRootScanWithoutBarrierClosure::do_oop(oopDesc**)+0x4f

# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

# If you would like to submit a bug report, please visit: http://bugreport.sun.com/bugreport/crash.jsp

#

--------------- T H R E A D ---------------

Current thread (0x00007ffe4401f000): GCTaskThread [stack: 0x00007ffe3a6a7000,0x00007ffe3a7a8000] [id=12796]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000000000e6a59

...

Stack: [0x00007ffe3a6a7000,0x00007ffe3a7a8000], sp=0x00007ffe3a7a6270, free space=1020k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x83d02f] ParRootScanWithoutBarrierClosure::dooop(oopDesc* )+0x4f V [libjvm.so+0x51e9bf] frame::oops_compiled_argumentsdo(Symbol, bool, RegisterMap const, OopClosure)+0x27f V [libjvm.so+0x7edd83] nmethod::preserve_callee_argumentoops(frame, RegisterMap const, OopClosure_)+0x1c3 V [libjvm.so+0x51f1e3] frame::oops_dointernal(OopClosure, CodeBlobClosure, RegisterMap , bool)+0x143 V [libjvm.so+0x95507a] JavaThread::oopsdo(OopClosure, CodeBlobClosure _)+0x15a V [libjvm.so+0x956242] Threads::possibly_parallel_oopsdo(OopClosure, CodeBlobClosure _)+0x112 V [libjvm.so+0x8b1e59] SharedHeap::process_strongroots(bool, bool, SharedHeap::ScanningOption, OopClosure, CodeBlobClosure, OopsInGenClosure, bool)+0xa9 V [libjvm.so+0x56811c] GenCollectedHeap::gen_process_strongroots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure, bool, OopsInGenClosure_)+0x7c V [libjvm.so+0x83a5df] ParNewGenTask::work(unsigned int)+0xef V [libjvm.so+0x9b76ef] GangWorker::loop()+0xcf V [libjvm.so+0x820b88] java_start(Thread*)+0x108

JavaThread 0x00007ffe4537a000 (nid = 13202) was being processed Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::resolve_static_call J 18233 C2 com.facebook.presto.$gen.PageProcessor_1967.project_0(Lcom/facebook/presto/spi/ConnectorSession;Lcom/facebook/presto/spi/block/Block;ILcom/facebook/presto/spi/block/BlockBuilder;)V (908 bytes) @ 0x00007ffe3d76899c [0x00007ffe3d7686e0+0x2bc] j com.facebook.presto.$gen.PageProcessor_1967.process(Lcom/facebook/presto/spi/ConnectorSession;Lcom/facebook/presto/spi/Page;IILcom/facebook/presto/spi/PageBuilder;)I+222 J 7607 C2 com.facebook.presto.operator.FilterAndProjectOperator.getOutput()Lcom/facebook/presto/spi/Page; (129 bytes) @ 0x00007ffe3bf14b68 [0x00007ffe3bf14640+0x528] J 6659 C2 com.facebook.presto.operator.Driver.processInternal()Lcom/google/common/util/concurrent/ListenableFuture; (272 bytes) @ 0x00007ffe3bb96f84 [0x00007ffe3bb967c0+0x7c4] J 13660 C2 com.facebook.presto.operator.Driver.processFor(Lio/airlift/units/Duration;)Lcom/google/common/util/concurrent/ListenableFuture; (251 bytes) @ 0x00007ffe3cd795d8 [0x00007ffe3cd79160+0x478] J 16225 C2 com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(Lio/airlift/units/Duration;)Lcom/google/common/util/concurrent/ListenableFuture; (67 bytes) @ 0x00007ffe3d2b4b6c [0x00007ffe3d2b4ac0+0xac] J 16241 C2 com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process()Lcom/google/common/util/concurrent/ListenableFuture; (136 bytes) @ 0x00007ffe3b39fb94 [0x00007ffe3b39f980+0x214] J 15556 C2 com.facebook.presto.execution.TaskExecutor$Runner.run()V (583 bytes) @ 0x00007ffe3d22735c [0x00007ffe3d226d20+0x63c] j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95 j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 j java.lang.Thread.run()V+11 v ~StubRoutines::call_stub

...

Heap par new generation total 1380160K, used 1298596K [0x00000001f6a00000, 0x0000000254390000, 0x0000000254390000) eden space 1226816K, 100% used [0x00000001f6a00000, 0x0000000241810000, 0x0000000241810000) from space 153344K, 46% used [0x0000000241810000, 0x0000000245e29370, 0x000000024add0000) to space 153344K, 79% used [0x000000024add0000, 0x000000025249ccb0, 0x0000000254390000) concurrent mark-sweep generation total 23632320K, used 13056651K [0x0000000254390000, 0x00000007f6a00000, 0x00000007f6a00000) concurrent-mark-sweep perm gen total 153600K, used 128447K [0x00000007f6a00000, 0x0000000800000000, 0x0000000800000000)

Card table byte_map: [0x00007ffdf0fb4000,0x00007ffdf4000000] byte_map_base: 0x00007ffdeffff000

— Reply to this email directly or view it on GitHub https://github.com/facebook/presto/issues/4650.

knowledgehacker commented 8 years ago

Thanks for your comment. We are planning to upgrade Presto, but current version will run for a short time, we need to fix it during this period. I can't tell exactly it is an internal bug in JVM or some native method in Presto causes the crash. I plan to try new Java if we can't solve the problem.

nezihyigitbasi commented 8 years ago

There are similar jvm bug reports that's why I recommended you to upgrade your Java version first. On Fri, Feb 26, 2016 at 7:33 AM ming lin notifications@github.com wrote:

Thanks for your comment. We are planning to upgrade Presto, but current version will run for a short time, we need to fix it during this period. I can't tell exactly it is an internal bug in JVM or some native method in Presto causes the crash. I plan to try new Java if we can't solve the problem.

— Reply to this email directly or view it on GitHub https://github.com/facebook/presto/issues/4650#issuecomment-189324347.

electrum commented 8 years ago

We saw JVM crashes before 8u40.

knowledgehacker commented 8 years ago

Thanks @nezihyigitbasi. I will try it. @electrum, is there possible some native methods in Presto causes the issue? It's strange that when the crash happened, it always stop at PageProcessor or CursorProcessor.

martint commented 8 years ago

No, that crash is inside a GC thread, so it's unlikely to be native code in Presto.

@knowledgehacker, can you paste the crash logs when you switch to G1? BTW, there have been a lot of fixes to GC since Java 7, so I suggest you try with a newer version of Java.

knowledgehacker commented 8 years ago

@martint I tried jdk1.8.0 u65, and it still crashes, and we get the following error from server.log, maybe there is something wrong with the memory(a hardware issue). Any thoughts?

2016-02-25T10:16:55.002+0000 WARN http-worker-991 com.facebook.presto.server.ThrowableMapper Request failed for /v1/task/20160225_101652_00046_q93jg.1.4 java.lang.NullPointerException: null at com.facebook.presto.execution.TaskExecutor$TaskHandle.access$1300(TaskExecutor.java:288) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.(TaskExecutor.java:400) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.(TaskExecutor.java:373) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.TaskExecutor.enqueueSplits(TaskExecutor.java:191) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskExecution.enqueueDrivers(SqlTaskExecution.java:308) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskExecution.updateSources(SqlTaskExecution.java:270) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskExecution.addSources(SqlTaskExecution.java:222) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskExecution.createSqlTaskExecution(SqlTaskExecution.java:119) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskExecutionFactory.create(SqlTaskExecutionFactory.java:101) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTask.updateTask(SqlTask.java:219) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.execution.SqlTaskManager.updateTask(SqlTaskManager.java:223) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at com.facebook.presto.server.TaskResource.createOrUpdateTask(TaskResource.java:104) ~[presto-main-0.83-fw0.8.2.jar:0.83-fw0.8.2] at sun.reflect.GeneratedMethodAccessor215.invoke(Unknown Source) ~[na:na] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_75] at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_75] ...

knowledgehacker commented 8 years ago

This is probably a bug in Presto. The symptom is somewhat different when Presto crashes. For this time's crash, it is probably caused by a bug in Presto(NullPointerException). I didn't get a crash log since the time I got the one pasted. Any thoughts?

knowledgehacker commented 8 years ago

I see when a SqlTaskExecution is constructed:

        if (!taskStateMachine.getState().isDone()) {
            taskHandle = taskExecutor.addTask(taskId);
            taskStateMachine.addStateChangeListener(new RemoveTaskHandleWhenDone(taskExecutor, taskHandle));
        }
        else {
            taskHandle = null;
        }

Then when enqueueing a split: public synchronized List<ListenableFuture<?>> enqueueSplits(TaskHandle taskHandle, boolean forceStart, List<? extends SplitRunner> taskSplits) { List<ListenableFuture<?>> finishedFutures = new ArrayList<>(taskSplits.size()); for (SplitRunner taskSplit : taskSplits) { PrioritizedSplitRunner prioritizedSplitRunner = new PrioritizedSplitRunner(taskHandle, taskSplit, ticker); ...

_While in PrioritizedSplitRunner's constructor: _

    private PrioritizedSplitRunner(TaskHandle taskHandle, SplitRunner split, Ticker ticker)
    {
        this.taskHandle = taskHandle;
        this.splitId = taskHandle.getNextSplitId();

So it seems that the taskHandle is a null pointer when it is referenced in constructor "PriopritizedSplitRunner".

knowledgehacker commented 8 years ago

Any comment on it?

dain commented 8 years ago

The version you are using 0.83 was released on Nov 12, 2014. We have fixed tons of bugs in the last year and half, so I am not surprised that you seeing problems.

Please, upgrade to the latest release and try again. If you still have problems, please open a new issue.