onyx-platform / onyx

Distributed, masterless, high performance, fault tolerant data processing
http://www.onyxplatform.org
Eclipse Public License 1.0

IndexOutOfBoundsException from aeron #900

Open jgerman opened 5 years ago

jgerman commented 5 years ago

I tried upgrading our Onyx system to 0.14.6 this morning and I'm getting errors on startup; it looks like every task is blowing up. There are two versions of the error. The first comes from the Aeron status channel:

19-10-22 20:16:14 robert-downey-jr-master-5c869697c5-dkm7w WARN [onyx.messaging.aeron.status-publisher:40] - Aeron status channel error
                                         java.lang.Thread.run                      Thread.java:  748
                        org.agrona.concurrent.AgentRunner.run                 AgentRunner.java:  164
                org.agrona.concurrent.AgentRunner.doDutyCycle                 AgentRunner.java:  283
                              io.aeron.ClientConductor.doWork             ClientConductor.java:  191
                             io.aeron.ClientConductor.service             ClientConductor.java:  896
                         io.aeron.DriverEventsAdapter.receive         DriverEventsAdapter.java:   63
org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive       CopyBroadcastReceiver.java:  116
                       io.aeron.DriverEventsAdapter.onMessage         DriverEventsAdapter.java:  123
   io.aeron.command.ImageBuffersReadyFlyweight.sourceIdentity  ImageBuffersReadyFlyweight.java:  239
            org.agrona.concurrent.UnsafeBuffer.getStringAscii                UnsafeBuffer.java: 1085
            org.agrona.concurrent.UnsafeBuffer.getStringAscii                UnsafeBuffer.java: 1134
              org.agrona.concurrent.UnsafeBuffer.boundsCheck0                UnsafeBuffer.java: 1716
java.lang.IndexOutOfBoundsException: index=124 length=822083584 capacity=4096

The second is task-specific and is thrown in poll-recover:

19-10-22 20:16:14 robert-downey-jr-master-5c869697c5-dkm7w WARN [onyx.peer.task-lifecycle:177] -
                                           java.lang.Thread.run                      Thread.java:  748
             java.util.concurrent.ThreadPoolExecutor$Worker.run          ThreadPoolExecutor.java:  624
              java.util.concurrent.ThreadPoolExecutor.runWorker          ThreadPoolExecutor.java: 1149
                                                            ...
                              clojure.core.async/thread-call/fn                        async.clj:  434
              onyx.peer.task-lifecycle/start-task-lifecycle!/fn               task_lifecycle.clj: 1155
                   onyx.peer.task-lifecycle/run-task-lifecycle!               task_lifecycle.clj:  551
        onyx.peer.task-lifecycle.TaskStateMachine/next-replica!               task_lifecycle.clj:  961
           onyx.messaging.messenger-state/next-messenger-state!              messenger_state.clj:   92
            onyx.messaging.messenger-state/transition-messenger              messenger_state.clj:   83
onyx.messaging.aeron.messenger.AeronMessenger/update-publishers                    messenger.clj:  112
           onyx.messaging.aeron.messenger/transition-publishers                    messenger.clj:   51
                                          clojure.core/group-by                         core.clj: 7146
                                            clojure.core/reduce                         core.clj: 6828
                                    clojure.core.protocols/fn/G                    protocols.clj:   13
                                      clojure.core.protocols/fn                    protocols.clj:   75
                              clojure.core.protocols/seq-reduce                    protocols.clj:   24
                                               clojure.core/seq                         core.clj:  137
                                                            ...
                                           clojure.core/keep/fn                         core.clj: 7341
        onyx.messaging.aeron.messenger/transition-publishers/fn                    messenger.clj:   50
                   onyx.messaging.aeron.publisher/reconcile-pub                    publisher.clj:  291
                 onyx.messaging.aeron.publisher.Publisher/start                    publisher.clj:  198
      onyx.messaging.aeron.endpoint-status.EndpointStatus/start              endpoint_status.clj:   79
                                 io.aeron.Aeron.addSubscription                       Aeron.java:  263
                       io.aeron.ClientConductor.addSubscription             ClientConductor.java:  495
                       io.aeron.ClientConductor.addSubscription             ClientConductor.java:  521
                         io.aeron.ClientConductor.awaitResponse             ClientConductor.java:  945
                               io.aeron.ClientConductor.service             ClientConductor.java:  896
                           io.aeron.DriverEventsAdapter.receive         DriverEventsAdapter.java:   63
  org.agrona.concurrent.broadcast.CopyBroadcastReceiver.receive       CopyBroadcastReceiver.java:  116
                         io.aeron.DriverEventsAdapter.onMessage         DriverEventsAdapter.java:  123
     io.aeron.command.ImageBuffersReadyFlyweight.sourceIdentity  ImageBuffersReadyFlyweight.java:  239
              org.agrona.concurrent.UnsafeBuffer.getStringAscii                UnsafeBuffer.java: 1085
              org.agrona.concurrent.UnsafeBuffer.getStringAscii                UnsafeBuffer.java: 1134
                org.agrona.concurrent.UnsafeBuffer.boundsCheck0                UnsafeBuffer.java: 1716
java.lang.IndexOutOfBoundsException: index=124 length=808517632 capacity=4096
         clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/poll-recover. Killing the job. -> Exception type: java.lang.IndexOutOfBoundsException. Exception message: index=124 length=808517632 capacity=4096
       job-id: #uuid "00000000-0000-0000-0000-000000000003"
     metadata: {:job-id #uuid "00000000-0000-0000-0000-000000000003", :job-hash "7ba27abbd73fa66ec2351c328b997173d84067333d334ca41584c39e0669f"}
      peer-id: #uuid "3236c6dd-c980-caac-12e3-d339f7c564ad"
    task-name: :prepare-pending-state-tx

I'm not even quite sure how to begin figuring out what's going wrong here. I started poking around but haven't made much headway.

The problem doesn't occur in 0.14.5, and I see Aeron was upgraded in 0.14.6. We do set a large term buffer length (-Daeron.term.buffer.length=8388608), but that is still far smaller than the length reported in the exception. I've also tried without that setting, and the errors still occur.
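One way to start digging is to look at the bogus length values themselves. The sketch below is in Python purely for convenience and only does arithmetic on the numbers from the two exceptions above. The helper name `inspect_length` is made up for illustration, and the little-endian byte order is an assumption (typical for x86 hosts), not something taken from the Aeron or Agrona source.

```python
import struct

TERM_BUFFER_LENGTH = 8_388_608                  # -Daeron.term.buffer.length from the report
REPORTED_LENGTHS = [822_083_584, 808_517_632]   # "length=" values in the two exceptions


def inspect_length(value: int) -> dict:
    """Break a 32-bit length down into hex, raw bytes, and any ASCII digits.

    Hypothetical helper for illustration only.
    """
    raw = struct.pack("<i", value)  # assumption: little-endian layout
    return {
        "hex": f"0x{value:08x}",
        "bytes": raw,
        # Bytes in the range 0x30..0x39 are the ASCII characters '0'..'9'.
        "ascii_digits": [chr(b) for b in raw if 0x30 <= b <= 0x39],
        # How many configured term buffers this "length" would span.
        "times_term_buffer": value / TERM_BUFFER_LENGTH,
    }


if __name__ == "__main__":
    for length in REPORTED_LENGTHS:
        print(length, inspect_length(length))
```

Both values come out with their only non-zero bytes in the ASCII digit range (0x31 and 0x30), which suggests the string length prefix is being read from a position that actually holds text rather than a real length. One possible reading, and it is only a hypothesis that nothing in this thread confirms, is that the client is parsing the message with a different layout than the one the media driver wrote, for example if the two are running different Aeron versions.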

Does anyone have any advice?

DVious commented 2 years ago

No advice!

I wonder if you managed to resolve this issue and, more generally, whether Onyx is still supported, or whether its goals are being pursued by some part of the community?

Any advice?!

Regs,

neuromantik33 commented 2 years ago

I think Onyx has been EOL for a while now. It's probably best to switch to Kafka Streams with jackdaw + willa.

DVious commented 2 years ago

Thanks, fren!