twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

ArrayIndexOutOfBoundsException in Cascading #1794

Open fwbrasil opened 6 years ago

fwbrasil commented 6 years ago

One of our e2e tests fails when I try to use the develop branch:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at cascading.tuple.TupleEntryChainIterator.next(TupleEntryChainIterator.java:79)
    at cascading.tuple.TupleEntryChainIterator.next(TupleEntryChainIterator.java:32)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at com.twitter.scalding.typed.cascading_backend.AsyncFlowDefRunner$$anonfun$getIterable$1$$anon$1.foreach(AsyncFlowDefRunner.scala:360)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at com.twitter.scalding.typed.cascading_backend.AsyncFlowDefRunner$$anonfun$getIterable$1$$anon$1.map(AsyncFlowDefRunner.scala:360)
    at com.twitter.data_platform.e2e_testing.jobs.dal_keyval_source_summingbird.VerifyResultsExecutionApp$$anonfun$3.apply(VKVSTest.scala:104)
    at com.twitter.data_platform.e2e_testing.jobs.dal_keyval_source_summingbird.VerifyResultsExecutionApp$$anonfun$3.apply(VKVSTest.scala:102)
    at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)

Considering that Iterator.foreach checks if hasNext before calling next, it seems that TupleEntryChainIterator enters a bad state where currentIterator points to an invalid position.

I haven't been able to reproduce the cascading bug in isolation yet.

cc/ @johnynek

johnynek commented 6 years ago

I wonder if the source you are dealing with has a bug with toIterator? We assume we can call that again and again, but maybe this source has an issue there?

fwbrasil commented 6 years ago

It seems to be a bug in cascading. TupleEntryChainIterator should never throw if used correctly (hasNext and then next), which is the case.
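One way to back up the "used correctly" claim is to wrap the iterator and count any `next()` call that was not preceded by a `hasNext()` returning true. The sketch below is only an illustration: `CheckedIterator` and `violationsDuringForEach` are hypothetical names, and Java's `Iterator.forEachRemaining` stands in for Scala's `Iterator.foreach`, since both loop on `hasNext` before each `next`.

```java
import java.util.Arrays;
import java.util.Iterator;

public class ContractCheck {

    /** Hypothetical wrapper that counts next() calls made without a prior hasNext() == true. */
    static final class CheckedIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        private boolean confirmed = false; // true only right after hasNext() returned true
        private int violations = 0;

        CheckedIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            confirmed = delegate.hasNext();
            return confirmed;
        }

        @Override
        public T next() {
            if (!confirmed) {
                violations++; // caller skipped the hasNext() check
            }
            confirmed = false; // each next() needs its own hasNext()
            return delegate.next();
        }

        int violations() {
            return violations;
        }
    }

    /** Drains a small list with forEachRemaining and reports contract violations. */
    static int violationsDuringForEach() {
        CheckedIterator<Integer> it =
            new CheckedIterator<>(Arrays.asList(1, 2, 3).iterator());
        it.forEachRemaining(x -> { });
        return it.violations();
    }

    public static void main(String[] args) {
        System.out.println("violations: " + violationsDuringForEach());
    }
}
```

If a wrapper like this reports zero violations around the failing call site, the caller can be ruled out and the problem has to sit further down the iterator chain.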

johnynek commented 6 years ago

I wonder if it is exhibited in cascading 2.7?

johnynek commented 6 years ago

Also: why did we not trigger it before, but do now?

johnynek commented 6 years ago

I'd love to find a repro of this issue.

fwbrasil commented 6 years ago

I've investigated this issue a little more. The bug is not in TupleEntryChainIterator, but in the underlying iterator impl HadoopTupleEntrySchemeIterator. Its hasNext returns true initially but a second call to hasNext returns false, even before next is called.
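A toy model of how that could surface as the exception above, assuming the chain iterator calls `hasNext()` again inside `next()` to roll forward to the following child. This is not Cascading's actual source: `FlakyIterator` and `ChainIterator` are hypothetical stand-ins for `HadoopTupleEntrySchemeIterator` and `TupleEntryChainIterator`.

```java
import java.util.Iterator;

public class ChainIteratorSketch {

    /** Hypothetical child: hasNext() is true on the first call and false afterwards. */
    static final class FlakyIterator implements Iterator<String> {
        private int hasNextCalls = 0;

        @Override
        public boolean hasNext() {
            return hasNextCalls++ == 0;
        }

        @Override
        public String next() {
            return "tuple";
        }
    }

    /** Simplified chain iterator (assumed design, not the real implementation). */
    static final class ChainIterator<T> implements Iterator<T> {
        private final Iterator<T>[] iterators;
        private int current = 0;

        @SafeVarargs
        ChainIterator(Iterator<T>... iterators) {
            this.iterators = iterators;
        }

        @Override
        public boolean hasNext() {
            if (current >= iterators.length) {
                return false;
            }
            if (iterators[current].hasNext()) {
                return true;
            }
            current++; // current child is exhausted, move to the next one
            return hasNext();
        }

        @Override
        public T next() {
            hasNext(); // roll forward; relies on the child's hasNext() being idempotent
            return iterators[current].next();
        }
    }

    /** Returns true if a well-behaved hasNext/next loop still hits the exception. */
    static boolean throwsOnFlakyChild() {
        Iterator<String> chain = new ChainIterator<>(new FlakyIterator());
        try {
            while (chain.hasNext()) {
                chain.next();
            }
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("throws: " + throwsOnFlakyChild());
    }
}
```

In this model the caller's `hasNext()` sees true, the second `hasNext()` inside `next()` sees false and advances `current` to 1, and indexing a one-element array at position 1 throws, even though the caller followed the contract.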

johnynek commented 6 years ago

@fwbrasil is this a race condition in Hadoop? We have seen a few issues that look like one.