Very good article on idempotent and transactional exactly-once. But I am hitting a scenario where the checkpoint itself fails and causes partial writes. Can you take a look and suggest something?
I am using Spark Streaming 1.3.0 with createDirectStream. I am hitting the following scenario and would like your suggestion on how to design for atomicity. Here is pseudocode describing the flow and the key points.
S1 = createDirectStream(kafka)        ==> I have an OffsetRange associated with each RDD
S1.print()                            ==> good
S2 = S1.flatMap(someTransformation)   ==> does not require checkpointing
S2.print()                            ==> good
S3 = S2.updateStateByKey(updateFunc)  ==> requires checkpointing; the checkpoint fails, e.g. due to an HDFS issue
S3.print()                            ==> nothing printed
S2.foreachRDD { rdd =>
  SaveToElasticSearch(rdd)            ==> written to Elasticsearch fine
}
S3.foreachRDD { rdd =>
  SaveToElasticSearch(rdd)            ==> nothing written to Elasticsearch
}
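For reference, this is roughly how I pick up the OffsetRange for each batch today (only a sketch; ssc, kafkaParams and topics are my own setup, the rest is the Spark 1.3 direct stream API as I understand it):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  val S1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  // Remember the offset ranges of each batch on the driver so the output code can see them.
  var offsetRanges = Array.empty[OffsetRange]
  val S1WithOffsets = S1.transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }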
I was hoping each batch would be atomic, i.e. as long as there is an error, the offsets would not advance and no writes would happen. But I have observed two issues:
1) The Kafka offsets kept moving on to the next batch even though a dependent stream (e.g. S3) failed.
2) Partial writes went to Elasticsearch.
We would like to see:
1) The offsets stop advancing if anything in the job fails, and Spark Streaming recovers by itself from the right offsets (roughly along the lines of the sketch below).
2) All streams written as one unit.
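To make (1) concrete, this is the shape I imagine, following the transactional pattern from the article: keep the offsets in my own store, resume from them on startup, and only advance them after the writes for the batch have succeeded. It is only a sketch; OffsetStore, someTransformation and SaveToElasticSearch are my own placeholders, and I still do not see how updateStateByKey and the second output fit in.

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  // Resume from the offsets I committed myself, instead of the checkpoint directory.
  val fromOffsets: Map[TopicAndPartition, Long] = OffsetStore.load()

  val S1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

  var ranges = Array.empty[OffsetRange]
  val S2 = S1.transform { rdd =>
    ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // offsets for this batch
    rdd
  }.flatMap(someTransformation)

  S2.foreachRDD { rdd =>
    SaveToElasticSearch(rdd)     // write the batch first
    OffsetStore.commit(ranges)   // only advance the stored offsets if the write succeeded
  }

With something like this, an HDFS checkpoint failure would at least not silently move the offsets forward, because they only advance when my own commit runs. But I am not sure how to extend it to cover both S2 and S3 as one unit.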
Any suggestions?
Tian