Very good article on idempotent and transactional exactly-once. But I am hitting a scenario where the checkpoint itself fails and causes partial writes. Can you take a look and suggest something?
I am using Spark Streaming 1.3.0 with createDirectStream. I am hitting the following scenario and would like your suggestion on how to design for atomicity. Here is pseudocode describing the flow and the key points.
S1 = createDirectStream(kafka)        ==> I have an OffsetRange associated with each RDD
S1.print()                            ==> good
S2 = S1.flatMap(someTransformation)   ==> does not require checkpointing
S2.print()                            ==> good
S3 = S2.updateStateByKey(updateFunc)  ==> requires checkpointing; the checkpoint fails, e.g. due to an HDFS issue
S3.print()                            ==> nothing printed
S2.foreachRDD { rdd =>
  SaveToElasticSearch(rdd)            ==> written to Elasticsearch fine
}
S3.foreachRDD { rdd =>
  SaveToElasticSearch(rdd)            ==> nothing written to Elasticsearch
}
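For reference, this is roughly how I pick up the OffsetRange for each batch today (only a sketch; ssc, kafkaParams and topics are my own setup, the rest is the Spark 1.3 direct stream API as I understand it):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  val S1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)

  // Remember the offset ranges of each batch on the driver so the output code can see them.
  var offsetRanges = Array.empty[OffsetRange]
  val S1WithOffsets = S1.transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }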
I was hoping each batch would be atomic, i.e. as long as there is an error, the offsets would not advance and no writes would happen. But I have observed two issues:
1) The Kafka offsets kept moving on to the next batch even though a dependent stream (e.g. S3) failed.
2) Partial writes went to Elasticsearch.
We would like to see:
1) The offsets stop advancing if anything in the job fails, and Spark Streaming recovers by itself from the right offsets (roughly along the lines of the sketch below).
2) All streams written as one unit.
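To make (1) concrete, this is the shape I imagine, following the transactional pattern from the article: keep the offsets in my own store, resume from them on startup, and only advance them after the writes for the batch have succeeded. It is only a sketch; OffsetStore, someTransformation and SaveToElasticSearch are my own placeholders, and I still do not see how updateStateByKey and the second output fit in.

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

  // Resume from the offsets I committed myself, instead of the checkpoint directory.
  val fromOffsets: Map[TopicAndPartition, Long] = OffsetStore.load()

  val S1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

  var ranges = Array.empty[OffsetRange]
  val S2 = S1.transform { rdd =>
    ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges  // offsets for this batch
    rdd
  }.flatMap(someTransformation)

  S2.foreachRDD { rdd =>
    SaveToElasticSearch(rdd)     // write the batch first
    OffsetStore.commit(ranges)   // only advance the stored offsets if the write succeeded
  }

With something like this, an HDFS checkpoint failure would at least not silently move the offsets forward, because they only advance when my own commit runs. But I am not sure how to extend it to cover both S2 and S3 as one unit.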
Any suggestions?
Tian