qubole / kinesis-sql

Kinesis Connector for Structured Streaming
http://www.qubole.com
Apache License 2.0
137 stars 80 forks source link

Cannot restart the streaming job from where it stopped previously in the kinesis stream #101

Open prashanthvg89 opened 3 years ago

prashanthvg89 commented 3 years ago

Hi,

I could not find a way to start the streaming job from where it left off previously in the kinesis stream even with using Spark streaming's "checkpointLocation" option.

TRIM_HORIZON starts all the way from the beginning and LATEST starts from the current timestamp. Let's say I stop (or restarts due to an error) the streaming job, how to restart from the offset where it left off? I believe this is possible with KCL 2.0 but "kinesis-sql" library only supports KCL 1.0. Is there a way to achieve this or not possible at all until we move to KCL 2.0?

Thanks, Prashanth

roncemer commented 2 years ago

I sent an email to the maintainer of this repo, and didn't get a response. For now, I've forked the project here https://github.com/roncemer/kinesis-spark-connector and updated it to build for Spark 3.2.1. Under Spark 3.2.1, it appears to be working correctly. I allowed it to go overnight with no new records being posted to the Kinesis data stream. When I started posting records again, the records arrived in Spark, and were processed.

I issued pull request https://github.com/qubole/kinesis-sql/pull/113/files from my forked repo back to the original qubole/kinesis-sql repo. Be sure to read the full description under the Comments tab as well.

I could use the help of anyone who might be interested in this project. Apparently, qubole/kinesis-sql is abandoned for about two years, and the main guy who was maintaining it doesn't respond to emails. If anyone can get in touch with someone who has control of this repo, please ask them to make me a committer. Barring that, I will have to publish the jar file from my forked repo as a maven resource. In any case, if I end up maintaining this project, either the original or forked repo, I will need volunteers to help handle the builds each time a new version of Spark is released.

In the mean time, feel free to build your own jar from my forked repo and run it under Spark 3.2.1. Also, let me know if it works under 3.3.x, or if you run into any other problems.

Thanks! Ron Cemer