adamlbailey opened 4 years ago
Update:
Using an epoch timestamp instead of ISO-formatted strings solves the problem. It would be nice to support ISO strings as well, for readability and for compatibility with the options afforded by other Kinesis libraries. I will consider adding a PR to address this soon.
{
  "metadata": {
    "streamName": "QSR-data-stream-production",
    "batchId": "1"
  },
  "shardId-000000000002": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000003": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000004": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000005": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  }
}
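As a side note on the ISO-string point above: the epoch-millisecond strings in this object can be derived from ISO-formatted timestamps with a small standard-library helper. This is only an illustrative sketch (the helper name `iso_to_epoch_millis` is my own, not part of the connector), assuming naive timestamps should be treated as UTC:

```python
import json
from datetime import datetime, timezone

def iso_to_epoch_millis(iso_string: str) -> str:
    """Convert an ISO-8601 timestamp to the epoch-millisecond string
    format shown above, e.g. "1601495926480"."""
    dt = datetime.fromisoformat(iso_string)
    if dt.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return str(round(dt.timestamp() * 1000))

# "2020-09-30T19:58:46.480+00:00" corresponds to the thread's value 1601495926480.
position = iso_to_epoch_millis("2020-09-30T19:58:46.480+00:00")
starting_position = json.dumps({
    "metadata": {"streamName": "QSR-data-stream-production", "batchId": "1"},
    "shardId-000000000002": {"iteratorType": "AT_TIMESTAMP",
                             "iteratorPosition": position},
})
```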
@adamlbailey - Looking forward to the PR.
Hi @adamlbailey,
I am trying to read data from Kinesis using at_timestamp as the option in startingposition. Here is the piece of code I am using:
pos = json.dumps({"at_timestamp": "02/26/2021 3:07:13 PDT"})
kinesisDF = (spark.readStream.format("kinesis")
    .option("streamName", name)
    .option("endpointUrl", URL)
    .option("awsAccessKeyId", key)
    .option("awsSecretKey", sKey)
    .option("startingposition", pos)
    .load())
Here is the error message I am receiving:
pyspark.sql.utils.IllegalArgumentException: 'org.json4s.package$MappingException: Expected object but got JString(02/26/2021 3:07:13 PDT)'
I am new to this Kinesis connector, and I know the way I am passing the value for the starting position is wrong. Could you help me with how to pass at_timestamp as the value for the startingposition option?
Thanks in Advance!
Hi @gopi-t2s, I'm somewhat removed from this work now but if memory serves:
You're going to want to construct an object like the one in my previous comment. Practically, I did this by writing a helper that described the stream, so I could list each shard with the right timestamp Long value.
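One way such a helper could look, as a hedged sketch: the function name and the hard-coded shard IDs below are illustrative only. In practice the shard IDs would come from the Kinesis ListShards/DescribeStream API (e.g. via boto3's kinesis client), which is omitted here to keep the example self-contained:

```python
import json

def build_starting_position(stream_name, shard_ids, epoch_millis):
    """Build the startingposition JSON the connector expects: one
    AT_TIMESTAMP entry per shard, keyed by shard ID."""
    position = {"metadata": {"streamName": stream_name, "batchId": "1"}}
    for shard_id in shard_ids:
        position[shard_id] = {
            "iteratorType": "AT_TIMESTAMP",
            "iteratorPosition": str(epoch_millis),
        }
    return json.dumps(position)

# Shard IDs hard-coded for illustration; normally fetched via ListShards.
pos = build_starting_position(
    "QSR-data-stream-production",
    ["shardId-000000000002", "shardId-000000000003"],
    1601495926480,
)
```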
Thanks @adamlbailey for your input.
I ran into this as well @gopi-t2s - were you able to make it work? I was unsure whether PySpark was going to be supported for this.
No @chadlagore, I am still looking for a way to achieve this.
I got it working with the example above from @adamlbailey.
Minimal example in PySpark:
import json
from datetime import datetime

# current time as epoch milliseconds, e.g. "1601495926000"
now_ts = str(int(datetime.now().timestamp())) + "000"
from_timestamp = {
    "metadata": {
        "streamName": "my-stream",
        "batchId": "1"
    },
    "shardId-000000000000": {
        "iteratorType": "AT_TIMESTAMP",
        "iteratorPosition": now_ts
    }
}
starting_position = json.dumps(from_timestamp)
my_stream = (spark
    .readStream
    .format('kinesis')
    .option('streamName', "my-stream")
    .option('endpointUrl', KINESIS_ENDPOINT)
    .option('region', KINESIS_REGION)
    .option('startingposition', starting_position)
    .load())
hope this helps @chadlagore @gopi-t2s
Thank you @nikitira, I will try this.
Excellent addition for reading from a stream at specific positions per #78.
However, I'm having trouble using the "AT_TIMESTAMP" option to read from shards at specific timestamps.
The shardInfo object I'm using is the one shown at the top of the thread.
Here is the exception: