Hi @Neenu1995! Sorry for the direct ping, but is there any chance you or someone else on this project could take a look?
Hi Chris,
Thanks for the question!
Detailed explanation:
The current API sorts the records by I/O system ingestion time; the row with the largest I/O ingestion time 'wins'. For the example you presented, all of the rows appended in the same request share the same I/O ingestion time, so in theory the system just picks one of them arbitrarily.
You might ask how to ensure the ordering from the client side. Option 1 (a rough sketch follows at the end of this comment):

`streamWriter.append("key1", row1)`
`streamWriter.append("key1", row2)`

The latter append will have the larger I/O ingestion time.
Option 2 (not available yet--stay tuned): clients can specify an ordering number on the client side, and the system will use the provided ordering to sort the records.
Please let me know if you have further questions.
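
For illustration, here is a rough sketch of what Option 1 could look like with the Java client's `JsonStreamWriter` (not an official recommendation; the project, dataset, table, schema, and field names are all made up, and CDC pseudo-columns are omitted). The idea is simply to block on each `ApiFuture` before issuing the next append, so the two versions of the key arrive in separate, sequential requests:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableFieldSchema;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.TableSchema;
import org.json.JSONArray;
import org.json.JSONObject;

public class OrderedAppendsSketch {

  public static void main(String[] args) throws Exception {
    // Hypothetical table; replace with your own project/dataset/table.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    // Minimal schema matching the rows below ("id" plays the role of the primary key).
    TableSchema schema =
        TableSchema.newBuilder()
            .addFields(
                TableFieldSchema.newBuilder()
                    .setName("id")
                    .setType(TableFieldSchema.Type.STRING)
                    .setMode(TableFieldSchema.Mode.REQUIRED)
                    .build())
            .addFields(
                TableFieldSchema.newBuilder()
                    .setName("value")
                    .setType(TableFieldSchema.Type.STRING)
                    .build())
            .build();

    try (JsonStreamWriter writer =
        JsonStreamWriter.newBuilder(table.toString(), schema).build()) {

      // Two versions of the same primary key; row2 is the newer one and should win.
      JSONObject row1 = new JSONObject().put("id", "key1").put("value", "v1");
      JSONObject row2 = new JSONObject().put("id", "key1").put("value", "v2");

      // Send row1 and block on its ApiFuture before sending row2, so the two
      // rows land in separate, sequential requests.
      ApiFuture<AppendRowsResponse> first = writer.append(new JSONArray().put(row1));
      first.get();

      ApiFuture<AppendRowsResponse> second = writer.append(new JSONArray().put(row2));
      second.get();
    }
  }
}
```

The obvious downside of this pattern is one round trip per row, which is the performance concern raised in the follow-up below.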
Ah, neat! Thanks for the insight, this is really helpful.
Option 1 does seem feasible--some follow-up questions:

- Is there a performance penalty for issuing a separate `StreamWriter::append` call for each individual row?
- If I issue multiple calls to `StreamWriter::append` without waiting on the resulting `ApiFuture` instances, am I guaranteed that those requests will be processed in order? (Based on local testing, this doesn't seem to be the case 🙁)

Option 2 would be lovely, please let me know if/when it lands! If you'd accept a small suggestion--it might be nice to have the stream writer automatically number rows in ascending order.
I should also note that pre-compaction (i.e., only sending the latest-available record for each primary key in each upstream batch to BigQuery) seems like a viable option too--thoughts?
Yes, issuing a separate append call for each individual row is not preferred, for performance reasons.
What's the granularity of the ingestion time? Milliseconds.
Does this change the atomicity of insertions? It will, since each row becomes its own request.
Pre-compaction would work perfectly for your case, but ideally we would do that work for you: you would just specify which row to pick per key by assigning the ordering number mentioned in Option 2.
Wonderful, thanks @anahan0369! Given the performance limitations and the potential gotchas with timestamp granularity, I think we'll use pre-compaction for now, but Option 2 would definitely be fantastic if/when it's available.
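
For what it's worth, the per-batch pre-compaction described above can be as simple as keeping the last record seen for each primary key before building the append request. A minimal sketch (the class name, generics, and key extractor are hypothetical, not from any particular library):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Keeps only the latest-available record per primary key within one upstream batch. */
public final class BatchPreCompactor<K, R> {

  private final Function<R, K> primaryKeyExtractor;

  public BatchPreCompactor(Function<R, K> primaryKeyExtractor) {
    this.primaryKeyExtractor = primaryKeyExtractor;
  }

  /**
   * Returns the batch with earlier records for each key dropped. Later records overwrite
   * earlier ones for the same key; first-seen key order is preserved.
   */
  public List<R> compact(List<R> batch) {
    Map<K, R> latestByKey = new LinkedHashMap<>();
    for (R record : batch) {
      latestByKey.put(primaryKeyExtractor.apply(record), record);
    }
    return new ArrayList<>(latestByKey.values());
  }
}
```

With something like `new BatchPreCompactor<String, MyRecord>(MyRecord::key).compact(batch)` (where `MyRecord` is a placeholder for the upstream record type), each append request contains at most one row per primary key, so the server-side tie-breaking described earlier never comes into play.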
I've noticed some interesting behavior when multiple rows with the same primary key are written in a single CDC-enabled Storage Write API request: it seems like the first row for each given primary key takes precedence. It's debatable whether this is a bug, but it'd be nice to have some clarification on this behavior.
I ask because this scenario can arise when consuming from, e.g., a stream of Kafka records generated by Debezium, and it seems like there are three options to deal with it.