Closed: samfadrigalan closed this issue 3 years ago
Just to avoid any confusion, this problem is not related to distribution implementation :).
About these suggestions:
- Ensure that init is indeed broken. Make sure `configure` is set to false during startup.
Good call @ao2017! It turns out the column family creation works; we just have to pass the proper config variable and set `configure` to true. I created a PR to the internal config repo.
I will repurpose this ticket to address the exception handling.
@samfadrigalan - some evidence lending weight to this issue:
I have concerns about moving to late acks. Right now unprocessable messages get dropped; if we switch to late-ack, those go back to Pub/Sub and will be retried continuously. A relatively small number of unprocessable messages could plausibly lock up the system. If we know that the messages are bad we could ack them despite not successfully processing them, but if we do that we now have a system where only known failures are safely handled and unknown failures cause queue buildup. I think in general we should let messages blow up when we don't know what to do with them and only retry if we're very certain it's safe to do so.
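The trade-off above could be written down as an explicit ack policy. This is an illustrative sketch only; the class, enum, and failure categories (`AckPolicy`, `Failure`, etc.) are invented for the example and are not Heroic's actual consumer code:

```java
// Illustrative ack policy: ack (i.e. drop) messages we know can never
// succeed so they don't clog the queue, and nack only failures we are
// confident are transient. Unknown failures default to ack to avoid
// retry storms, per the discussion above.
public class AckPolicy {
    public enum Decision { ACK, NACK }

    // Hypothetical failure categories a consumer might distinguish.
    public enum Failure { MALFORMED_PAYLOAD, SCHEMA_MISSING, BACKEND_UNAVAILABLE, UNKNOWN }

    public static Decision decide(Failure failure) {
        switch (failure) {
            case MALFORMED_PAYLOAD: // known-bad: retrying can never succeed
            case SCHEMA_MISSING:    // e.g. missing column family: retries loop until a deploy
                return Decision.ACK;
            case BACKEND_UNAVAILABLE: // transient: safe to retry
                return Decision.NACK;
            default:
                // unknown failure: drop rather than risk queue buildup, but log loudly
                return Decision.ACK;
        }
    }
}
```

The asymmetry the comment describes is visible here: only the explicitly-listed transient case is retried, and everything unknown falls through to the drop path.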
For that concern, can we use dead-letter queues (https://cloud.google.com/pubsub/docs/dead-letter-topics)? This way we get visibility into the failures and avoid data loss without clogging up the main queues. Because we have copies of the failed messages, we could easily reproduce the failures in a non-prod pipeline. We could then redirect the messages back to the main consumer processing once we've released a fix.
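For concreteness, here is a toy model of that dead-letter flow. This is not the Pub/Sub API (all names are invented); it just shows the intended behavior: after `maxDeliveryAttempts` failed deliveries, the message is preserved on a dead-letter topic instead of being redelivered:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a dead-letter policy: failed messages are retried up to
// maxDeliveryAttempts times, then moved to a dead-letter topic where they
// are kept for inspection and later replay, instead of clogging the queue.
public class DeadLetterModel {
    final int maxDeliveryAttempts;
    final Map<String, Integer> attempts = new HashMap<>();
    final List<String> deadLetterTopic = new ArrayList<>();

    DeadLetterModel(int maxDeliveryAttempts) {
        this.maxDeliveryAttempts = maxDeliveryAttempts;
    }

    /** Called when the consumer nacks a message; returns true if it should be redelivered. */
    boolean onNack(String messageId) {
        int n = attempts.merge(messageId, 1, Integer::sum);
        if (n >= maxDeliveryAttempts) {
            deadLetterTopic.add(messageId); // preserved for debugging / replay
            return false;
        }
        return true; // redeliver to the main consumer
    }
}
```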
Yeah, that sounds like a good plan; the visibility would be great. I wonder how useful redirecting will be in practice, though: if we need to make a code change and redeploy, the queue could become far too big to process.
So it turns out Pub/Sub only dead-letters after a minimum of 5 delivery attempts, and that minimum can't be lowered. If there's an issue like the missing column family that spawned this ticket, the cluster ends up doing 5x as much work before rejecting the message.
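On subscriptions with a dead-letter policy, Pub/Sub surfaces the delivery attempt as a message attribute, so a consumer could at least observe (and log) the repeated work. A minimal stdlib-only sketch, assuming the standard `googclient_deliveryattempt` attribute and taking the attribute map as a plain `Map` for illustration:

```java
import java.util.Map;

// Helper illustrating the 5-attempt floor discussed above: with the minimum
// maxDeliveryAttempts of 5, a permanently-bad message is still processed
// (and fails) 5 times before Pub/Sub dead-letters it. The attempt count is
// exposed as a message attribute on dead-letter-enabled subscriptions.
public class DeliveryAttempts {
    static final String ATTR = "googclient_deliveryattempt";

    /** Returns the delivery attempt, or 1 if the attribute is absent. */
    public static int attempt(Map<String, String> attributes) {
        String v = attributes.get(ATTR);
        return v == null ? 1 : Integer.parseInt(v); // absent on non-DLQ subscriptions
    }
}
```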
I'm working on improving the logging and tracing so there is better visibility for write failures, but I'm planning to leave the Pub/Sub acking as-is.
I'm having trouble reproing the lack of logs. I tried running IT tests without the column family created, and the exception was very visibly logged:
SEVERE: Could not complete RPC. Failure #0, got: Status{code=NOT_FOUND, description=table "projects/fake/instances/fake/tables/heroic_it_ff0cb89d-c12b-4f43-9a48-7aad4b9d778c" not found, cause=null} on channel 65.
Trailers: Metadata(content-type=application/grpc,bigtable-channel-id=65)
java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: INTERNAL: unknown family "points"
/* full exception omitted for brevity */
Were you able to reproduce that log in staging or locally? I wonder if there is a difference between the setup in the IT tests and running the consumers as-is in prod or locally (e.g., are the gRPC calls in the production consumer flow wrapped in async functions while the IT calls aren't?). I noticed that log does not exactly match the stack trace in the description. I had to set up the consumers locally and evaluate expressions in debug mode to get that stack trace, as there were no error logs in the cloud or in the local setup when the consumer tried to process messages.
It should be the same flow: the writes happen in a flush call that fires on a timer, so it's not tied to the actual request. I'll try running the consumers locally with the BT emulator and see what happens.
DoD
Heroic should log and surface exceptions from failed Bigtable writes instead of silently acking the messages.
Background
The Heroic Bigtable consumer failed to write to Bigtable when the new column family had not been created yet. No exception was logged, and the consumer ack-ed the message as if the write was successful. I got the exception below via hacky debugger evaluations.
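A minimal reconstruction of that failure mode, with invented names rather than the actual consumer code: the message is acked as soon as the write is scheduled, and the write's exception is never joined or given an error callback, so it vanishes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the bug described above: the consumer acks immediately after
// scheduling the write, so when the write later fails asynchronously
// (e.g. the column family does not exist) the message is already gone
// and the exception is swallowed inside the unobserved future.
public class EagerAckConsumer {
    static final List<String> ACKED = new ArrayList<>();

    static void consume(String messageId, boolean columnFamilyExists) {
        CompletableFuture<Void> write = CompletableFuture.runAsync(() -> {
            if (!columnFamilyExists) {
                throw new IllegalStateException("unknown family \"points\"");
            }
        });
        ACKED.add(messageId); // ack-ed regardless of whether the write succeeds
        // `write` is never joined and has no error callback, so a failure
        // is neither logged nor retried.
    }
}
```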