risingwavelabs / risingwave


grpc request may hang when error message is too large after bumping to tonic v0.12 #18039

Open BugenZhao opened 3 months ago

BugenZhao commented 3 months ago

e2e-sink-test now consistently hangs here:

https://github.com/risingwavelabs/risingwave/blob/8d8c1360ce36ca919a7f784ed5fe1a4fac9bbb41/e2e_test/sink/kafka/avro.slt#L189-L196

I found that disabling SCHEMA_REGISTRY_DEBUG here makes the issue go away.

https://github.com/risingwavelabs/risingwave/blob/8d8c1360ce36ca919a7f784ed5fe1a4fac9bbb41/ci/docker-compose.yml#L269

The only difference is that there won't be backtraces from the schema registry in the error message.

failed to validate sink: config error: all request confluent registry all timeout, req path ["subjects", "test-rw-sink-upsert-avro-err-key", "versions", "latest"], urls http://schemaregistry:8082/
    confluent schema registry error 40401: Subject 'test-rw-sink-upsert-avro-err-key' not found. io.confluent.rest.exceptions.RestNotFoundException: Subject 'test-rw-sink-upsert-avro-err-key' not found.
- io.confluent.rest.exceptions.RestNotFoundException: Subject 'test-rw-sink-upsert-avro-err-key' not found.
-   at io.confluent.kafka.schemaregistry.rest.exceptions.Errors.subjectNotFoundException(Errors.java:78)
-   at io.confluent.kafka.schemaregistry.rest.resources.SubjectVersionsResource.getSchemaByVersion(SubjectVersionsResource.java:154)
-   at jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

Based on this, I suspect the cause is that we always encode the ServerError in gRPC (HTTP/2) headers (#13282), and there's an outstanding issue where tonic 0.12 hangs forever when the header size exceeds some limit.

https://github.com/risingwavelabs/risingwave/blob/8d8c1360ce36ca919a7f784ed5fe1a4fac9bbb41/src/error/src/tonic.rs#L68-L71
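To make the suspected failure mode concrete, here is a minimal std-only sketch of bounding an error message before it is placed in a size-limited carrier such as HTTP/2 headers. It is an illustration only: the names `MAX_DETAIL_BYTES`, `truncate_to_budget`, and `bounded_error_detail` are hypothetical, the 8 KiB budget is an assumed value rather than tonic's actual limit, and this is not the code in `src/error/src/tonic.rs` nor the workaround adopted in this issue.

```rust
/// Hypothetical byte budget for error details carried in gRPC metadata.
/// Assumption: the real HTTP/2 header-list limit depends on the peer's settings.
const MAX_DETAIL_BYTES: usize = 8 * 1024;

/// Truncate `msg` to at most `max_bytes` bytes without splitting a UTF-8 character.
fn truncate_to_budget(msg: &str, max_bytes: usize) -> &str {
    if msg.len() <= max_bytes {
        return msg;
    }
    // Walk back from the budget until we land on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !msg.is_char_boundary(end) {
        end -= 1;
    }
    &msg[..end]
}

/// Return the message unchanged if it fits the budget; otherwise keep the
/// leading part and append a short marker saying how many bytes were dropped.
/// Note: the marker itself is appended on top of the budget.
fn bounded_error_detail(full: &str) -> String {
    let kept = truncate_to_budget(full, MAX_DETAIL_BYTES);
    if kept.len() < full.len() {
        let omitted = full.len() - kept.len();
        format!("{kept}\n... (truncated, {omitted} bytes omitted)")
    } else {
        kept.to_string()
    }
}
```

With a guard like this, a long Java backtrace from the schema registry would be cut down before being serialized, so the summary line survives while the oversized tail is dropped. Whether truncation is preferable to raising the header size limit is a design choice this sketch does not settle.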

Upstream issues:

At the moment there seems to be no fix. I'll disable SCHEMA_REGISTRY_DEBUG for now as a workaround and open an issue for this.

Originally posted by @BugenZhao in https://github.com/risingwavelabs/risingwave/issues/17889#issuecomment-2287904683

BugenZhao commented 3 months ago

With this configuration exposed, we're able to work around this issue:

https://github.com/hyperium/tonic/pull/1835

Waiting for a new version to be released.

BugenZhao commented 1 month ago

Worked around with #18639.