redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

Pandaproxy: register and validate schema using the schema registry #6370

Open jrkinley opened 1 year ago

jrkinley commented 1 year ago

Pandaproxy does not provide the ability to validate incoming messages using Avro, Protobuf, or JSON schema stored in the schema registry.

This feature request is to add support for the value_schema and value_schema_id fields so that HTTP clients can post either the full schema or an existing schema ID alongside the records. Pandaproxy shall register the provided schema with the schema registry (in the case of value_schema) or retrieve the existing schema (in the case of value_schema_id) and use the schema to validate and serialise messages before storing them in Redpanda.
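To make the requested behaviour concrete, here is a rough sketch of the register-or-retrieve step in Python. It is not Pandaproxy code: `InMemoryRegistry`, `resolve_value_schema`, and the `<topic>-value` subject name are all assumptions made for illustration, and validation and serialisation are left out entirely.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for the schema registry (not a Redpanda API)."""

    def __init__(self):
        self._ids = {}     # schema text -> schema ID
        self._by_id = {}   # schema ID -> schema text

    def register(self, subject, schema):
        # Registering an already-known schema returns its existing ID.
        # The subject is accepted but ignored in this simplified stand-in.
        if schema not in self._ids:
            new_id = len(self._by_id) + 1
            self._ids[schema] = new_id
            self._by_id[new_id] = schema
        return self._ids[schema]

    def get(self, schema_id):
        return self._by_id[schema_id]


def resolve_value_schema(body, topic, registry, cache):
    """Return (schema_id, schema) for a produce request body."""
    if "value_schema" in body:
        # Full schema posted inline: register it, then remember it.
        schema = body["value_schema"]
        # Assumed subject naming: "<topic>-value".
        schema_id = registry.register(topic + "-value", schema)
        cache[schema_id] = schema
        return schema_id, schema
    if "value_schema_id" in body:
        # Only an ID posted: hit the registry only on a cache miss.
        schema_id = body["value_schema_id"]
        if schema_id not in cache:
            cache[schema_id] = registry.get(schema_id)
        return schema_id, cache[schema_id]
    raise ValueError("request needs value_schema or value_schema_id")
```

The returned ID is what would be echoed back as value_schema_id in the response, and the returned schema is what the records would be validated and serialised against.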

For example, this curl command should result in the value_schema schema being registered in the schema registry and used to validate and serialise the list of records before storing them in Redpanda. Pandaproxy should include the value_schema_id in the response:

curl -X POST -H "Content-Type: application/vnd.kafka.avro.v2+json" \
     --data '{
         "value_schema": "{\"type\": \"record\", \"name\": \"user\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}",
         "records": [{"value": {"name": "james"}}]}' \
     "http://localhost:8082/topics/avrotest"

{"offsets":[{"partition": 0, "offset": 0, "error_code": null, "error": null}], "key_schema_id": null, "value_schema_id": 1}

In subsequent messages only {"value_schema_id": 1, "records":[...]} need be provided and Pandaproxy will fetch the corresponding schema from the schema registry if it isn't cached.
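From the client side, the two request shapes could be built like this. This is a minimal sketch using only Python's standard `json` module; the helper names are made up for illustration, and the response format is the one proposed in this issue rather than something Pandaproxy returns today.

```python
import json

USER_SCHEMA = {
    "type": "record",
    "name": "user",
    "fields": [{"name": "name", "type": "string"}],
}


def first_request_body(schema, records):
    # value_schema is a JSON string nested inside the JSON body,
    # so the schema is JSON-encoded twice.
    return json.dumps({
        "value_schema": json.dumps(schema),
        "records": [{"value": r} for r in records],
    })


def followup_request_body(schema_id, records):
    # Later requests send only the ID returned by the first one.
    return json.dumps({
        "value_schema_id": schema_id,
        "records": [{"value": r} for r in records],
    })


def schema_id_from_response(response_text):
    # Pull value_schema_id out of a produce response body.
    return json.loads(response_text)["value_schema_id"]
```

A client would POST `first_request_body(USER_SCHEMA, [{"name": "james"}])` once, read the ID with `schema_id_from_response`, and use `followup_request_body` for every request after that.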

JIRA Link: CORE-1018

thesammy2010 commented 1 year ago

https://docs.confluent.io/platform/current/tutorials/examples/clients/docs/rest-proxy.html#consume-avro-records (point 8)

jrkinley commented 1 year ago

> https://docs.confluent.io/platform/current/tutorials/examples/clients/docs/rest-proxy.html#consume-avro-records (point 8)

The consumer section only has points 1-5. Do you mean point 8 in the producer section (i.e. use of the schemaid variable)?

jcsp commented 1 year ago

I understand wanting to validate messages on the way in, but it's not obvious to me why someone would need to register a new schema at the same time as producing a message. Why wouldn't they use the schema registry API to register the schema?

jrkinley commented 1 year ago

@jcsp while I agree with you, I think the Pandaproxy should support both options in order to be compatible with Confluent's schema registry here (if that is the goal).

jcsp commented 1 year ago

I don't see the value_schema version (creating a schema inline with a produce) in the linked Confluent docs page. I believe you, but it would be good to have a link to the docs that describe it.

CC @mattschumpert for awareness on the question of whether our API should aim to be wire-compatible with Confluent's. This particular request (creating schemas via Pandaproxy) is an example of something that seems quirky and that we probably wouldn't do otherwise.

jrkinley commented 1 year ago

@jcsp see the examples in their proxy quick start guide: https://docs.confluent.io/platform/current/kafka-rest/quickstart.html#produce-and-consume-avro-messages

thesammy2010 commented 1 year ago

There may be a misunderstanding: we'd want to send with value_schema_id, not the actual schema. The website didn't quite point to the right place on that.

DmitriiMukhin commented 3 months ago

Upvoting the issue.

We also need this feature.

Produce and Consume Avro Messages

Produce a message using Avro embedded data, including the schema, which will be registered with the schema registry and used to validate and serialize the data before it is stored in Kafka:

curl -X POST -H "Content-Type: application/vnd.kafka.avro.v2+json" \
     -H "Accept: application/vnd.kafka.v2+json" \
     --data '{"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}", "records": [{"value": {"name": "testUser"}}]}' \
     "http://localhost:8082/topics/avrotest"