kafkajs / confluent-schema-registry

is a library that makes it easier to interact with the Confluent schema registry
https://www.npmjs.com/package/@kafkajs/confluent-schema-registry
MIT License

Question/Feature: Auto create schema from payload #80

Closed: rob3000 closed this 3 years ago

rob3000 commented 3 years ago

This is more of a question/enhancement: I was wondering if it would be possible to create a schema from a payload that is being sent, say when using KafkaJS?

something like:

const kafka = new Kafka({
  logLevel: logLevel.DEBUG,
  brokers: [`${host}:9092`],
  clientId: 'example-producer',
})

const topic = 'topic-test'
const producer = kafka.producer();

const message = {
  topic,
  key: '1234',
  value: {
    foo: 'bar',
    baz: 'zee'
  }
}
// Store the schema
registry.register(message).then(result => {
  producer.send({
      topic,
      compression: CompressionTypes.GZIP,
      messages: [
        {
          key: result.key,
          value: result.value
        }
     ],
  })
})

One of the limitations I can see would be blocking the sending of a message until the schema has been created. Thoughts?

Nevon commented 3 years ago

Blocking is not an issue. You just wrap producer.send in another async function that possibly generates the schema, encodes the key and value, and then sends (note that this would be done outside of @kafkajs/confluent-schema-registry).
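A minimal sketch of that wrapper pattern. The registry and producer are stubbed out here so the snippet is self-contained; with the real libraries you would use registry.register/registry.encode from @kafkajs/confluent-schema-registry and producer.send from kafkajs, and the schema itself would be written by hand rather than generated.

```javascript
// Stub registry: register() returns a schema id, and encode() prefixes
// the payload with a magic byte and the id, loosely mimicking the
// Confluent wire format (magic byte + 4-byte schema id + encoded body).
const registry = {
  async register(schema) {
    return { id: 1 }
  },
  async encode(id, payload) {
    return Buffer.concat([
      Buffer.from([0, 0, 0, 0, id]),
      Buffer.from(JSON.stringify(payload)),
    ])
  },
}

// Stub producer that records what it was asked to send.
const sent = []
const producer = {
  async send(record) {
    sent.push(record)
  },
}

// The wrapper described above: register the schema, encode the value,
// then send. Callers await this instead of calling producer.send directly,
// so "blocking" until the schema exists is just ordinary async sequencing.
async function sendWithSchema({ topic, schema, key, value }) {
  const { id } = await registry.register(schema)
  const encodedValue = await registry.encode(id, value)
  await producer.send({
    topic,
    messages: [{ key, value: encodedValue }],
  })
  return id
}
```

Note that nothing here lives inside the registry client itself; it is plain application code composing the two libraries.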

The problems I see are:

  1. What would be the point of having a schema if we generate it at runtime by introspecting the payload? If you make a mistake when generating the payload, that will just lead to an incorrect schema being published, and so now consumers have to deal with an incorrect message anyway. To me, there's not really any clear advantage over using schemaless JSON in that case.
  2. It would only really be possible to create very simple schemas, since JS doesn't have any way of encoding more complex type information. For example, we can figure out that a value should be a string, but we can't know if it's nullable or not. We also can't know if something is an enum or a union. There's not really any way of knowing if two objects are actually the same type or if they just happen to look the same (maybe a "Product" and a "User" both have a "name" property - that doesn't mean they are both the same thing). From typescript types, you could generate pretty good schemas, but that would be at build time, not runtime, and falls outside of the scope of this package.
  3. Generated type names would be very generic, so the schema wouldn't really help much in terms of documentation. You could possibly use the property name to generate the type name for the value, but it wouldn't be great in a lot of cases, and we'd still need to deal with multiple types with the same name (Foo1, Foo2).

All in all, I don't see this being a useful feature. I could see a standalone tool for generating schemas from data being useful as a starting point, though. So you'd feed it an example of the data you're expecting to send, it generates a schema and then you tweak the schema to deal with some of the issues I mentioned above. It'd take some of the busywork out of writing schemas.
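To make the limitations concrete, here is a rough sketch of what such a standalone generator could look like: it walks a sample payload and emits a naive Avro-style record schema. The function names and the type mapping are illustrative, not part of any existing tool, and the output exhibits exactly the problems listed above: every string becomes "string" (never an enum or a union), nothing is inferred as nullable, and record names are generated mechanically from property names.

```javascript
// Infer a naive Avro type from a sample JS value.
function inferAvroType(name, value) {
  if (typeof value === 'string') return 'string'
  if (typeof value === 'boolean') return 'boolean'
  if (typeof value === 'number') {
    // No way to tell "long" from "int" (or intent) from a sample value.
    return Number.isInteger(value) ? 'long' : 'double'
  }
  if (Array.isArray(value)) {
    // Assumes a homogeneous, non-empty array; infers from the first element.
    return { type: 'array', items: inferAvroType(name, value[0]) }
  }
  if (value !== null && typeof value === 'object') {
    return inferRecord(name, value)
  }
  throw new Error(`Cannot infer a type for "${name}"`)
}

// Build a record schema from a sample object.
function inferRecord(name, obj) {
  return {
    type: 'record',
    // Generated names are generic (issue 3 above): just the
    // capitalized property name.
    name: name.charAt(0).toUpperCase() + name.slice(1),
    fields: Object.entries(obj).map(([field, value]) => ({
      name: field,
      type: inferAvroType(field, value),
    })),
  }
}
```

Feeding it `{ foo: 'bar', baz: 'zee' }` yields a record with two plain string fields, which a human would then need to tweak by hand (unions, enums, nullability, doc strings) before registering it.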

But I definitely don't see that being part of @kafkajs/confluent-schema-registry. The scope of this package is pretty clear: it interfaces with the Confluent schema registry to register and fetch schemas and then lets you encode/decode payloads. Schema creation is outside of that scope and doesn't really overlap with either @kafkajs/confluent-schema-registry or kafkajs.

rob3000 commented 3 years ago

Thanks for the great response @Nevon and all the problems raised make perfect sense. Thanks!