gklijs / schema_registry_converter

A crate to convert bytes to something more useable and the other way around in a way Compatible with the Confluent Schema Registry. Supporting Avro, Protobuf, Json schema, and both async and blocking.
Apache License 2.0
98 stars 34 forks source link

Can this crate be made to support hyphenated subject names? #100

Open jslusher opened 8 months ago

jslusher commented 8 months ago

Is your feature request related to a problem? Please describe. We have an Avro schema registry set to be permissive about hyphens in its subject names. The python module has supported this for years now, as well as the Debezium source connector, KSQLdb and all of the sinks we've used. This crate doesn't seem to be able to handle subject names with hyphens.

Describe the solution you'd like I would like this crate to allow hyphens in subject names of permissive Avro schema registries.

Describe alternatives you've considered The alternative is that we re-emit several topics using underscores instead. This is a lot of work, and would only be necessary because of this crate. I'm not even sure ultimately that we can use this crate if we can't use hyphenated subject names.

Additional context You might know this already, but just in case, here's the feature that made Avro allow for hyphens in their subject names. As you can see, there are use cases where it's not realistic to be restrictive of something like a hyphen in a subject name since often the input is not within the control of the user writing the consumer.

gklijs commented 8 months ago

Hi Jon,

If it's something which is possible with the Java serializers, and not with this library, it should be considered a big.

For now, I don't really know the problem. For example there is this test using a hyphen in the subject name. If it's about some subject name within Avro itself, it's like an issue for the Avro depency used. In wich case there should be something changed to the code here.

Could you please provide additional information, so that I understand the problem better?

gklijs commented 8 months ago

I reread the issue and it's clear now. It would require both changes to this crate, and the implementation of avro for rust. Currently it's fixed to not allow hyphens.

While there are some issues to fix the current regex, as it also allows two dots on succession, I don't see any issue related to allowing hyphens. Feel free to create an issue on their Jira board. Once it's merged and released, we can support it from this library.

martin-g commented 8 months ago

The spec does not allow hyphens - https://avro.apache.org/docs/1.11.1/specification/#names

martin-g commented 8 months ago

The python module has supported this for years now

I am still not sure which Python module supports hyphens but it seems it is not the Avro Python SDK - https://github.com/apache/avro/blob/1652294d5691a895c392e34a2d57dae9701815c2/lang/py/avro/name.py#L36

gklijs commented 8 months ago

@martin-g, you are right that the spec doesn't allow it. But once it's supported by avro-rs, I don't see any harm in allowing it. The default will be to keep the validation like it is. Also, I don't want to fork/copy/maintain the code that does the checking. So if it's not supported by avro-rs, it won't be supported here as well. It also seems like it isn't supported in the java client. Either way, I'm happy to add functionality if it's easy, but it currently isn't. So, for now, this is on hold.

martin-g commented 8 months ago

But once it's supported by avro-rs

It won't be supported by apache_avro (the successor of avro_rs) unless it is in the spec!

I still think @jslusher talks about something else ...

gklijs commented 8 months ago

In that case, let's give him some time to reply. If that is what he meant, we can close this.

jslusher commented 8 months ago

@martin-g The python module I'm referring to is the confluent-kafka-python module. Versions 1.9.2 and later respect Avro's permissive setting. Take a look at the link and you'll see what I mean. Avro can be set to be more permissive about dashes since in many cases data might have hyphens in it.

@gklijs In our use case, we have a number of Avro schemas that have hyphens in only the subject name of the registry, but the schemas themselves adhere to the spec, so to me it seems like a bug that the crate would error on the subject name without checking the actual schema.

If I'm reading the spec properly:

Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace. Equality of names is defined on the fullname. The name portion of a fullname, record field names, and enum symbols must: start with [A-Za-z] subsequently contain only [A-Za-z0-9]

It's referring to items within the schema and not the name of the schema itself.

Here's an example of what I'm talking about. You can see that the schema itself is fine, and the name of the schema has hyphens. Perhaps I'm misunderstanding the spec, but it seems to me that this should be legal.

Screenshot 2023-11-03 at 5 33 23 PM

It's also worth noting that all of our connectors (Elasticsearch, Debezium, JDBC) use Avro and support our current configuration.

gklijs commented 8 months ago

This is getting a bit confusing. Since than the link you shared, has nothing to do with the actual issue? Since the link is about the schema itself, while the problem is in the subject name?

At this point it's not clear to me why there could not be multiple hyphens in de schema name, but I can try to test this out. However since there are multiple ways, it would help if you can share the error you encountered.

jslusher commented 8 months ago

Since than the link you shared, has nothing to do with the actual issue?

I shared a couple of links. To which are you referring? If it's the Apache issue link, it's related to the issue in that Avro can permit these schema names, but this crate is not respecting that permissiveness.

it would help if you can share the error you encountered

I will see about getting you the specific error. I created this issue on behalf of a colleague who can hopefully reproduce it.

martin-g commented 8 months ago

@jslusher You shared few links but most of them link to https://issues.apache.org/jira/browse/NIFI-4612. This is Apache NIFI project and its PR is for Java, not Python - https://github.com/apache/nifi/pull/2275/files !

From my understanding you are talking about debezium.abc-def-20230101.value which is a registry related thingy (a subject ?!) and it is not a Avro schema name (SchemaChangeValue in your image above)! I think we are clear that there is no need of a change in Apache Avro SDKs!

Please provide the error/stacktrace that you face to make it more clear what and where to change!

jslusher commented 8 months ago

@martin-g I had initially confused that link for the issue that made Avro itself permissive, but I see now that it's the NIFI project allowing a permissive setting. Apologies for the confusion there.

The version of the confluent python client that did not work was 1.8.2. When I upgraded the module to 1.9.2 I no longer got errors for dashes in Schema Registry subject names. Looking at the release notes, I don't see anything directly related to hyphenated subject names, but there are a couple of fixes related to Avro in version 1.9.0 that might be related. I personally don't know rust very well and I can't myself reproduce this error in rust. I'll get back soon with the actual stack trace my colleague received.

jslusher commented 8 months ago

Here is the error we're getting:

thread 'main' panicked at /app/src/utils/kafka.rs:67:35:
Error decoding value: Error: Supplied raw value "{\"type\":\"record\",\"name\":\"Envelope\",\"namespace\":\"debezium.abc-123-efg-20231005.table.u_table_dbz\",\"fields\":[{\"name\":\"before\",\"type\":[\"null\",{\"type\":\"record\",\"name\":\"Value\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"uid\",\"type\":{\"type\":\"long\",\"connect.default\":0},\"default\":0},{\"name\":\"release_id\",\"type\":{\"type\":\"long\",\"connect.default\":0},\"default\":0},{\"name\":\"notes\",\"type\":\"string\"},{\"name\":\"notes_public\",\"type\":{\"type\":\"string\",\"connect.version\":1,\"connect.parameters\":{\"allowed\":\"Y,N\"},\"connect.default\":\"N\",\"connect.name\":\"io.debezium.data.Enum\"},\"default\":\"N\"},{\"name\":\"added_ts\",\"type\":{\"type\":\"long\",\"connect.version\":1,\"connect.name\":\"io.debezium.time.Timestamp\"}}],\"connect.name\":\"debezium.abc-123-efg-20231005.table.u_table_dbz.Value\"}],\"default\":null},{\"name\":\"after\",\"type\":[\"null\",\"Value\"],\"default\":null},{\"name\":\"source\",\"type\":{\"type\":\"record\",\"name\":\"Source\",\"namespace\":\"io.debezium.connector.mysql\",\"fields\":[{\"name\":\"version\",\"type\":\"string\"},{\"name\":\"connector\",\"type\":\"string\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"ts_ms\",\"type\":\"long\"},{\"name\":\"snapshot\",\"type\":[{\"type\":\"string\",\"connect.version\":1,\"connect.parameters\":{\"allowed\":\"true,last,false,incremental\"},\"connect.default\":\"false\",\"connect.name\":\"io.debezium.data.Enum\"},\"null\"],\"default\":\"false\"},{\"name\":\"db\",\"type\":\"string\"},{\"name\":\"sequence\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"table\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"server_id\",\"type\":\"long\"},{\"name\":\"gtid\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"file\",\"type\":\"string\"},{\"name\":\"pos\",\"type\":\"long\"},{\"name\":\"row\",\"type\":\"int\"},{\"name\":\"thread\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"query\",\"type\":[\"null\",\"string\"],\"default\":null}],\"connect.name\":\"io.debezium.connector.mysql.Source\"}},{\"name\":\"op\",\"type\":\"string\"},{\"name\":\"ts_ms\",\"type\":[\"null\",\"long\"],\"default\":null},{\"name\":\"transaction\",\"type\":[\"null\",{\"type\":\"record\",\"name\":\"block\",\"namespace\":\"event\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"total_order\",\"type\":\"long\"},{\"name\":\"data_collection_order\",\"type\":\"long\"}],\"connect.version\":1,\"connect.name\":\"event.block\"}],\"default\":null}],\"connect.version\":1,\"connect.name\":\"debezium.abc-123-efg-20231005.table.u_table_dbz.Envelope\"}" cant be turned into a Schema, was cause by Invalid namespace debezium.abc-123-efg-20231005.table.u_table_dbz. It must match the regex '^([A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*)?$', it's retriable: false, it's cached: false

The reference in the panic message is referring to these lines in our consumer:

use schema_registry_converter::async_impl::avro::AvroDecoder;
...
    let sr_settings = SrSettings::new(args.avro_url);
    let decoder = AvroDecoder::new(sr_settings.clone());
...
                    let avro_payload = match decoder.decode(m.payload()).await {
                        Ok(r) => r.value,
                        Err(e) => panic!("Error decoding value: {}", e),
                    };

...
gklijs commented 8 months ago

As I see it, the input is just wrong. In the image you shared, the namespace of the record is io.debezium.connector.mysql in the code it's debezium.abc-123-efg-20231005.table.u_table_dbz which is an illegal namespace.

You are free in which schema registry subject, points to wich registered schema. But you, I think correctly, can't register an Avro Schema if it's not according to the specification.

Please let us know what you think.

jslusher commented 8 months ago

@gklijs Apologies, but I actually shared the wrong schema earlier. The schema that this consumer is trying to use does have a namespace with hyphens:

{
  "type": "record",
  "name": "Envelope",
  "namespace": "abc-123-efg-20231005.table.u_table_dbz",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
...

How is it that Avro is allowing the namespace there to have the illegal characters but our rust consumer complains about them? Our python consumers can accept these schemas but this AvroDecoder can not. I personally don't think it should be up to the decoder here to determine what's legal in the schema registry, especially if all it's doing is fetching the schema that is already registered. I can understand if this was a producer trying to register a schema that doesn't exist yet, but this in effect is rejecting a schema that our registry has accepted.

I think at the very least there should be an option to defer to the registry for what Avro will accept; if this decoder could have a permissive setting for instance, we could work around this issue. In my opinion this decoder is arbitrarily making restrictions on the registry that the registry itself isn't enforcing.

martin-g commented 8 months ago

Which library do you use for working with Avro in Python ?

On Tue, 7 Nov 2023 at 19:27, Jon Slusher @.***> wrote:

@gklijs https://github.com/gklijs Apologies, but I actually shared the wrong schema earlier. The schema that this consumer is trying to use does have a namespace with hyphens:

{ "type": "record", "name": "Envelope", "namespace": "abc-123-efg-20231005.table.u_table_dbz", "fields": [ { "name": "before", "type": [ "null", { ...

How is it that Avro is allowing the namespace there to have the illegal characters but our rust consumer complains about them? Our python consumers can accept these schemas but this AvroDecoder can not. I personally don't think it should be up to the decoder here to determine what's legal in the schema registry, especially if all it's doing is fetching the schema that is already registered. I can understand if this was a producer trying to register a schema that doesn't exist yet, but this in effect is rejecting a schema that our registry has accepted.

— Reply to this email directly, view it on GitHub https://github.com/gklijs/schema_registry_converter/issues/100#issuecomment-1799291694, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABYUQRCRSBMJWCP6GB4E5TYDJVQVAVCNFSM6AAAAAA6X7EHWCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJZGI4TCNRZGQ . You are receiving this because you were mentioned.Message ID: @.***>

jslusher commented 8 months ago

@martin-g We use confluent_python[avro]. Versions 1.9.2 or greater seem to support those namespaces. The example in their repo, using AvroDeserializer is close enough to what we have.

martin-g commented 8 months ago

This is the Kafka client. It should use some Avro library.

You may want to ask for help at the Kafka and/or Avro mailing lists. According to the link I pasted earlier the official Apache Avro Python SDK does not allow hyphens in the schema name(space)

On Tue, 7 Nov 2023 at 20:40, Jon Slusher @.***> wrote:

@martin-g https://github.com/martin-g We use confluent_python[avro] https://github.com/confluentinc/confluent-kafka-python. Versions 1.9.2 or greater seem to support those namespaces. The example in their repo https://github.com/confluentinc/confluent-kafka-python/blob/master/examples/avro_consumer.py, using AvroDeserializer is close enough to what we have.

— Reply to this email directly, view it on GitHub https://github.com/gklijs/schema_registry_converter/issues/100#issuecomment-1799521858, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABYUQWS5AQLAFCNPPPVZ2DYDJ6CPAVCNFSM6AAAAAA6X7EHWCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJZGUZDCOBVHA . You are receiving this because you were mentioned.Message ID: @.***>

jslusher commented 8 months ago

I appreciate your working through this with me. You're saying that rust's Avro SDK is what isn't supporting these names?

I see in the serializer section of the Python module that it seems to be using avro which is the same SDK to which you refer?

It looks like the most recent change there refers to Python's permissiveness of these name conflicts.

So this is in place for the Python client's use of the avro sdk but not for rust?

martin-g commented 8 months ago

Thanks for this link (https://github.com/apache/avro/commit/012338f959c50f41bd41c7ad6ff8d26262985105) ! I just approved your request for JIRA account at ASF! If Java and Python allow disabling of the validation due to the hyphen then I think we should just include the hyphen in the regex in the specification! But this should be discussed at dev@avro.apache.org !

jslusher commented 8 months ago

@martin-g Thank you for hearing me out!

I opened this issue (AVRO-3900) with the request. Let me know if you need any more information there to proceed.

martin-g commented 8 months ago

But this should be discussed at dev@avro.apache.org !

I hear you but you don't hear me! I asked you to start a discussion at dev@avro.apache.org

jslusher commented 7 months ago

Ah I misunderstood you there. I just sent an email to that group with a description of the issue and linked this issue and the issue I opened Jira.

martin-g commented 6 months ago

I've opened https://github.com/apache/avro/pull/2643 - it allows to register custom names' validators. This way any application can define its own rules if needed.

gklijs commented 5 months ago

It was merged to master in Avro. So it should work once there is a new release, and we update to that version. I'll check/add a test once it's there before closing this.