ChameleonTartu opened this issue 10 months ago
Hi @ChameleonTartu Thanks for the question :)
After a quick look into Avro, it seems to me that it's a format used for transferring data quickly. It contains a schema, which the Avro writer uses to validate the incoming data it is going to write. If the data is valid, the Avro writer encodes the schema + data into a binary blob. Any Avro reader can then read and decode the data properly, because the schema is included in the blob.
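To make that write/read cycle concrete, here is a minimal sketch of the round trip with the fastavro library (the schema and record below are made-up examples, not anything Kaiba would need):

```python
# Minimal sketch of the Avro write/read cycle using fastavro (pip install fastavro).
# The schema and record are illustrative only.
import io
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Age", "type": "int"},
    ],
})

records = [{"Name": "Ada", "Age": 36}]

# The writer validates the records against the schema and embeds the schema
# in the resulting binary blob (an Avro container file).
buffer = io.BytesIO()
writer(buffer, schema, records)

# Any reader can decode the blob without being handed the schema separately,
# because the schema travels with the data.
buffer.seek(0)
print(list(reader(buffer)))  # [{'Name': 'Ada', 'Age': 36}]
```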
From what I understand, Kaiba could be used either in front, before data is injected into Avro, to make arbitrary data conform with what Avro expects, or behind, after the Avro reader has read the data and output some JSON, to turn it into a more desired format. My initial thinking is that Kaiba shouldn't need to handle the schema part of Avro, but I'll give this more thought.
I've been contemplating adding some pre/post-processors directly into kaiba core, but I'm not sure if that's the right place.
I've also just had a quick look at protobuf, and I was wondering if you could explain a bit more about the use case. Is your use case to change the .proto schema data into a JSONSchema declaration? Or is it again to change the data before injection and after reading?
@thomasborgen I will bring a bit of context as my use cases are from Data Engineering and working with Apache Kafka, not from the integration space where we have used Kaiba.
CONTEXT:
In a broad sense, Apache Kafka is a Data Bus where you dump your data (produce) and read it later (consume). Kafka doesn't care what you put in; it only knows and stores bytes. It has topics, which are effectively separate channels/queues where you put your messages for separation.
A reader or writer of Kafka needs an extra Schema Registry. The Schema Registry stores schemas in AVRO, JSON (JSON Schema), or PROTOBUF format. Readers and writers each follow their own sequence of steps, but the common part, independently of the schema format, is:

```
schema = read_schema()
validate(message, schema)
```
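As a rough illustration of that common part, here is a sketch against a Confluent-style Schema Registry, assuming the subject's schema is stored as JSON Schema; the registry URL, subject name, and the use of the requests and jsonschema packages are just assumptions for the example:

```python
# Sketch of the common reader/writer step: fetch the schema for a subject
# from a Confluent-style Schema Registry and validate a message against it.
import json
import requests
from jsonschema import validate

REGISTRY_URL = "http://localhost:8081"   # placeholder
SUBJECT = "my-topic-value"               # placeholder


def read_schema(subject: str) -> dict:
    """Fetch the latest schema version for a subject from the registry."""
    response = requests.get(f"{REGISTRY_URL}/subjects/{subject}/versions/latest")
    response.raise_for_status()
    return json.loads(response.json()["schema"])


message = {"Name": "Ada", "Age": 36}
schema = read_schema(SUBJECT)
validate(instance=message, schema=schema)  # raises ValidationError if the message does not conform
```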
PROBLEM:
There is no simple way to convert one schema format into another, so I cannot take an AVRO schema and convert it to JSON Schema, or a JSON Schema and convert it to PROTOBUF. The expected behavior would be to convert schema to schema with no pain:

AVRO <-> JSON
PROTOBUF <-> JSON

The reason for this use case to exist, and its correlation with Kaiba:
1. Read messages from topic ABC. Messages are in JSON format.
2. Enrich or trim the messages and post them to topic XYZ. The messages should be in AVRO format. (Kaiba-related)
3. Make a decision and post to topic ATP; the message should be in PROTOBUF format.
Enrichment or data manipulation is truly Kaiba's reason for existing, but the integration between formats still has to be solved. As I wrote before, JSON to AVRO and JSON to PROTOBUF are partially industry-solved, while conversion between JSON Schema and the other schema formats is, to my knowledge, still a wide-open question.
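To sketch what steps 1 and 2 of that list could look like in code, here is one possible (non-Kaiba-specific) pipeline using confluent-kafka and fastavro; the broker address, group id, schema, and enrichment logic are placeholders:

```python
# Sketch of steps 1-2: consume JSON messages from topic ABC, enrich them,
# and produce them to topic XYZ as Avro bytes.
import io
import json
from confluent_kafka import Consumer, Producer
from fastavro import parse_schema, schemaless_writer

avro_schema = parse_schema({
    "type": "record",
    "name": "Enriched",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "score", "type": "int"},
    ],
})

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "schema-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ABC"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    record = json.loads(msg.value())              # 1. message arrives as JSON
    enriched = {"id": record["id"], "score": 42}  # 2. enrich/trim (this is where Kaiba fits)

    buffer = io.BytesIO()
    schemaless_writer(buffer, avro_schema, enriched)  # serialize to Avro bytes
    producer.produce("XYZ", value=buffer.getvalue())
    producer.flush()
```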
QUESTIONS: Is it Kaiba-related? Yes, partially, because Kaiba is great at manipulating data based on a schema. Do you think this particular request should go into kaiba-core? Not necessarily; it could go into the Kaiba ecosystem and help promote it in the Data Engineering niche.
I am open to discussion and ready to contribute to this branch of the project as I see a great need for it myself.
Hi again @ChameleonTartu. I did a test where I changed an Avro schema into a JSON Schema with Kaiba using the Kaiba App, and it worked. However, it made clear that we have a limitation in Kaiba: we are unable to transform values into keys. For example, in Avro a field is defined like this:
{"name": "field_name", "type": "string"}
But in JSONSchema a field's name is its key as in:
{
"properties": {
"field_name": {
"type": "string"
}
}
}
I think this is something we should look into supporting, since it could be very powerful. This would include getting a key's name instead of its value, and maybe also extending the kaiba object to make it possible to have a dynamic name.
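To illustrate the value-to-key transformation Kaiba cannot express today, this is what the missing capability amounts to in plain Python (simplified to primitive types):

```python
# Each Avro field's "name" value becomes a key under "properties".
avro_fields = [
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"},
]

properties = {field["name"]: {"type": field["type"]} for field in avro_fields}
# {'Name': {'type': 'string'}, 'Age': {'type': 'int'}}
```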
For Protobuf, since it's not JSON data, we can't map directly to it. We can only map to the correct structure and let a post-processor handle the dump from JSON to Protobuf.
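Such a post-processor could, for instance, lean on the json_format helpers that ship with the protobuf package; the employee_pb2 module and its Employee message here are hypothetical, generated by protoc from a .proto file:

```python
# Sketch of a post-processor that turns Kaiba's JSON output into Protobuf bytes.
# `employee_pb2.Employee` is a hypothetical class generated by protoc.
from google.protobuf import json_format
import employee_pb2

kaiba_output = {"Name": "Ada", "Age": 36}

message = json_format.ParseDict(kaiba_output, employee_pb2.Employee())
protobuf_bytes = message.SerializeToString()
```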
Here is how I changed the Avro schema into a JSONSchema:
Given this Avro schema
{
"type": "record",
"namespace": "Tutorialspoint",
"name": "Employee",
"fields": [
{"name": "Name", "type": "string"},
{"name": "Age", "type": "int"}
]
}
And this kaiba config
{
"name": "root",
"array": false,
"iterators": [],
"attributes": [
{
"name": "title",
"default": "Employee"
},
{
"name": "type",
"default": "object"
}
],
"objects": [
{
"name": "properties",
"array": false,
"objects": [
{
"name": "Name",
"array": false,
"iterators": [],
"attributes": [
{
"name": "type",
"data_fetchers": [
{
"path": ["fields", 0, "type"]
}
]
}
]
},
{
"name": "Age",
"array": false,
"iterators": [],
"attributes": [
{
"name": "type",
"data_fetchers": [
{
"path": ["fields", 1, "type"]
}
]
}
]
}
]
}
]
}
You can produce this:
{
"title": "Employee",
"type": "object",
"properties": {
"Name": {
"type": "string"
},
"Age": {
"type": "int"
}
}
}
@thomasborgen This works; the only issue is that it doesn't do any magic. It is very manual, and it requires a good understanding of both formats as well as of Kaiba itself.
Even so, I think this is a great solution, so I will include it in the manual for the AVRO to JSON transformation.
Do I understand correctly that there is no "reverse" transformation available yet?
@thomasborgen, I believe this was not the initially intended usage of Kaiba, but is it possible to convert AVRO to JSON Schema or PROTOBUF to JSON Schema?
Many tools are available in Python, Java, and other languages, for example https://github.com/criccomini/twister.
All those tools convert data to data. Is there a way to extend Kaiba to generate both the data in JSON and the JSON Schema?
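For reference, the "magic" such a converter does is mostly a recursive walk over the Avro schema with a type-mapping table; a minimal sketch that only covers records and primitive types could look like this (the mapping table and function are mine, not taken from any of the tools above):

```python
# Minimal sketch of a generic Avro-to-JSON-Schema conversion, covering only
# records and primitive types. Real converters also handle unions, enums,
# arrays, maps, logical types, and more.
PRIMITIVES = {
    "string": {"type": "string"},
    "int": {"type": "integer"},
    "long": {"type": "integer"},
    "float": {"type": "number"},
    "double": {"type": "number"},
    "boolean": {"type": "boolean"},
    "bytes": {"type": "string"},
    "null": {"type": "null"},
}


def avro_to_json_schema(avro: dict) -> dict:
    avro_type = avro["type"]
    if isinstance(avro_type, dict):  # nested complex type on a field
        return avro_to_json_schema(avro_type)
    if avro_type == "record":
        return {
            "title": avro["name"],
            "type": "object",
            "properties": {
                field["name"]: avro_to_json_schema(field) for field in avro["fields"]
            },
        }
    return PRIMITIVES[avro_type]


# Applied to the Employee schema above, this produces:
# {"title": "Employee", "type": "object",
#  "properties": {"Name": {"type": "string"}, "Age": {"type": "integer"}}}
```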