[Open] @grandimk opened this issue 2 years ago
Hi @grandimk, thanks for opening this issue. Can you please provide more information about your record schema? It looks like this is where your conversion fails: https://github.com/streamnative/pulsar-io-cloud-storage/blob/branch-2.9.3.6/src/main/java/org/apache/pulsar/io/jcloud/util/AvroRecordUtil.java#L204-L207
Hi @alpreu, I looked at the code in the `AvroRecordUtil.java` file but couldn't figure out why the conversion fails. This is the `schema_info` of the record I used for my tests:
```json
{
  "type": "record",
  "name": "Quiz",
  "fields": [
    {
      "name": "audit",
      "type": {
        "type": "record",
        "name": "Audit",
        "fields": [
          {
            "name": "actor",
            "type": {
              "type": "record",
              "name": "Actor",
              "fields": [
                { "name": "actorId", "type": "string" },
                {
                  "name": "actorType",
                  "type": {
                    "type": "enum",
                    "name": "ActorType",
                    "symbols": ["person", "service"]
                  }
                },
                { "name": "ip", "type": ["null", "string"], "default": null }
              ]
            }
          },
          {
            "name": "producer",
            "type": {
              "type": "record",
              "name": "Producer",
              "fields": [
                { "name": "code", "type": "string" },
                {
                  "name": "instanceId",
                  "type": ["null", "string"],
                  "default": null
                },
                { "name": "producerType", "type": "string" },
                {
                  "name": "version",
                  "type": ["null", "string"],
                  "default": null
                }
              ]
            }
          }
        ]
      }
    },
    {
      "name": "metadata",
      "type": {
        "type": "record",
        "name": "Metadata",
        "fields": [
          { "name": "eventId", "type": "string" },
          { "name": "eventTimestamp", "type": "long" },
          { "name": "eventType", "type": ["null", "string"], "default": null }
        ]
      }
    },
    {
      "name": "payload",
      "type": {
        "type": "record",
        "name": "QuizPayload",
        "fields": [
          {
            "name": "categoryTreeNodesIds",
            "type": ["null", { "type": "array", "items": "long" }],
            "default": null
          },
          { "name": "id", "type": "string" },
          { "name": "isHidden", "type": "boolean" },
          {
            "name": "knowledgeGraphNodesIds",
            "type": ["null", { "type": "array", "items": "long" }],
            "default": null
          },
          {
            "name": "properties",
            "type": {
              "type": "record",
              "name": "ContentProperties",
              "fields": [
                { "name": "authorId", "type": ["null", "string"], "default": null },
                { "name": "copiedFrom", "type": ["null", "string"], "default": null },
                { "name": "createdAt", "type": ["null", "string"], "default": null },
                { "name": "creatorId", "type": ["null", "string"], "default": null },
                { "name": "ownerId", "type": ["null", "string"], "default": null },
                {
                  "name": "permission",
                  "type": {
                    "type": "record",
                    "name": "ContentPermission",
                    "fields": [
                      {
                        "name": "teams",
                        "type": ["null", { "type": "array", "items": "string" }],
                        "default": null
                      }
                    ]
                  }
                },
                { "name": "version", "type": ["null", "string"], "default": null }
              ]
            }
          },
          { "name": "questions", "type": { "type": "array", "items": "string" } },
          { "name": "questionsNumber", "type": ["null", "long"], "default": null },
          { "name": "quizId", "type": ["null", "string"], "default": null },
          {
            "name": "showDetailedFeedback",
            "type": ["null", "boolean"],
            "default": null
          },
          { "name": "slug", "type": ["null", "string"], "default": null },
          { "name": "title", "type": "string" }
        ]
      }
    }
  ]
}
```
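The schema combines nested records, nullable (`["null", T]`) unions, and an enum, which are the Avro constructs that most often trip up generic-record conversions. As a quick sanity check, a small stdlib-only script (a hypothetical helper, not part of the connector) can walk such a schema and list those fields:

```python
import json

# Excerpt of the reported schema: the nested Actor record, which
# contains both an enum and a nullable union.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Actor",
  "fields": [
    {"name": "actorId", "type": "string"},
    {"name": "actorType",
     "type": {"type": "enum", "name": "ActorType",
              "symbols": ["person", "service"]}},
    {"name": "ip", "type": ["null", "string"], "default": null}
  ]
}
""")

def walk(schema, path="", out=None):
    """Collect (field path, kind) pairs for enum and nullable-union fields."""
    if out is None:
        out = []
    if isinstance(schema, dict) and schema.get("type") == "record":
        for field in schema["fields"]:
            ftype = field["type"]
            fpath = f"{path}.{field['name']}".lstrip(".")
            if isinstance(ftype, list) and "null" in ftype:
                out.append((fpath, "nullable-union"))
            elif isinstance(ftype, dict) and ftype.get("type") == "enum":
                out.append((fpath, "enum"))
            elif isinstance(ftype, dict) and ftype.get("type") == "record":
                walk(ftype, fpath, out)   # recurse into nested records
    return out

print(walk(SCHEMA))
```

Running this over the full `Quiz` schema would flag every union and enum field, which narrows down the candidates for the failing conversion.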
As an additional note, I want to point out that we define our events using JSON Schema and then generate the related Python and Java classes from it.
@grandimk Thanks for providing the record schema. I had another look but I cannot see an immediate issue either. Would you be able to create a unit test in `ParquetFormatTest` that reproduces the issue with your generated Java record/schema?
**Describe the bug**
I was using the Cloud Storage Sink to collect data from Pulsar and write it to AWS S3 in Parquet format. Messages were produced using a `JsonSchema` format. The sink fails as soon as it tries to convert the collected data into `org.apache.avro.generic.GenericRecord` (within the `convertGenericRecord` function). I tried producing messages both from Python and from Java, and both fail, but with different stack traces.

Note: if the `formatType` specified in the configuration is `json`, everything works fine.

**To Reproduce**
Use this template configuration for the pulsar-io-cloud-storage `v2.9.3.6`:

And produce messages in `JsonSchema` format. Here is the code for a minimal Python producer:

**Expected behavior**
A chunk of data containing a list of the collected messages, written to the specified AWS S3 prefix in Parquet format.
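The conversion step where the sink fails can be pictured with a stdlib-only sketch. This is NOT the connector's actual `convertGenericRecord` implementation, only a conceptual model of the work a JSON-to-Avro-record mapping has to do (nested records, `["null", T]` unions, enum symbol checks) and of where it can reject input:

```python
import json

# Hypothetical excerpt of the reported schema: the Actor record.
ACTOR_SCHEMA = {
    "type": "record",
    "name": "Actor",
    "fields": [
        {"name": "actorId", "type": "string"},
        {"name": "actorType", "type": {
            "type": "enum", "name": "ActorType",
            "symbols": ["person", "service"]}},
        {"name": "ip", "type": ["null", "string"], "default": None},
    ],
}

def convert(value, schema):
    """Map a decoded JSON value onto an Avro-shaped record, failing loudly."""
    if isinstance(schema, list):                 # union, e.g. ["null", "string"]
        if value is None and "null" in schema:
            return None
        branch = next(s for s in schema if s != "null")
        return convert(value, branch)
    if isinstance(schema, dict) and schema["type"] == "record":
        return {f["name"]: convert(value[f["name"]], f["type"])
                for f in schema["fields"]}
    if isinstance(schema, dict) and schema["type"] == "enum":
        if value not in schema["symbols"]:
            raise ValueError(f"{value!r} is not a symbol of {schema['name']}")
        return value
    return value                                 # primitives pass through

record = json.loads('{"actorId": "a-1", "actorType": "person", "ip": null}')
print(convert(record, ACTOR_SCHEMA))
```

A record carrying an unexpected enum symbol or a union value of the wrong branch fails in `convert`, analogous to how the sink fails inside `convertGenericRecord` while the plain `json` format, which skips the Avro mapping, succeeds.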
**Screenshots**
None
**Additional context**
The tests were done on my laptop, using an Apache Pulsar Docker container where the schema registry was properly configured (the schema definitions of the messages had been uploaded) and the `pulsar-io-cloud-storage-2.9.3.6.nar` version was loaded.

This is the error that occurred while writing data produced with the Python producer:

This is the error that occurred while writing data produced with the Java producer: