Open razumau opened 1 year ago
Thanks for the report, @razumau!
Yes, you are correct that the avro encoding doesn't write the schema as a header. I think the common usage to this point has been with known schemas that users can then apply to the file separately when consuming. It is an obvious shortcoming though.
Fixing this would require an update to the encoding system within Vector, which currently only deals with encoding individual events and has no way to, for example, provide additional data to write at the beginning of each "batch". The same issue would come up with a csv codec or any other codec that requires "headers".
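To make the "data at the beginning of each batch" point concrete, here is a minimal sketch of what an Avro object container file needs around the individually encoded events. It is not Vector's code; the function names (`container_header`, `container_block`, `new_sync_marker`) are hypothetical, and it assumes the "null" codec (no compression). The layout follows the Avro specification: a 4-byte magic, a metadata map holding the schema, and a 16-byte sync marker, followed by blocks that each carry an object count, a byte size, the datums, and the sync marker again.

```python
import json
import os
from typing import List

MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte 1


def encode_long(n: int) -> bytes:
    """Encode an Avro long: zigzag, then base-128 varint (low group first)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Avro bytes/string: length as a long, then the raw bytes."""
    return encode_long(len(b)) + b


def new_sync_marker() -> bytes:
    """A random 16-byte marker, generated once per file."""
    return os.urandom(16)


def container_header(schema: dict, sync: bytes) -> bytes:
    """Build the file header that is written once, before any events."""
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": b"null",  # assumption: no compression
    }
    out = bytearray(MAGIC)
    out += encode_long(len(meta))  # one map block with all entries
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)  # a zero count terminates the map
    out += sync
    return bytes(out)


def container_block(datums: List[bytes], sync: bytes) -> bytes:
    """Wrap already-encoded datums (one per event) into one data block."""
    body = b"".join(datums)
    return encode_long(len(datums)) + encode_long(len(body)) + body + sync
```

The header depends only on the schema, while each block needs to know how many events it contains, which is why a purely per-event encoder cannot produce this layout on its own.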
@jszwedko Hello! Is there any news on this issue? Are there any plans to add the header bytes and schema to the file so that the result is truly Avro compatible?
I'm afraid that at the moment this sink is almost useless in real use, because almost every tool or system I tried for processing the resulting file expects a schema or at least a header.
I'm not an expert on the Avro standard, but would a file with a header but without the data schema itself be a valid Avro file?
If so, perhaps it would be possible to add only a header to the Avro file the sink produces, so that the result can be read by standard libraries (in my case, attempts to process the resulting file included Python, Apache Spark and Apache Impala).
I'm not an avro expert, but my understanding is that the schema can be provided when decoding data files so it is usable albeit less convenient. Ideally the schema would be included in the file itself too.
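As an illustration of supplying the schema at decode time, here is a minimal sketch that decodes one headerless Avro datum with a schema the consumer already knows. It is deliberately limited: it assumes a flat record with only `string` and `long` fields, and it assumes the input is a single datum with no framing around it. The function names and the example schema are hypothetical; a real consumer would use a full Avro library instead.

```python
import io


def read_long(buf: io.BytesIO) -> int:
    """Decode an Avro long: base-128 varint, then undo the zigzag."""
    shift = 0
    result = 0
    while True:
        byte = buf.read(1)
        if not byte:
            raise EOFError("truncated Avro long")
        b = byte[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_string(buf: io.BytesIO) -> str:
    """Avro string: length as a long, then UTF-8 bytes."""
    length = read_long(buf)
    return buf.read(length).decode("utf-8")


# Only the two primitive types this sketch supports.
READERS = {"long": read_long, "string": read_string}


def read_record(buf: io.BytesIO, schema: dict) -> dict:
    """Decode one record datum, reading fields in schema order."""
    return {f["name"]: READERS[f["type"]](buf) for f in schema["fields"]}


# Hypothetical schema, known to the consumer out of band.
SCHEMA = {
    "type": "record",
    "name": "Log",
    "fields": [
        {"name": "message", "type": "string"},
        {"name": "count", "type": "long"},
    ],
}

record = read_record(io.BytesIO(b"\x04hi\x06"), SCHEMA)
```

Because a datum carries no field names or types, the bytes can only be interpreted with exactly the schema that was used to write them, which is the inconvenience described above.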
+👍, as this issue prevents using Google BigQuery with the generated files. Unfortunately, BigQuery expects the header to be present and refuses to import the files or use them as source data in an external table.
Problem
Avro files generated by sinks are missing the header.
According to Avro’s specification, Avro object container files should begin with a header that includes the schema. However, Avro files produced by the File or S3 sinks start directly with data and have no header.
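A quick way to confirm the symptom is to look at the first four bytes of a file: per the Avro specification, an object container file starts with the magic bytes `Obj` followed by the byte 1. The sketch below is only a diagnostic aid; the helper names are hypothetical.

```python
AVRO_MAGIC = b"Obj\x01"  # 'O', 'b', 'j', then the byte 1


def has_avro_header(data: bytes) -> bool:
    """Return True if the data starts with the Avro container file magic."""
    return data[:4] == AVRO_MAGIC


def file_has_avro_header(path: str) -> bool:
    """Check only the first four bytes of the file at the given path."""
    with open(path, "rb") as f:
        return has_avro_header(f.read(4))
```

A file that fails this check is a sequence of bare datums rather than a container file, so tools that expect the container format will reject it.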
It seems to be happening because the only Avro-writing method used in Vector is to_avro_datum (https://github.com/vectordotdev/vector/blob/6542778af87ec8324ff1e75b9e68cb3251d1931c/lib/codecs/src/encoding/format/avro.rs#L73), and the comment about it in apache-avro says that it doesn’t generate headers. This usage makes sense, because we don’t need headers there; however, headers are seemingly not being written anywhere else.
Am I missing something? Should there be more in my config? I’ve attached a minimal config that creates a small Avro file. Avro files are in this gist.
Configuration
Version
0.28.1