lerouxrgd / rsgen-avro

Command line and library for generating Rust types from Avro schemas
MIT License
37 stars 29 forks source link

Named dependent schemas #34

Closed gudjonragnar closed 2 years ago

gudjonragnar commented 2 years ago

I am working with a lot of named schemas defined in different files and trying to generate structs. I have boiled it down to a simple example with two struct defined in separate files where A includes B:

# schemas/a.avsc
{
  "namespace": "a",
  "type": "record",
  "name": "A",
  "fields": [
    {"name": "first", "type": "b.B"},
    {"name": "second", "type": "b.B"}
  ]
}

# schemas/b.avsc
{
  "namespace": "b",
  "type": "record",
  "name": "B",
  "fields": [
    {"name": "first", "type": "string"}
  ]
}

I then have a simple script to read the two schemas and parse them using version 0.14.0 of the apache-avro crate before trying to generate a struct representation of A.

use std::fs::File;

use apache_avro::Schema;
use rsgen_avro::{Generator, Source};

fn main() {
    let path = std::env::current_dir().unwrap();
    let mut out_path = path.to_owned();
    out_path.push("output");

    let mut raw_schemas = Vec::new();
    for schema in &["schemas/b.avsc", "schemas/a.avsc"] {
        let mut p = path.to_owned();
        p.push(*schema);
        dbg!(&p);
        raw_schemas.push(std::fs::read_to_string(&p).unwrap());
    }
    let schemas = &raw_schemas.iter().map(|s| s.as_str()).collect::<Vec<_>>();
    let schemas = Schema::parse_list(&schemas).unwrap();
    dbg!(&schemas);

    let to_generate = "a";
    let mut tmp_path = out_path.to_owned();
    tmp_path.push(&format!("{}.rs", to_generate));
    std::fs::create_dir_all(tmp_path.parent().unwrap()).expect("Failed creating dirs");
    let mut out = File::create(&tmp_path).unwrap();
    let generator = Generator::builder().build().unwrap();
    generator
        .gen(&Source::Schema(schemas.get(1).unwrap()), &mut out)
        .unwrap();
}

The script fails at the generation step because apache-avro now writes the schema for A using Schema::Ref for the B struct, which is not handled. The error I get is

thread 'main' panicked at 'Schema reference 'Name { name: "B", namespace: Some("b") }' cannot be resolved', /..../rsgen-avro-0.11.0/src/templates.rs:376:25

Here you can see the schema parsed by apache-avro

    Record {
        name: Name {
            name: "A",
            namespace: Some(
                "a",
            ),
        },
        aliases: None,
        doc: None,
        fields: [
            RecordField {
                name: "first",
                doc: None,
                default: None,
                schema: Ref {
                    name: Name {
                        name: "B",
                        namespace: Some(
                            "b",
                        ),
                    },
                },
                order: Ascending,
                position: 0,
            },
            RecordField {
                name: "second",
                doc: None,
                default: None,
                schema: Ref {
                    name: Name {
                        name: "B",
                        namespace: Some(
                            "b",
                        ),
                    },
                },
                order: Ascending,
                position: 1,
            },
        ],
        lookup: {
            "first": 0,
            "second": 1,
        },
    }

I guess this could be solved in two ways:

  1. Add an option to apache-avro to inline the dependent schemas
  2. Add an option ot rsgen-avro to supply dependent schemas to generator and lookup the schema to generate.

...or if I am doing something wrong it would be great to get some feedback on that.

lerouxrgd commented 2 years ago

As for the namespace part it is currently not supported (as stated in README.md), see: #31 for progress on this.

As for the cross dependencies it should be fine if you provide the schemas together, see the following test. Is your use-case different from this test ?

gudjonragnar commented 2 years ago

The GlobPattern option does works in a way, i.e. it does generate everything. However, due to the complexity of the schemas it results in a naming conflict. I have a few top level schemas that reuse a lot of the same sub schemas. However many of the top level schemas have fields that are named the same and the value is a named schema with the same name but not the same structure. For example:

# schemas/a.avsc
{
  "type": "record",
  "name": "A",
  "fields": [
    {"name": "first", "type": {
         "name": "InnerSchema", "type": "record", "fields": [{"name": "innerfield", "type": "string"}]
      }
    },
  ]
}
# schemas/b.avsc
{
  "type": "record",
  "name": "B",
  "fields": [
    {"name": "first", "type": {
         "name": "InnerSchema", "type": "record", "fields": [{"name": "some_other_name", "type": "string"}]
      }
    },
  ]
}

If I use GlobPattern("schemas/*.avsc") here it would generate two InnerSchema structs with different definitions. I guess one way would be to collect relevant schemas together when building so I can target only the relevant ones with GlobPattern. It would be nice to have the option to more specifically specify the top level schema and its dependencies. Would you accept a PR for that?

lerouxrgd commented 2 years ago

I think your latest example would work if namespaces were supported.

As for schema selection flexibility, is the glob pattern not powerful enough to cherry pick the schemas you need ? There are many options as you can see in glob tests

lerouxrgd commented 2 years ago

If Source::GlobPattern is not flexible enough, we could add something like Source::Schemas(&[&Schema]).

gudjonragnar commented 2 years ago

I would have to do something like

GlobPattern("some/path/[a|b|c]/*.avsc");

which I dont think is supported by glob (as far as I can see). Essentially I have to be able to tell glob to fetch all avsc from a set of directories. I can take a stab at implementing Source::Schemas... if I find time.

lerouxrgd commented 2 years ago

Using globset instead of glob should be better since in your case "some/path/{a,b,c}/*.avsc" would work, right ?

gudjonragnar commented 2 years ago

Ye I think that should work for me. Is globset a drop in replacement? Just in case it might break for some users.