lerouxrgd / rsgen-avro

Command line and library for generating Rust types from Avro schemas
MIT License

Unable to make it work for nested types in different Schema Files. #44

Open markfarnan opened 1 year ago

markfarnan commented 1 year ago

Looking for guidance on how to make rsgen-avro and apache-avro work when there are multiple schema files, where one file relies on a type defined in another file.

rsgen-avro seems to generate the types correctly, and Schema::parse_list seems to parse the schemas correctly.

The Writer throws validation errors (I've also tried using lower-level primitives such as to_avro_datum, with the same issue). Line 81 in the sample code below returns: Err(SchemaResolutionError(Name { name: "ErrorInfo", namespace: Some("Energistics.Etp.v12.Datatypes") }))

Sample code to show the issue. Notes:

-- This is a subset of what I need to do. The full schema definition I need to use is 200+ .avsc files, with multiple nested / reused types.
-- I can't change the structure of the schema files, as they are a published industry standard in our domain for data interchange over a wire protocol. (I will actually use to_avro_datum for the final version, as we just want the raw bytes, not the headers etc.)
-- I'm getting some other errors with rsgen-avro (serde-bytes, and being unable to use derive-schemas) that I might post separately. I can share the full set of schemas if it is useful for debugging.

Sample code showing the problem:


// Test File
use apache_avro::{Schema, Writer};

// Structs generated by rsgen-avro with the command line:  rsgen-avro "*.avsc" schema.rs
// Interestingly, trying to use "--nullable" in rsgen-avro throws an error "Templating error: Failed to render 'record.tera'"; might need to look at that separately.
//
#[derive(Debug, PartialEq, Eq, Clone, serde::Deserialize, serde::Serialize)]
pub struct ProtocolException {
    pub error: Option<ErrorInfo>,
    #[serde(default = "default_protocolexception_errors")]
    pub errors: ::std::collections::HashMap<String, ErrorInfo>,
}

#[inline(always)]
fn default_protocolexception_errors() -> ::std::collections::HashMap<String, ErrorInfo> {
    ::std::collections::HashMap::new()
}

#[derive(Debug, PartialEq, Eq, Clone, serde::Deserialize, serde::Serialize)]
pub struct ErrorInfo {
    pub message: String,
    pub code: i32,
}

// Main Test
fn main() {
    let errinfo_schema = r#"
    {
        "type": "record",
        "namespace": "Energistics.Etp.v12.Datatypes",
        "name": "ErrorInfo",
        "fields":
        [
            { "name": "message", "type": "string" },
            { "name": "code", "type": "int" }
        ]
    }
    "#;

    let protocolexception_schema = r#"
    {
        "type": "record",
        "namespace": "Energistics.Etp.v12.Protocol.Core",
        "name": "ProtocolException",
        "protocol": "0",
        "messageType": "1000",
        "senderRole": "*",
        "protocolRoles": "client, server",
        "multipartFlag": true,
        "fields":
        [
            { "name": "error", "type": ["null", "Energistics.Etp.v12.Datatypes.ErrorInfo"] },
            {
                "name": "errors",
                "type": { "type": "map", "values": "Energistics.Etp.v12.Datatypes.ErrorInfo" }, "default": {}
            }
        ]
    }
    "#;

    let schema_list: [&str; 2] = [protocolexception_schema, errinfo_schema];
    let schemas_raw = Schema::parse_list(&schema_list);

    println!("{:?}", schemas_raw);
    println!("{:?}", "--------------");

    let schemas = schemas_raw.unwrap();

    let pe = ProtocolException {
        error: Some(ErrorInfo {
            message: String::from("some Error Message"),
            code: 451,
        }),
        errors: ::std::collections::HashMap::new(),
    };

    println!("{:?}", schemas[0]); // The 'ProtocolException' Schema in the list
    println!("{:?}", "--------------");

    let mut writer = Writer::new(&schemas[0], Vec::new()); // <-- Should I be able to pass the 'list' of schemas here somewhere, so it resolves all the sub-types?

    let result = writer.append_ser(pe); // <-- SchemaResolutionError! Missing the ErrorInfo sub-type
    println!("{:?}", result);

    let encoded = writer.into_inner();
    println!("{:?}", encoded);

    // Equivalent READER code to go here, once the WRITER works.
}
martin-g commented 1 year ago

@markfarnan I think the correct place to report this problem is https://issues.apache.org/jira/projects/AVRO. The best would be to create a JIRA issue and a PR with a test that fails in https://github.com/apache/avro/blob/master/lang/rust/avro/src/writer.rs. Please try to make it as minimal as possible by removing fields and attributes which are not really needed!
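
For reference, here is a minimal sketch of the kind of failing test being suggested, reusing the two schemas from the sample above (the struct and test names are illustrative, not taken from writer.rs):

use apache_avro::{Schema, Writer};

#[derive(serde::Serialize)]
struct ErrorInfo {
    message: String,
    code: i32,
}

#[derive(serde::Serialize)]
struct ProtocolException {
    error: Option<ErrorInfo>,
}

#[test]
fn writes_record_that_references_another_parsed_schema() {
    let errinfo = r#"{"type": "record", "namespace": "Energistics.Etp.v12.Datatypes",
        "name": "ErrorInfo", "fields": [
            {"name": "message", "type": "string"},
            {"name": "code", "type": "int"}]}"#;
    let protocol_exception = r#"{"type": "record",
        "namespace": "Energistics.Etp.v12.Protocol.Core", "name": "ProtocolException",
        "fields": [{"name": "error",
            "type": ["null", "Energistics.Etp.v12.Datatypes.ErrorInfo"]}]}"#;

    // parse_list accepts both schemas, but schemas[0] keeps ErrorInfo as a named reference.
    let schemas = Schema::parse_list(&[protocol_exception, errinfo]).unwrap();
    let mut writer = Writer::new(&schemas[0], Vec::new());

    // Currently returns Err(SchemaResolutionError(..)), so this assertion fails.
    let appended = writer.append_ser(ProtocolException {
        error: Some(ErrorInfo { message: "boom".into(), code: 1 }),
    });
    assert!(appended.is_ok());
}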

markfarnan commented 1 year ago

Thanks Martin,

So I take it this should work (in theory!) the way I'm using it?

Or, put another way, the schema returned in the [0] element of the vector (in this case) SHOULD be a full schema?

Didn't want to report an error if I'm just using it wrong etc.

martin-g commented 1 year ago

So I take it this should work (in theory!) the way I'm using it?

I am not sure what the initial purpose of this API was. It is definitely not usable in the current version of the apache_avro crate (even on the master branch)! But I think I have an idea of how to implement it! I just need to find time to do it :-)

markfarnan commented 1 year ago

Sounds good.

Interestingly, if I derive the schemas from the objects (using --derive-schemas in rsgen-avro), the resulting schema works fine with the writer!

I've attached two files for reference in case they are useful: the first one is the 'failing' parsed schema (the result of the println!) that just uses the list, and the second is the 'working' schema that I obtained from ProtocolException::get_schema().

It looks like the parse_list function is just putting in references, which are unresolvable, whereas get_schema() is putting in valid unions that embed the referred schema.
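
For illustration (this is not one of the attached files, just a hand-written sketch): a self-contained version of ProtocolException would inline the ErrorInfo definition at its first use and refer to it by name afterwards, which is roughly what the get_schema() output does:

{
    "type": "record",
    "namespace": "Energistics.Etp.v12.Protocol.Core",
    "name": "ProtocolException",
    "fields": [
        {
            "name": "error",
            "type": [
                "null",
                {
                    "type": "record",
                    "namespace": "Energistics.Etp.v12.Datatypes",
                    "name": "ErrorInfo",
                    "fields": [
                        { "name": "message", "type": "string" },
                        { "name": "code", "type": "int" }
                    ]
                }
            ]
        },
        {
            "name": "errors",
            "type": { "type": "map", "values": "Energistics.Etp.v12.Datatypes.ErrorInfo" },
            "default": {}
        }
    ]
}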

Note: I don't think I can use derive-schemas for my project, as the resulting file from rsgen-avro has dozens of errors; some types don't support derive.
Also, I need to use Value and to_avro_datum, as I just want the bytes.

I'll try to work out how to do that test / JIRA issue for Avro as soon as I can, as using these Avro schemas is critical for us switching to Rust for some new work.

Attachments: bad_schema.txt, good_schema.txt

lerouxrgd commented 1 year ago

Indeed, issues related to the Avro Writer/Reader should be reported to apache_avro, since rsgen-avro is really just for Rust type generation. However, if you have a valid Avro schema that generates invalid Rust types, then you should open an issue here.

I'm getting some other errors with rsgen-avro (serde-bytes, and being unable to use derive-schemas) ... I don't think I can use derive-schemas for my project, as the resulting file from rsgen-avro has dozens of errors; some types don't support derive.

This is strange; yes, please open a separate issue for this with a basic example to reproduce it.

Interestingly trying to use "--nullable" in rsgen-avro throws an error "Templating error: Failed to render 'record.tera'"

This is strange too; please open a separate issue for it as well.

Besides, if I've understood correctly, the schema you get from Schema::parse_list is different from the one from AvroSchema::get_schema?

This is unexpected indeed, but this issue should be addressed in apache_avro instead. @martin-g Actually, Schema::parse_list originated from rsgen-avro/issues/6, which in turn led to avro-rs/pull/173.

markfarnan commented 1 year ago

JIRA issue and PR created in Apache Avro for this problem.

I'll see if I can isolate some examples for the other issues.
One should be easy, as anything using serde_bytes doesn't want to compile.

The schema derive one is going to be more interesting, as the schema it breaks on has a dozen or so dependencies.

lerouxrgd commented 1 year ago

One should be easy, as anything using serde_bytes doesn't want to compile.

Are you including serde_bytes in your dependencies? It is required in this case.

markfarnan commented 1 year ago

Yes:

[dependencies]
clap = { version = "4.0", features = ["derive"] }
serde = { version = "1.0", features = ["derive"] }
serde_bytes = "0.11.7"
apache-avro = { version = "0.15.0", features = ["derive"] }
uuid = "1.2.2"
glob = "0.3.0"

It's weird, because several things annotated with #[serde(with = "serde_bytes")] work fine; just a couple don't. That said, they are complex unions of arrays of various types, so I'm just trying to narrow it down to something reproducible.
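
For reference, the pattern that does work here is the usual serde_bytes annotation on a plain byte field (illustrative struct, not one of the generated types):

#[derive(Debug, serde::Serialize, serde::Deserialize)]
pub struct Blob {
    // Plain Avro "bytes" field: serde_bytes handles Vec<u8> out of the box.
    #[serde(with = "serde_bytes")]
    pub data: Vec<u8>,
}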

The AvroSchema stuff breaks on some enums, with "error: AvroSchema derive does not work for enums with non unit structs".

It doesn't like things like this, it seems (I'm too new at this to have a clue as to why):

#[derive(Debug, PartialEq, Clone, serde::Serialize, apache_avro::AvroSchema)]
pub enum UnionLongDoublePassIndexedDepth {
    Long(i64),
    Double(f64),
    PassIndexedDepth(PassIndexedDepth),
}
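
For comparison (illustrative only, not a type from the ETP schemas), a unit-only enum derives fine, since it maps to a plain Avro enum; it is the data-carrying union variants above that the derive rejects:

#[derive(Debug, PartialEq, Clone, serde::Serialize, apache_avro::AvroSchema)]
pub enum Role {
    Client,
    Server,
}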

martin-g commented 1 year ago

See https://github.com/apache/avro/pull/1921

markfarnan commented 1 year ago

Thanks,

Unfortunately, I'm too new at Rust for that to make a lot of sense as yet!

However, on a separate point: currently, using 'AvroSchema' relies on regenerating a valid schema from the objects which rsgen-avro has created.

rsgen already knows the fully valid, resolved schema when it creates the objects in the first place (with all the extra fields, like doc and the real namespaces). It seems counter-intuitive to have to regenerate / reparse that schema in order to then USE those objects.

Would it be possible instead for rsgen to store that schema with the objects themselves, so that 'get_schema' actually returns a guaranteed-valid schema, namely the parsed version that was used to create them in the first place?

This is the approach taken by the Avro implementation in Go. It actually takes it a step further and generates serialize and deserialize functions that return valid Go objects, but that doesn't seem necessary with Serde, if the schemas are correct.
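
Something like this is what I have in mind, sketched by hand (the constant and helper names here are hypothetical, not anything rsgen-avro emits today), using the ErrorInfo schema from the sample above:

impl ErrorInfo {
    // Hypothetical: the exact schema JSON that rsgen-avro parsed when generating this struct.
    pub const RAW_SCHEMA: &'static str = r#"
    {
        "type": "record",
        "namespace": "Energistics.Etp.v12.Datatypes",
        "name": "ErrorInfo",
        "fields": [
            { "name": "message", "type": "string" },
            { "name": "code", "type": "int" }
        ]
    }"#;

    // Re-parse the embedded JSON instead of rebuilding the schema from the Rust type.
    pub fn embedded_schema() -> apache_avro::Schema {
        apache_avro::Schema::parse_str(Self::RAW_SCHEMA)
            .expect("schema embedded by the generator should always parse")
    }
}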

martin-g commented 1 year ago

Would it be possible instead for rsgen to store that schema with the objects themselves, so that 'get_schema' actually returns a guaranteed-valid schema, namely the parsed version that was used to create them in the first place?

This should be easy to do by providing an impl for AvroSchemaComponent

lerouxrgd commented 1 year ago

It's not really implementable as long as Names (which is part of the method signature) is not public.

Besides, I have an implementation for this on the impl-schema-component branch. It looks fine for simple cases, but if the schema has dependencies etc., it might not work...

untereiner commented 1 year ago

The schema variable at line 430 should be the expanded version of the schema, to take the dependencies into account. I was looking to add a recursive construction of the schema along the fields of the objects.

markfarnan commented 1 year ago

It would be ideal if:
A: The schema used to generate a given struct were stored with the struct and returned from the AvroSchema interface.
B: That schema were 'complete', including all dependencies down the chain.

markfarnan commented 1 year ago

As an aside, I've gotten the complex structs working by using PR 28 from serde_bytes, merging a couple of other PRs for Avro (3684 for multi-schema, plus the 'fixed' ones), and some hacking for the nullable unions.

If interested, this is all in branch 'etp-working' of https://github.com/markfarnan/avro/tree/master/lang/rust/avro

However, there are a lot of fixes required to the output of rsgen-avro before the types are usable. It won't even compile 'out of the box' due to some misplaced serde_bytes entries.

I plan to narrow them all down and file requests once the upstream apache-avro and serde_bytes changes are merged and I can see the final decisions on them.

If you're interested, the schemas and the generated file are here: https://github.com/Bardasz/etp-rs/tree/main/schema
And the corrected file, with a list of all the changes needed, is here: https://github.com/Bardasz/etp-rs/blob/main/src/schema_gen.rs

The most annoying change is to go through and put 'serde_bytes' above each 'uuid' entry (which is a type for [u8; 16]).

Once the upstream is sorted, I'll try and turn any remaining errors into failing tests or simple examples.

lerouxrgd commented 1 year ago

UnionBooleanIntLongFloatDoubleStringArrayOfBooleanArrayOfNullableBooleanArrayOfIntArrayOfNullableIntArrayOfLongArrayOfNullableLongArrayOfFloatArrayOfDoubleArrayOfStringArrayOfBytesBytesAnySparseArray

Wow

It might be interesting to have custom variant naming options for such cases, something like MyEnum{ V0(_), V1(_) }. But let's consider that once the other issues are solved.

Looking forward to the upstream apache-avro update so that we can tackle the issues you mentioned! Thanks for looking into all this.

markfarnan commented 1 year ago

UnionBooleanIntLongFloatDoubleStringArrayOfBooleanArrayOfNullableBooleanArrayOfIntArrayOfNullableIntArrayOfLongArrayOfNullableLongArrayOfFloatArrayOfDoubleArrayOfStringArrayOfBytesBytesAnySparseArray

You can see why I want to rename it, I'm sure :).

That complex nullable 'DataValue' union and the fixed [u8; 16] were the two main culprits behind various problems.

untereiner commented 1 year ago

Just a note:

// 7. Rename 'Self' to 'Self_' in ContextScopeKind (Self is a reserved word)

should be fixed by #47

lerouxrgd commented 3 months ago

@markfarnan with the latest release, 0.14, things seem to have improved.

I have generated code for ETP v1.2 using the following approach:

git clone https://bitbucket.org/energistics/etpv1.git
cd etpv1/
git checkout etpv1.2

cd ..
cargo new --bin gen-etp
cd gen-etp/
cargo add apache-avro serde
# then add `mod etp;`  in the first line of main.rs

cd ..
rsgen-avro "etpv1/src/Schemas/Energistics/**/*" gen-etp/src/etp.rs

The only compilation error that I have now is that 3 versions of pub struct Chunk are generated. This is due to the unsupported namespace field in their schema definitions; this issue is tracked in #31. However, in this particular case they all have the same schema (that is, blobId Uuid and data bytes), so if you delete the 2 extra generated definitions it should be fine.