mtth / avsc

Avro for JavaScript :zap:
MIT License
1.29k stars 148 forks source link

Dynamic schema loading #468

Closed joscha closed 2 months ago

joscha commented 2 months ago

I am currently trying to load schemas from an automatically generated set of avro schemas. It is basically a bunch <typeName>.avsc files in a directory. They reference each other. So when I load one like this:

Avro.Type.forSchema(JSON.parse(fs.readFileSync(__dirname + `/avro-schema/${fieldValueType}.avsc`, 'utf-8'))

where fieldValueType is 'CompanyValue':

{
  "namespace": "model",
  "type": "record",
  "doc": "",
  "name": "CompanyValue",
  "fields": [
    {
      "name": "type",
      "type": {
        "type": "enum",
        "name": "CompanyValue_type",
        "symbols": [
          "company"
        ]
      },
      "doc": "The type of value"
    },
    {
      "name": "data",
      "type": "model.CompanyData",
      "doc": ""
    }
  ]
}

it yields:

undefined type name: model.CompanyData

because the schema from CompanyData.avsc has not yet been loaded.

Initially I thought the answer would be to define an import hook: https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/files.js#L14 however it seems this is only supported for: https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/specs.js#L23

I can see that there is a local registry map I can pass into each opt: https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/types.js#L103 so I could just load all files into the registry with:

registry["model.CompanyData"] = Avro.Type.forSchema(JSON.parse(fs.readFileSync(...))

however I'd need to do that in a specific order which means I'd need to parse the model schemas myself beforehand.

I am wondering if you'd be up for adding an additional

resolver: (typeName: string) => Schema | Schema[]

to opts

which kicks in before https://github.com/mtth/avsc/blob/7cb76a6eb7b0d5b5a71c9cfb446acf3eff763006/lib/types.js#L159 to dynamically load a missing Schema (recursively).

joscha commented 2 months ago

Workaround (not pretty but works):

class DynamicTypeRegistry {
  private readonly registry: Record<string, Avro.Type> = {};
  private readonly proxy = new Proxy(this.registry, {
    get: (target, prop, receiver) => {
      if (prop.toString() in this.registry) {
        // should use `maybeQualify` here, if it was exported from avsc
        return this.registry[prop.toString()];
      }

      let unqualifiedProp = prop.toString();

      const { namespace } = this.opts;
      // could use `unqualify` here, if it was exported from avsc
      if (namespace && prop.toString().startsWith(namespace + '.')) {
        unqualifiedProp = prop.toString().slice(namespace.length + 1);
      }
      const fileName = `${__dirname}/avro-schema/${unqualifiedProp}.avsc`;
      if (!fs.existsSync(fileName)) {
        return undefined;
      }
      // avsc checks whether there is already a truthy entry in the registry and fais if there is one
      // but it will go into an infinite loop if we don't set it to undefined
      // should use `maybeQualify` here, if it was exported from avsc
      this.registry[prop.toString()] = undefined as any;
      let jsonData;
      try {
        jsonData = JSON.parse(fs.readFileSync(fileName, 'utf-8'))
      } catch(e) {
        console.error('Error reading file', fileName, e);
        throw new Error(`Error reading file ${fileName}`);
      }
      const schema = Avro.Type.forSchema(
        jsonData,
        {
          ...this.opts,
          registry: this.proxy,
        }
      )
      return schema
    }      
  })
  constructor(private readonly opts: Record<string, string> = {}) {
  }
  get() {
    return this.proxy
  }
}

and then:

const registry = new DynamicTypeRegistry({
  namespace: 'model',
}).get();
const schema = registry['ListEntryWithEntity'];
console.log(schema);
// pass `registry` into more calls to `Avro.Type.forSchema` at your own will - it is compatible with the one expected in `opts`, like so:
Avro.Type.forSchema(..., { registry });

which will recursively load a schema from a set of .avsc files in a directory, such as ones produced from https://openapi-generator.tech/docs/generators/avro-schema/

joscha commented 2 months ago

Just stumbled across the typeHook function. I think possibly it can made to be work with that, too, will give it a try in a second, although the code would probably need

--- a/lib/types.js
+++ b/lib/types.js
@@ -138,6 +138,7 @@ class Type {
       if (!Type.isType(type)) {
         throw new Error(`invalid typehook return value: ${j(type)}`);
       }
+      opts.registry[schema] = type;
       return type;
     }

A workaround could be to write into the registry inside the hook, but I wonder why types resolved by the typeHook function are not put into the registry in the first place? Is the reason more control for a typeHook provider or is it an oversight?

mtth commented 2 months ago

Hi @joscha. Yes - a typehook is the way to go here.

I wonder why types resolved by the typeHook function are not put into the registry in the first place? Is the reason more control for a typeHook provider or is it an oversight?

Adding the type to the registry is best left to caller. Not all types have a name and even when they do, the caller might want to pick a different one.

joscha commented 2 months ago

Yes - a typehook is the way to go here.

Okay, will give it a try and report back. Would you potentially be open to making qualify and maybeQualify public API of this package, so we can reuse them?

joscha commented 2 months ago

Okay, I've given the typeHook a try. This is an alternative solution to the DynamicTypeRegistry above:

/**
 * TODO(@joscha): reuse this one from `avsc`
 */
function unqualify(type: string) {
  return type.replace(/^.*\./, "");
}

function isQualifiedType(type: string) {
  return type.includes(".");
}

/**
 * TODO(@joscha): reuse this one from `avsc`
 */
function maybeQualify(type: string, namespace?: string) {
  return namespace ? `${namespace}.${type}` : type;
}

function loadLocalSchema(
  type: string,
  opts: Partial<Avro.ForSchemaOptions>,
): Avro.Type | undefined {
  if (isQualifiedType(type)) {
    if (opts?.namespace && !type.startsWith(opts.namespace + ".")) {
      throw new Error(
        `Qualified type does not match namespace; don't know how to resolve "${type}"`,
      );
    }
    type = unqualify(type);
  } else {
    // all our internal named types are namespaced, so this must be a primitive
    return;
  }
  const fileName = `${__dirname}/avro-schema/${type}.avsc`;

  let jsonData;
  try {
    jsonData = JSON.parse(fs.readFileSync(fileName, "utf-8"));
  } catch (e) {
    console.error("Error reading file", fileName, e);
    throw new Error(`Error reading file ${fileName}`);
  }
  return Avro.Type.forSchema(jsonData, opts);
}

const typeHook: Avro.ForSchemaOptions["typeHook"] = (schema, opts) => {
  // interestingly, the `typeHook` definition doesn't suggest that the `schema` parameter can be of type `string`,
  // but referenced non-primitives are passed with their `.name` property.
  if (typeof schema === "string") {
    if (schema in opts.registry) {
      return opts.registry[schema];
    }
    const ret = loadLocalSchema(schema, opts);
    if (ret) {
      opts.registry[maybeQualify(schema, opts.namespace)] = ret;
    }
    return ret;
  }
};

const opts: Partial<Avro.ForSchemaOptions> = {
  typeHook,
  namespace: "model",
};

// entry point to load a whole set of connected schemas, starting with a named type `model.Company`
const companySchema = Avro.Type.forSchema("model.Company", opts);

I am fairly happy with it, except it could use:

how would you feel about exposing these from avsc @mtth?

Bonus points for:

mtth commented 2 months ago

how would you feel about exposing these from avsc

I don't think it's worth increasing avsc's API surface just yet, but I'll consider exporting them if more use-cases come up.

In the example above, you can avoid the isPrimitive file existence check if all your schemas are namespaced.

allowing asynchronous typeHooks, then we could change all the synchronous file methods to their async counterparts.

This capability would be great but we would need to maintain two flavors of schema parsing, which is not a trade-off worth making. The general recommendation here is to do all async-friendly work upfront, before schema parsing. In your case, you could preload the files using parallel async calls and cache them for synchronous lookups in the type hook.

fixing the typeHook type definition to include | string

Thanks for flagging, this should get fixed. PR welcome.

joscha commented 2 months ago

how would you feel about exposing these from avsc

I don't think it's worth increasing avsc's API surface just yet, but I'll consider exporting them if more use-cases come up.

I think aligning the behaviour of the implementations outside avsc and inside would be great. Otherwise there's always a chance they diverge and suddenly after a bump of avsc my typehook won't work anymore like it should.

In the example above, you can avoid the isPrimitive file existence check if all your schemas are namespaced.

That's true. In my particular case, this is guaranteed, thank you for pointing this out.

allowing asynchronous typeHooks, then we could change all the synchronous file methods to their async counterparts.

This capability would be great but we would need to maintain two flavors of schema parsing, which is not a trade-off worth making. The general recommendation here is to do all async-friendly work upfront, before schema parsing. In your case, you could preload the files using parallel async calls and cache them for synchronous lookups in the type hook.

This assumes I know what schemas I am loading ahead of time. It also prevents the use of ad-hoc generated schemas and/or calls to external services. Whilst promises would be my preferred way, most other parts of avsc work with callbacks, so I actually think the current implementation of typeHook is an outlier. Adding an optional callback function shouldn't be too onerous for maintenance?

fixing the typeHook type definition to include | string

Thanks for flagging, this should get fixed. PR welcome.

Here you go: https://github.com/mtth/avsc/pull/475

mtth commented 2 months ago

Otherwise there's always a chance they diverge and suddenly after a bump of avsc my typehook won't work anymore like it should.

Namespacing is defined by the specification, so unlikely to diverge in breaking ways.

I agree that it would be convenient for your use-case to export these symbols. I'm just not convinced the maintenance cost for avsc of doing so is worth it yet.

Adding an optional callback function shouldn't be too onerous for maintenance?

We would need to maintain both synchronous and asynchronous schema creation functions. It would also be a large refactor, whether for callbacks or promises. For example logical type handling currently assumes that only one schema is being parsed at a time.

joscha commented 2 months ago

I agree that it would be convenient for your use-case to export these symbols. I'm just not convinced the maintenance cost for avsc of doing so is worth it yet.

Okay, fair enough. I think given that the package provides a typeHook interface that is likely to encounter namespaced entities it would make sense, but I agree it's not a lot of duplicated code.

We would need to maintain both synchronous and asynchronous schema creation functions. It would also be a large refactor, whether for callbacks or promises. For example logical type handling currently assumes that only one schema is being parsed at a time.

I think it would be worth it.

I am closing this for now as the initial issue has been solved.