mtth / avsc

Avro for JavaScript :zap:
MIT License

Infer schemas #40

Closed (mtth closed this issue 8 years ago)

mtth commented 8 years ago

Adding a function (or stream) to infer Avro types from values.

For example:

avro.infer([123, 3.5]); // <FloatType>
avro.infer(['hi', ['hey']]); // <UnionType ["string", {"type": "array", "items": "string"}]>

Exposing a stream interface would be helpful as well. That would open up possibilities like compressing arbitrarily long collections of records into an Avro file without requiring a schema up front (e.g. for logs).

Possible signature:

infer(vals, [opts])
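
To make the two examples above concrete, here is a minimal sketch of the inference idea in plain JavaScript. It is not avsc's implementation: the helper name `inferValue`, the choice to treat every integer as `int`, and the `"null"` result for empty input are all assumptions made for illustration, and records are left out entirely.

```javascript
// Infer a schema for a single value. Only null, booleans, numbers, strings
// and arrays are handled in this sketch.
function inferValue(val) {
  if (val === null) return 'null';
  if (Array.isArray(val)) return {type: 'array', items: infer(val)};
  switch (typeof val) {
    case 'boolean': return 'boolean';
    case 'string': return 'string';
    // Assumption: every integer is an "int" (no "long"), anything else a "float".
    case 'number': return Number.isInteger(val) ? 'int' : 'float';
    default: throw new Error('unsupported value: ' + String(val));
  }
}

// Combine the schemas of several values into a single schema.
function infer(vals) {
  const seen = new Map(); // JSON text -> schema, in first-seen order
  for (const val of vals) {
    const schema = inferValue(val);
    seen.set(JSON.stringify(schema), schema);
  }
  // An int also fits in a float, so drop "int" when both occur.
  if (seen.has('"int"') && seen.has('"float"')) seen.delete('"int"');
  const schemas = Array.from(seen.values());
  if (schemas.length === 0) return 'null'; // assumption for empty input
  // A single schema is returned as-is, several become a union.
  // Limitation: differing array schemas are not merged with each other.
  return schemas.length === 1 ? schemas[0] : schemas;
}

console.log(infer([123, 3.5])); // float
console.log(JSON.stringify(infer(['hi', ['hey']])));
// ["string",{"type":"array","items":"string"}]
```

Keying the seen schemas by their JSON text keeps the deduplication short, at the cost of treating two structurally equal schemas with different key order as distinct.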

dceejay commented 8 years ago

Would be great, but how would decoding then work? Or would you just unpack without a schema to check against?

mtth commented 8 years ago

@dceejay - It depends on the use-case. For example:

Since many people aren't familiar with Avro's schema notation, having this schema-inferring ability would also make it easier to get started. You would just put in a few representative values and then use the generated schema as you wish. For complex structured data where the generated schema might not be optimal, it still provides a starting point which you can edit, which is much easier than starting from scratch.

dceejay commented 8 years ago

ab-so-lutely ! :+1: Ah - so this would be to build the schema... it's not (yet...) auto encode to binary format... Though I guess you could pipeline it.

hmm - I wonder if there would be a way to "fingerprint" each schema / message to auto create an ID, such that subsequent messages of same pattern can reuse that schema (and not incur generation overhead per message) - and get labelled with ID.

mtth commented 8 years ago

> so this would be to build the schema... it's not (yet...) auto encode to binary format... Though I guess you could pipeline it.

Exactly.

> I wonder if there would be a way to "fingerprint" each schema / message to auto create an ID, such that subsequent messages of same pattern can reuse that schema

This is definitely possible. In fact the fingerprint part is already implemented: type.getFingerprint.
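
A rough sketch of how such an ID-based reuse could look, using only Node's built-in `crypto` module. The `fingerprint` helper and the `SchemaRegistry` class are hypothetical names for illustration; hashing the schema's JSON text with MD5 is a stand-in assumption and not how `type.getFingerprint` is implemented.

```javascript
const crypto = require('crypto');

// Stand-in fingerprint (assumption): MD5 over the schema's JSON text.
// Two schemas that differ only in key order would get different IDs here.
function fingerprint(schema) {
  return crypto.createHash('md5').update(JSON.stringify(schema)).digest('hex');
}

// Hypothetical registry: stores each distinct schema once, keyed by fingerprint.
class SchemaRegistry {
  constructor() {
    this.schemas = new Map();
  }

  // Returns the ID for a schema, storing the schema the first time it is seen.
  register(schema) {
    const id = fingerprint(schema);
    if (!this.schemas.has(id)) {
      this.schemas.set(id, schema);
    }
    return id;
  }

  // Looks up a previously registered schema by its ID.
  lookup(id) {
    return this.schemas.get(id);
  }
}

const registry = new SchemaRegistry();
const schema = {type: 'array', items: 'int'};
const first = registry.register(schema);
const second = registry.register({type: 'array', items: 'int'});
console.log(first === second); // true
console.log(registry.schemas.size); // 1
```

Messages could then carry the ID alongside their payload, so a reader only needs the full schema the first time it meets a given ID.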

dceejay commented 8 years ago

nice. Can't wait to try it :-)

I guess there is no obvious way to do an enum - but as you said it would be a great way to get started.

ms440 commented 8 years ago

That would be great!

mtth commented 8 years ago

Small update: the public API for this isn't quite ready yet, but the repo now includes a script to infer a schema from a given value, in case you'd like to try it out. Sample commands below.

$ ./etc/scripts/infer '[1, 2, 3]' # An array of integers.
{"type":"array","items":"int"}

$ ./etc/scripts/infer '[1, 2, 3, 4.5]' # Adding a float.
{"type":"array","items":"float"}

$ ./etc/scripts/infer '{"id": 48, "on": true}' # A sample record.
{"name":"_0","type":"record","fields":[{"name":"id","type":"int"},{"name":"on","type":"boolean"}]}

mtth commented 8 years ago

Released as part of 4.1.0!

Helpful links:

In particular, the last script lets you infer the type from any file containing JSON values:

$ cat example.jsonl
{"id": 12, "title": "Hello"}
{"id": 3}
{"id": 4, "title": "Hey"}

$ ./etc/scripts/infer <example.jsonl
{
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "title", "type": ["null", "string"], "default": null}
  ]
}
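
The output above shows a field that is missing from some records becoming a nullable union with a `null` default. A minimal sketch of that merging step, assuming flat records with primitive fields only; `primitiveType` and `inferRecord` are hypothetical helpers, not the script's actual code:

```javascript
// Map a JSON primitive to an Avro primitive type name.
// Assumption: every integer is an "int", every other number a "float".
function primitiveType(val) {
  switch (typeof val) {
    case 'boolean': return 'boolean';
    case 'string': return 'string';
    case 'number': return Number.isInteger(val) ? 'int' : 'float';
    default: throw new Error('unsupported value: ' + String(val));
  }
}

// Infer a record schema from several flat records. A field that is absent
// from at least one record becomes ["null", type] with a null default.
function inferRecord(records) {
  const fields = new Map(); // field name -> {type, count}
  for (const rec of records) {
    for (const [name, val] of Object.entries(rec)) {
      const type = primitiveType(val);
      const entry = fields.get(name);
      if (!entry) {
        fields.set(name, {type, count: 1});
      } else if (entry.type !== type) {
        throw new Error('conflicting types for field ' + name);
      } else {
        entry.count++;
      }
    }
  }
  return {
    type: 'record',
    fields: Array.from(fields, ([name, {type, count}]) =>
      count === records.length
        ? {name, type}
        : {name, type: ['null', type], default: null}),
  };
}

// The three lines of example.jsonl from above.
const lines = [
  '{"id": 12, "title": "Hello"}',
  '{"id": 3}',
  '{"id": 4, "title": "Hey"}',
];
const schema = inferRecord(lines.map((line) => JSON.parse(line)));
console.log(JSON.stringify(schema));
// {"type":"record","fields":[{"name":"id","type":"int"},{"name":"title","type":["null","string"],"default":null}]}
```

Counting how often each field appears is enough to tell optional fields from required ones in this simplified setting.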

dceejay commented 8 years ago

excellent - thanks.