Language agnostic schema

rsheeter commented 1 year ago

Once #84 is done we're getting close to the codegen input being language agnostic. It "just" needs to be in a form other than Rust and to only have attributes that make sense beyond Rust. Strawman to incite debate: use TOML (https://toml.io/en/). It's simple, widely supported, supports comments, and more than sufficient to capture what we need.

Today

/// [COLR (Color)](https://learn.microsoft.com/en-us/typography/opentype/spec/colr#colr-header) table
table Colr {
    /// Table version number - set to 0 or 1.
    #[version]
    version: u16,
    /// Number of BaseGlyph records; may be 0 in a version 1 table.
    num_base_glyph_records: u16,
    /// Offset to baseGlyphRecords array (may be NULL).
    #[nullable]
    #[read_offset_with($num_base_glyph_records)]
    base_glyph_records_offset: Offset32<[BaseGlyph]>,

Tomorrow

tag = "COLR"
root = "Colr"

[table.Colr]
comment = "[COLR (Color)](https://learn.microsoft.com/en-us/typography/opentype/spec/colr#colr-header) table"

[table.Colr.version]
type = "u16"
attrib = ["version"]
comment = "Table version number - set to 0 or 1."

[table.Colr.num_base_glyph_records]
type = "u16"
comment = "Number of BaseGlyph records; may be 0 in a version 1 table."

[table.Colr.base_glyph_records_offset]
type = "Offset32<BaseGlyph[num_base_glyph_records]>"
attrib = ["nullable"]
comment = "Offset to baseGlyphRecords array (may be NULL)."

rsheeter commented 1 year ago

@dfrg notes it would be helpful if we explicitly captured what can/cannot be constructed (which is currently hidden/internal)

cmyr commented 1 year ago

This is a sketch for the general structure of a schema. The intent here is to figure out a structure capable of representing all of the things that we would like to know about a font table. It is written in a format-agnostic style; we would pick an actual format if we choose to implement this.

This is incomplete. The intention here is to show generally what this would look like, and I can pursue it if there is consensus that this is a useful line of inquiry.

Note: the structure I've chosen here is ad-hoc and infinitely bike-sheddable; it can also be discussed if we decide to proceed.

Type

A Type is a string, which is one of:

the core data types (uint16, Fixed, etc)
the conventional data types (GlyphId, NameId, MajorMinor?)
the name of a table, record, flagset defined elsewhere.

Table

A table object has the following fields:

field	type	required	notes
name	string	yes	the name of the table
sfnt tag	Tag	no	the sfnt tag for this table, if it is top-level
short doc	string	yes	a short description of this table
long doc	string	no	additional information about this table
doc link	string	yes	a link to online documentation for this table
input args	[InputArgument]	no	only if this table requires external data to be parsed
formats	[FormatTable]	no	a list of table formats. must not exist if 'fields' exists
fields	[Field]	no	a list of fields. must not exist if 'formats' exists

The 'formats' and 'fields' fields are mutually exclusive, and exactly one must be present.
If 'formats' is present, all entries must have the same 'format type' and distinct 'format' values.
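The two rules above could be checked mechanically. The following is a minimal Python sketch, assuming a Table is represented as a plain dict keyed by the field names from the table above (including the keys with spaces); the `validate_table` helper and the example tables are hypothetical.

```python
def validate_table(table):
    """Return a list of rule violations for a Table-like dict (empty if none)."""
    errors = []
    has_formats = "formats" in table
    has_fields = "fields" in table
    # Mutually exclusive, and one must be present: exactly one of the two.
    if has_formats == has_fields:
        errors.append("exactly one of 'formats' or 'fields' must be present")
    if has_formats:
        formats = table["formats"]
        if len({f["format type"] for f in formats}) > 1:
            errors.append("all formats must share one 'format type'")
        values = [f["format"] for f in formats]
        if len(values) != len(set(values)):
            errors.append("'format' values must be distinct")
    return errors


# Hypothetical tables, only to exercise the checks.
ok_table = {
    "name": "Example",
    "formats": [
        {"format type": "uint16", "format": 1},
        {"format type": "uint16", "format": 2},
    ],
}
bad_table = {
    "name": "Example",
    "fields": [],
    "formats": [
        {"format type": "uint16", "format": 1},
        {"format type": "uint8", "format": 1},
    ],
}

print(validate_table(ok_table))
print(validate_table(bad_table))
```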

InputArgument

An input argument is a name and a type.

field	type	required	notes
name	string	yes	the name of this argument, used in the containing table
type	Type	yes	the type of the argument

FormatTable

A single format of a multi-format table.

field	type	required	notes
format type	Type	yes	the type of the format value, e.g. uint16
format	int	yes	the format value. Must be valid for 'format type'
table	Table	yes	the Table for this format.

The Table's first field must be the format field, with the type listed here.
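A sketch of what "valid for 'format type'" and the first-field rule might mean in practice, in Python. It assumes a FormatTable is a plain dict, that only integer format types are allowed, and that the first-field rule means the first field's type equals 'format type'. The `INT_RANGES` list is partial and `check_format_table` is a hypothetical helper.

```python
# Inclusive value ranges for a few integer types; an assumed, partial list.
INT_RANGES = {
    "uint8": (0, 0xFF),
    "uint16": (0, 0xFFFF),
    "int16": (-0x8000, 0x7FFF),
    "uint32": (0, 0xFFFFFFFF),
}


def check_format_table(entry):
    """Return a list of rule violations for a FormatTable-like dict."""
    errors = []
    format_type = entry["format type"]
    bounds = INT_RANGES.get(format_type)
    if bounds is None:
        errors.append(f"unsupported format type {format_type!r}")
    elif not bounds[0] <= entry["format"] <= bounds[1]:
        errors.append(f"format {entry['format']} does not fit in {format_type}")
    # Assumed reading of the rule: the first field carries the format value.
    fields = entry["table"].get("fields", [])
    if not fields or fields[0]["type"] != format_type:
        errors.append("first field of the table must have the format type")
    return errors


# Hypothetical entries, only to exercise the checks.
good = {
    "format type": "uint16",
    "format": 1,
    "table": {"name": "ExampleFormat1", "fields": [{"name": "format", "type": "uint16"}]},
}
bad = {
    "format type": "uint8",
    "format": 300,
    "table": {"name": "ExampleFormat300", "fields": [{"name": "count", "type": "uint16"}]},
}

print(check_format_table(good))
print(check_format_table(bad))
```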

Field

A field is a named value at a given position in a Table or Record.

field	type	required	notes
name	string	yes	the name of the field
type	Type	yes	the type of the field
doc	string	yes	a short description of this field
offset	OffsetInfo	no	required if this field is an offset
count	CountInfo	no	required if this field is an array or sequence

OffsetInfo

TK

CountInfo

CountInfo is additional information for computing the length of a sequence or array.

This has two parts. The first is the source of the count value, which is generally either the name of a sibling field or a literal. The second part identifies a possible transformation applied to this value.

field	type	required	notes
value	CountValue	yes	indicates the input value for computing the count
transform	CountTransform	no	a token identifying a computation on the input value

CountValue

CountValue represents the source for the base input value used to compute the count.

field	type	required	notes
field	string	no	the name of a field or 'input arg'
literal	int	no	a literal integer
all	()	no	a flag indicating that the sequence consumes the rest of the table's data

Exactly one of these fields must be present.

CountTransform

The count transform is an enum, serialized as an integer, with the following defined values:

name	value	function
MINUS_ONE	1	subtract 1 from the input
DIVIDE_BY_TWO	2	divide the input value by 2

unhandled: Device table delta values
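To make the CountInfo pieces above concrete, here is a minimal Python sketch of how a reader might resolve a count. It assumes CountInfo and CountValue are plain dicts, that DIVIDE_BY_TWO is integer division, and that for 'all' the count is the remaining byte length divided by an element size with no transform applied. The `resolve_count` helper, its `remaining_bytes` and `elem_size` parameters, and the field name `seg_count_x2` are all hypothetical.

```python
from enum import IntEnum


class CountTransform(IntEnum):
    """Transform tokens, using the integer values from the table above."""
    MINUS_ONE = 1
    DIVIDE_BY_TWO = 2


def resolve_count(count_info, values, remaining_bytes=0, elem_size=1):
    """Compute an element count from a CountInfo-like dict.

    `values` maps already-parsed sibling fields and input args to integers.
    """
    source = count_info["value"]
    present = [key for key in ("field", "literal", "all") if key in source]
    if len(present) != 1:
        raise ValueError("CountValue needs exactly one of field/literal/all")
    kind = present[0]
    if kind == "all":
        # Assumption: 'all' means "as many elements as fit in the rest of the data".
        return remaining_bytes // elem_size
    base = values[source["field"]] if kind == "field" else source["literal"]
    transform = count_info.get("transform")
    if transform is None:
        return base
    transform = CountTransform(transform)  # raises ValueError on unknown tokens
    if transform is CountTransform.MINUS_ONE:
        return base - 1
    return base // 2  # DIVIDE_BY_TWO, assumed to be integer division


halved = resolve_count(
    {"value": {"field": "seg_count_x2"}, "transform": 2},
    {"seg_count_x2": 8},
)
print(halved)
```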

Record

similar to table*

FlagSet

Enum

rsheeter commented 1 year ago

Awesome, ty. I like it and think it is valuable to pursue, and with my own biases fully intact I think this would transform magnificently to something like TOML :) I really want to try making a Python reader off such a generic schema; I think that would be a very interesting exercise that might surface interesting things.

EDIT: at mild risk of overthinking things, maybe we could have an ABNF grammar. My immediate thought is a narrowing of https://github.com/toml-lang/toml/blob/main/toml.abnf.
