garretrieger closed this 5 months ago
@skef during today's call you mentioned potential issues in dealing with percent encoding and the d1 through d4 variables when strings contain characters outside the URI reserved/unreserved sets. I reviewed the URI Template specification and found that it provides specific guidance for how to handle these cases, in summary:
Since handling these cases is well defined, and will need to be handled anyway when implementing a URI template processor (if not reusing an existing implementation), I don't think we need to completely ban the use of non-reserved/unreserved characters in the ID values. So instead what I did is:
Can you take another look and let me know what you think?
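For illustration, here is a minimal sketch (not taken from the spec text itself) of the RFC 6570 simple-string expansion behavior referenced above: any character outside the URI unreserved set is UTF-8 encoded and then percent-encoded.

```python
# Sketch of RFC 6570 simple-string expansion for a {id} variable.
# Characters outside the URI "unreserved" set (RFC 3986) are UTF-8
# encoded and each resulting byte is percent-encoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def expand_simple(value: str) -> str:
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else "%%%02X" % byte)
    return "".join(out)

print(expand_simple("abc"))    # abc
print(expand_simple("a b"))    # a%20b
print(expand_simple("héllo"))  # h%C3%A9llo
```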
I guess I'm having a bit of a Pandora's box reaction to this.
What we do here has to (or, I guess, might need to) work with both the URL and, in the normal case, whatever filesystem is being used to store the patches. So, what about case-insensitive filesystems? What about characters that can't be encoded in a given filesystem, or can't be without escaping?
When the value is encoded in the URL does the server-side un-encode before searching the filesystem (I guess that would be normal ...)? If it's a goal to not need a server-side, how does the client predict what characters might need escaping in the server-side filesystem to make the request? Or if the encoder adds unusual characters to the ID is it just the server-side's problem to map incoming URLs to binary blobs, perhaps explicitly in some cases?
What we had was so simple and didn't raise these questions.
> I guess I'm having a bit of a Pandora's box reaction to this.
>
> What we do here has (or, I guess, might need to) work with both the URL and, in the normal case, whatever filesystem is being used to store the patches. So, what about case-insensitive filesystems? What about characters that can't be encoded in a given filesystem, or can't be without escaping?
>
> When the value is encoded in the URL does the server-side un-encode before searching the filesystem (I guess that would be normal ...)?
Yes, the server side would undo the percent encoding before matching against a file name.
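That server-side step can be sketched with Python's standard `urllib` (the file name here is hypothetical):

```python
from urllib.parse import unquote

# Undo the percent encoding on an incoming URL path segment before
# matching it against file names on disk (hypothetical segment).
segment = "h%C3%A9llo.br"
filename = unquote(segment)
print(filename)  # héllo.br
```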
> If it's a goal to not need a server-side, how does the client predict what characters might need escaping in the server-side filesystem to make the request? Or if the encoder adds unusual characters to the ID is it just the server-side's problem to map incoming URLs to binary blobs, perhaps explicitly in some cases?
In my mind this is purely a responsibility of the encoder. The client will just process the URL templates according to the URI Template spec and request whatever URL it gets as a result. It is up to encoder implementations to pick file names (and the corresponding URLs) that will work correctly with whatever combination of file system and HTTP server they intend to support. If encoders use numeric IDs (which should produce the smallest encoding) this won't be an issue. If they do for some reason choose to use string IDs, the text as currently written has "should"-level guidance to stick to URL-safe characters, which will also be safe. An encoder that chooses to go outside the "should" guidance does so at its own risk and needs to understand the implications (for context, see the definition of "should" as used in this spec: https://datatracker.ietf.org/doc/html/rfc2119#section-3).
I expect that pretty much all file-based encoders will stick to numeric IDs, as that will produce the smallest encoding in nearly all cases. However, I'll also add some guidance that explicitly recommends encoders use numeric IDs for any file-based encoding.
I will also expand on the guidance I've provided a bit further to highlight the risks you've mentioned here.
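One way an encoder might implement the "should"-level guidance mentioned above, sketched here with hypothetical names:

```python
import re

# Hypothetical encoder-side check: accept only URI "unreserved"
# characters (RFC 3986), so the expanded URL needs no percent-encoding
# and the on-disk file name matches it byte-for-byte on any filesystem.
URL_SAFE = re.compile(r"[-._~A-Za-z0-9]+")

def is_portable_id(id_str: str) -> bool:
    return URL_SAFE.fullmatch(id_str) is not None

print(is_portable_id("patch_01"))  # True
print(is_portable_id("my path"))   # False: space needs escaping
```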
What is the downside of narrowing the allowed characters for a custom ID so that the result is highly likely to be filesystem-neutral? It seems very preferable to me if in almost all cases an encoded font can be moved from one filesystem to another without having to think about this stuff. What are we buying with the extra flexibility in the custom ID set?
> What is the downside of narrowing the allowed characters for a custom ID so that the result is highly likely to be filesystem-neutral? It seems very preferable to me if in almost all cases an encoded font can be moved from one filesystem to another without having to think about this stuff. What are we buying with the extra flexibility in the custom ID set?
Mandating a specific allowed character set would then require the client to implement logic to enforce it; if we don't mandate that the client enforce the restrictions, they end up being just a suggestion. Furthermore, we'd have to enforce the same restrictions on the URI templates themselves, since a template could include non-ASCII Unicode code points even if none of the ID strings did. I'd like to avoid having to override parts of the URI Template specification; it's cleaner to just use it as is.
Beyond that, since URI templates, URIs (via percent encoding), modern file systems, and HTTP servers all support Unicode, I don't think we should artificially restrict what can be used. In my opinion we should leave services/encoders free to choose what works best for their use case. If a specific encoder implementation wants its output to be ultra-portable, even to legacy file systems, we've provided a clear pathway for that.
An example where the restrictions could be problematic: let's say someone wants to host an IFT font on a sub-path of some HTTP service, but they don't have control over the naming of the path above where they are going to place the patch files (e.g. http://foo.bar/my%20path/).
OK, that's normally dealt with by having the URL be relative but I see the point.
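For reference, a sketch of how a relative patch path would resolve against such a base (the font file and patch names here are hypothetical):

```python
from urllib.parse import urljoin

# Resolve a relative patch URL against the (already percent-encoded)
# base URL of a hypothetical IFT font file.
base = "http://foo.bar/my%20path/font.ift"
print(urljoin(base, "patches/4.br"))
# http://foo.bar/my%20path/patches/4.br
```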
Adds the ability to optionally assign entries in a format 2 table opaque string IDs (instead of numeric IDs).
Also widens several of the format 2 fields: