garretrieger closed this 5 months ago
@skef during today's call you mentioned potential issues in dealing with percent encoding and the d1 through d4 variables when strings contain characters outside the URI reserved/unreserved sets. I reviewed the URI Template specification and found that it provides specific guidance for how to handle these cases, in summary:
Since handling these cases is well defined, and will need to be handled anyway when implementing a URI template processor (if not reusing an existing implementation), I don't think we need to completely ban the use of non-reserved/unreserved characters in the ID values. So instead what I did is:
Can you take another look and let me know what you think?
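For illustration, here is a minimal sketch (not taken from the spec text itself) of the RFC 6570 simple-string expansion behavior referenced above: any character outside the URI unreserved set is UTF-8 encoded and then percent-encoded.

```python
# Sketch of RFC 6570 simple-string expansion for a {id} variable.
# Characters outside the URI "unreserved" set (RFC 3986) are UTF-8
# encoded and each resulting byte is percent-encoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def expand_simple(value: str) -> str:
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        out.append(ch if ch in UNRESERVED else "%%%02X" % byte)
    return "".join(out)

print(expand_simple("abc"))    # abc
print(expand_simple("a b"))    # a%20b
print(expand_simple("héllo"))  # h%C3%A9llo
```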
I guess I'm having a bit of a Pandora's box reaction to this.
What we do here has to (or, I guess, might need to) work with both the URL and, in the normal case, whatever filesystem is being used to store the patches. So, what about case-insensitive filesystems? What about characters that can't be encoded in a given filesystem, or can't be without escaping?
When the value is encoded in the URL does the server-side un-encode before searching the filesystem (I guess that would be normal ...)? If it's a goal to not need a server-side, how does the client predict what characters might need escaping in the server-side filesystem to make the request? Or if the encoder adds unusual characters to the ID is it just the server-side's problem to map incoming URLs to binary blobs, perhaps explicitly in some cases?
What we had was so simple and didn't raise these questions.
> I guess I'm having a bit of a Pandora's box reaction to this.
>
> What we do here has (or, I guess, might need to) work with both the URL and, in the normal case, whatever filesystem is being used to store the patches. So, what about case-insensitive filesystems? What about characters that can't be encoded in a given filesystem, or can't be without escaping?
>
> When the value is encoded in the URL does the server-side un-encode before searching the filesystem (I guess that would be normal ...)?
Yes, the server side would undo the percent encoding before matching against a file name.
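That server-side step can be sketched with Python's standard `urllib` (the file name here is hypothetical):

```python
from urllib.parse import unquote

# Undo the percent encoding on an incoming URL path segment before
# matching it against file names on disk (hypothetical segment).
segment = "h%C3%A9llo.br"
filename = unquote(segment)
print(filename)  # héllo.br
```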
> If it's a goal to not need a server-side, how does the client predict what characters might need escaping in the server-side filesystem to make the request? Or if the encoder adds unusual characters to the ID is it just the server-side's problem to map incoming URLs to binary blobs, perhaps explicitly in some cases?
In my mind this is purely a responsibility of the encoder. The client will just process the URL templates according to the URI Template spec and request whatever URL it gets as a result. It is up to encoder implementations to pick file names (and the corresponding URLs) that will work correctly with whatever combination of file system and HTTP server they intend to support. If encoders use numeric IDs (which should produce the smallest encoding) this won't be an issue. If they do for some reason choose to use string IDs, the text as currently written has "should"-level guidance to stick to URL-safe characters, which will also be safe. An encoder that chooses to go outside the "should" guidance does so at its own risk and needs to understand the implications (for context, see the definition of "should" as used in this spec: https://datatracker.ietf.org/doc/html/rfc2119#section-3).
I expect that pretty much all file-based encoders will stick to numeric IDs, as that will produce the smallest encoding in nearly all cases. However, I'll also add some guidance that explicitly recommends encoders use numeric IDs for any file-based encoding.
I will also expand on the guidance I've provided a bit further to highlight the risks you've mentioned here.
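One way an encoder might implement the "should"-level guidance mentioned above, sketched here with hypothetical names:

```python
import re

# Hypothetical encoder-side check: accept only URI "unreserved"
# characters (RFC 3986), so the expanded URL needs no percent-encoding
# and the on-disk file name matches it byte-for-byte on any filesystem.
URL_SAFE = re.compile(r"[-._~A-Za-z0-9]+")

def is_portable_id(id_str: str) -> bool:
    return URL_SAFE.fullmatch(id_str) is not None

print(is_portable_id("patch_01"))  # True
print(is_portable_id("my path"))   # False: space needs escaping
```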
What is the downside of narrowing the allowed characters for a custom ID so that the result is highly likely to be filesystem-neutral? It seems very preferable to me if in almost all cases an encoded font can be moved from one filesystem to another without having to think about this stuff. What are we buying with the extra flexibility in the custom ID set?
> What is the downside of narrowing the allowed characters for a custom ID so that the result is highly likely to be filesystem-neutral? It seems very preferable to me if in almost all cases an encoded font can be moved from one filesystem to another without having to think about this stuff. What are we buying with the extra flexibility in the custom ID set?
Mandating a specific allowed character set would then require the client to implement logic to enforce it; if we don't mandate that the client enforce the restrictions, they end up being just a suggestion. Furthermore, we'd have to enforce the same restrictions on the URI templates themselves, since a template could include non-ASCII Unicode code points even if none of the ID strings did. I'd like to avoid having to override parts of the URI Template specification; it's cleaner to just use it as is.
Beyond that, since URI templates, URIs (via percent encoding), modern file systems, and HTTP servers all support Unicode, I don't think we should artificially restrict what can be used. In my opinion we should leave services/encoders free to choose what works best for their use case. If a specific encoder implementation wants its output to be ultra-portable, even to legacy file systems, we've provided a clear pathway for that.
An example where the restrictions could be problematic: let's say someone wants to host an IFT font on a sub-path of some HTTP service, but they don't have control over the naming of the path above where they are going to place the patch files (e.g. http://foo.bar/my%20path/).
OK, that's normally dealt with by having the URL be relative but I see the point.
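For reference, a sketch of how a relative patch path would resolve against such a base (the font file and patch names here are hypothetical):

```python
from urllib.parse import urljoin

# Resolve a relative patch URL against the (already percent-encoded)
# base URL of a hypothetical IFT font file.
base = "http://foo.bar/my%20path/font.ift"
print(urljoin(base, "patches/4.br"))
# http://foo.bar/my%20path/patches/4.br
```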
Adds the ability to optionally assign entries in a format 2 table opaque string IDs (instead of numeric IDs).
Also widens several of the format 2 fields: