w3c / mathml

MathML4 editors draft
https://w3c.github.io/mathml/

Make MathML attributes ASCII case-insensitive #178

Open fred-wang opened 4 years ago

fred-wang commented 4 years ago

This is a follow-up of #22 ; we decided to follow HTML/CSS which treat things as ASCII case-insensitive. Concretely, ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt has this line:

017F; C; 0073; # LATIN SMALL LETTER LONG S

which means that falſe is case-insensitively equal to false. However, it is not ASCII case-insensitively equal to false (only the a-z <-> A-Z equivalences are considered in that case).
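A minimal Python sketch of that difference, assuming Python's `str.casefold()` as a stand-in for Unicode case folding. The helper names `ascii_lower`, `ascii_ci_match` and `unicode_ci_match` are hypothetical and not taken from any spec:

```python
# Translation table that maps only A-Z to a-z; every other character is left alone.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def ascii_lower(s: str) -> str:
    """Lowercase ASCII upper alphas only."""
    return s.translate(_ASCII_LOWER)


def ascii_ci_match(a: str, b: str) -> bool:
    """Compare two strings after ASCII-lowercasing both."""
    return ascii_lower(a) == ascii_lower(b)


def unicode_ci_match(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (assumption: str.casefold)."""
    return a.casefold() == b.casefold()


long_s_false = "fal\u017fe"  # "false" spelled with LATIN SMALL LETTER LONG S

print(unicode_ci_match(long_s_false, "false"))  # True
print(ascii_ci_match(long_s_false, "false"))    # False
print(ascii_ci_match("FALSE", "false"))         # True
```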

Currently, the MathML Core spec just says "case-insensitive".

Note: for CSS colors, I reported https://github.com/w3c/csswg-drafts/issues/4599

fred-wang commented 4 years ago

I did a quick check and for the MathML-specific definitions, I only see case-insensitive against strings with ASCII letters and dashes. So the only difference would be for "LATIN SMALL LETTER LONG S", "KELVIN SIGN" and maybe a few "LATIN SMALL LIGATURE" (e.g. double-STruck). Unlikely for "LATIN CAPITAL LETTER I WITH DOT ABOVE" if the Turkish rule is used. See https://github.com/w3c/csswg-drafts/issues/4599#issuecomment-565794132
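One rough way to look for such characters is to scan for non-ASCII code points whose case folding consists only of ASCII letters. This is a sketch, again assuming Python's `str.casefold()` approximates CaseFolding.txt; the function name is hypothetical, and the exact list depends on the Unicode version bundled with the interpreter:

```python
import unicodedata


def folds_into_ascii_letters(ch: str) -> bool:
    """True if case folding turns ch into one or more ASCII letters."""
    folded = ch.casefold()
    return folded.isascii() and folded.isalpha()


# Scan the Basic Multilingual Plane, skipping ASCII itself.
candidates = [
    chr(cp) for cp in range(0x80, 0x10000) if folds_into_ascii_letters(chr(cp))
]

for ch in candidates:
    name = unicodedata.name(ch, "?")
    print(f"U+{ord(ch):04X} {name} -> {ch.casefold()!r}")
```

LATIN CAPITAL LETTER I WITH DOT ABOVE is not expected in the output, because under the default (non-Turkish) rule it folds to an ASCII i followed by a combining dot above.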

NSoiffer commented 4 years ago

That's a good catch. I'm pretty sure we all agree that we only mean ASCII case-insensitivity. I suggest we add the following to the spec, which is a slight rewording from the HTML spec:

Many strings in the HTML and CSS syntax (e.g. the names of elements and their attributes) are case-insensitive, but only for ASCII upper alphas and ASCII lower alphas. For convenience, in this specification this is just referred to as "case-insensitive".

I suggest this goes into Appendix G.1: Document Conventions.

fred-wang commented 4 years ago

I would prefer to be explicit everywhere and use "ASCII case-insensitive" with a link to https://infra.spec.whatwg.org/#ascii-case-insensitive ; this seems to be what the HTML and CSS specifications do (or how they would be fixed, e.g. https://github.com/w3c/csswg-drafts/issues/4599#issuecomment-565599403). I'm sure that if we just keep "case-insensitive" as it is now, people will easily miss the appendix. We should also avoid duplicating definitions from HTML5, as was mentioned in another issue.

fred-wang commented 4 years ago

Consensus from 2019/12/16: Move to ASCII case-insensitiveness

fred-wang commented 4 years ago

These are the attributes, with the behavior changes that will require tests:

Other attributes rely on CSS ( https://mathml-refresh.github.io/mathml-core/#types-for-mathml-attribute-values ) so nothing is changed here (although tests can always be added).

davidcarlisle commented 4 years ago

If we want to keep the RelaxNG schema there are two choices: either we could say in words that values should be ASCII-lowercased before validation, or we could make the schema do the case-insensitive match.

That would mean for example changing

attribute mathvariant {"normal" | "bold" | "italic" | "bold-italic" | "double-struck" | "bold-fraktur" | "script" | "bold-script" | "fraktur" | "sans-serif" | "bold-sans-serif" | "sans-serif-italic" | "sans-serif-bold-italic" | "monospace" | "initial" | "tailed" | "looped" | "stretched"}?,

to

attribute mathvariant {xsd:string{pattern="[Nn][Oo][Rr][Mm][Aa][Ll]|[Bb][Oo][Ll][Dd]|[Ii][Tt][Aa][Ll][Ii][Cc]|[Bb][Oo][Ll][Dd]-[Ii][Tt][Aa][Ll][Ii][Cc]|[Dd][Oo][Uu][Bb][Ll][Ee]-[Ss][Tt][Rr][Uu][Cc][Kk]|[Bb][Oo][Ll][Dd]-[Ff][Rr][Aa][Kk][Tt][Uu][Rr]|[Ss][Cc][Rr][Ii][Pp][Tt]|[Bb][Oo][Ll][Dd]-[Ss][Cc][Rr][Ii][Pp][Tt]|[Ff][Rr][Aa][Kk][Tt][Uu][Rr]|[Ss][Aa][Nn][Ss]-[Ss][Ee][Rr][Ii][Ff]|[Bb][Oo][Ll][Dd]-[Ss][Aa][Nn][Ss]-[Ss][Ee][Rr][Ii][Ff]|[Ss][Aa][Nn][Ss]-[Ss][Ee][Rr][Ii][Ff]-[Ii][Tt][Aa][Ll][Ii][Cc]|[Ss][Aa][Nn][Ss]-[Ss][Ee][Rr][Ii][Ff]-[Bb][Oo][Ll][Dd]-[Ii][Tt][Aa][Ll][Ii][Cc]|[Mm][Oo][Nn][Oo][Ss][Pp][Aa][Cc][Ee]|[Ii][Nn][Ii][Tt][Ii][Aa][Ll]|[Tt][Aa][Ii][Ll][Ee][Dd]|[Ll][Oo][Oo][Pp][Ee][Dd]|[Ss][Tt][Rr][Ee][Tt][Cc][Hh][Ee][Dd]"}}?,

which works but isn't very human readable or informative.
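If the expanded pattern were wanted, it would not have to be written by hand. The following sketch generates it from the readable list; `ci_pattern` and `ci_alternation` are hypothetical helper names, and Python's `re` is used only as a rough check since it agrees with XSD patterns on this simple subset:

```python
import re

MATHVARIANTS = [
    "normal", "bold", "italic", "bold-italic", "double-struck",
    "bold-fraktur", "script", "bold-script", "fraktur", "sans-serif",
    "bold-sans-serif", "sans-serif-italic", "sans-serif-bold-italic",
    "monospace", "initial", "tailed", "looped", "stretched",
]


def ci_pattern(value: str) -> str:
    """Turn each ASCII letter into a two-character class such as [Bb].

    Other characters are copied unchanged, which is only safe because the
    keywords here contain nothing but letters and "-".
    """
    return "".join(
        f"[{ch.upper()}{ch.lower()}]" if ch.isascii() and ch.isalpha() else ch
        for ch in value
    )


def ci_alternation(values) -> str:
    """Join the per-keyword patterns with "|"."""
    return "|".join(ci_pattern(v) for v in values)


pattern = ci_alternation(MATHVARIANTS)
print(f'attribute mathvariant {{xsd:string{{pattern="{pattern}"}}}}?,')

# Rough check with Python's regex engine.
print(bool(re.fullmatch(pattern, "BOLD-fraktur")))  # True
```

Generating the pattern keeps the source list readable, but the schema that validators and readers actually see is still the expanded form.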

Since we already need some pre-processing described in words to allow data-foo attributes (or onfoo attributes) to be ignored, I'm tempted to suggest we keep the existing string match, but could be persuaded otherwise....

fred-wang commented 4 years ago

I think this was already the case since #22 ; not sure how important it is for legacy XML applications. I wonder what is done for HTML5 ?

davidcarlisle commented 4 years ago

On Tue, 17 Dec 2019 at 13:07, Frédéric Wang notifications@github.com wrote:

I think this was already the case since #22 https://github.com/mathml-refresh/mathml/issues/22 ; not sure how important it is for legacy XML applications. I wonder what is done for HTML5 ?

The validator.nu html5 validator has a relaxng schema at its core but heavily preprocesses the document with custom code before validating it, so I think pre-processing is fine (and makes the schema a lot easier to read)
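As an illustration of what such a pre-processing step could look like for MathML (this is not how validator.nu is implemented), here is a sketch using Python's standard `xml.etree.ElementTree`. The attribute set `ENUMERATED_ATTRS` and the function name are assumptions for the example, and the set is not exhaustive:

```python
import xml.etree.ElementTree as ET

# Hypothetical, non-exhaustive set of attributes with enumerated keyword values.
ENUMERATED_ATTRS = {"mathvariant", "stretchy", "dir"}

# Maps only A-Z to a-z.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def normalize_enumerated(root: ET.Element) -> ET.Element:
    """ASCII-lowercase the values of enumerated attributes, in place.

    All other attributes (id, alttext, ...) are left untouched.
    """
    for el in root.iter():
        for name, value in list(el.attrib.items()):
            if name in ENUMERATED_ATTRS:
                el.set(name, value.translate(_ASCII_LOWER))
    return root


doc = ET.fromstring(
    '<math><mo stretchy="TRUE" id="Op1">(</mo>'
    '<mi mathvariant="BOLD-fraktur">x</mi></math>'
)
normalize_enumerated(doc)
print(ET.tostring(doc, encoding="unicode"))
```

After this step the document can be validated against the existing, readable schema with its plain lowercase string matches.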

ByteEater-pl commented 4 years ago

legacy XML applications

Could you, please, define this term, @fred-wang? I don't know which XML applications are legacy and which aren't.

fred-wang commented 4 years ago

legacy XML applications

Could you, please, define this term, @fred-wang? I don't know which XML applications are legacy and which aren't.

I believe I was talking about XML-based MathML3 implementations.

fred-wang commented 4 years ago

Removing "tests" label, we have tests for mathsize and dir. It's not exhaustive, but HTML or CSS don't test exhaustively either...

Also removing core label since the only remaining changes are in mathml full

NSoiffer commented 6 days ago

Adding back spec update label because MathML 4 spec needs to mention this. I'm not sure where. The only mention of case-insensitivity in the full spec is in "A.1 Validating MathML". Maybe we can add a sentence ("All attribute names are ASCII-case insensitive") in "2.1.5 MathML Attribute Values" after it says "Attribute names are shown in a monospaced font throughout this document."

@davidcarlisle: suggestions?

davidcarlisle commented 6 days ago

@NSoiffer you mean attribute values not attribute names I think?

davidcarlisle commented 6 days ago

I don't think it can be all values: e.g. in XML, ids are case-sensitive, we have defined intent to be case-sensitive (I think), and alttext etc. don't pass on a lowercased value. The main point is to make anything that has type mathml-boolean ASCII case-insensitive (which is what the schema says), so you can use stretchy=fAlse

which means we should change instances of "true" | "false" in tables such as https://w3c.github.io/mathml/#presm_mo_dict_attrs to mathml-boolean and define that somewhere in words to be a case-insensitive match.

dginev commented 6 days ago

I was trying to understand the reference link: https://infra.spec.whatwg.org/#ascii-case-insensitive

It appears to define "ASCII case-insensitive match".

To me that reads as a more operational definition. Instead of stating that certain values are case insensitive, it appears to introduce comparisons which are done with a normalization step before the equality test.

I am not sure how that is used in the HTML spec text, but it ought to be doable to modify relevant MathML algorithms (e.g. matching an intent concept) to fit in, assuming we can find all of them.

davidcarlisle commented 6 days ago

@dginev

To me that reads as a more operational definition.

Well, it's an operational definition, but it also affects syntax.

For MathML it mainly affects any boolean attribute plus some other enumerated lists such as mathvariant

so the ascii case insensitive change means that unlike MathML3

stretchy="TRUE"

is valid, but also with the operational definition that it should be treated like

stretchy="true"

Similarly, mathvariant="BOLD-fraktur" is valid and should work like bold-fraktur.

But that doesn't mean all attributes should be lowercased; anything with text or a URL, for example, cannot be. It really just applies to attributes that take a fixed enumerated list of values.
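A sketch of that operational reading for enumerated attributes: match the supplied value against the allowed keywords after ASCII-lowercasing, and use the canonical keyword if one matches. The function name `match_keyword` and the shortened keyword sets are assumptions for illustration only:

```python
from typing import Iterable, Optional

# Maps only A-Z to a-z.
_ASCII_LOWER = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}


def match_keyword(value: str, allowed: Iterable[str]) -> Optional[str]:
    """Return the canonical keyword that value matches ASCII case-insensitively.

    Returns None when value is not one of the allowed keywords.
    """
    lowered = value.translate(_ASCII_LOWER)
    for keyword in allowed:
        if lowered == keyword.translate(_ASCII_LOWER):
            return keyword
    return None


BOOLEAN = ("true", "false")
SOME_MATHVARIANTS = ("normal", "bold", "bold-fraktur")  # shortened list

print(match_keyword("TRUE", BOOLEAN))                    # true
print(match_keyword("BOLD-fraktur", SOME_MATHVARIANTS))  # bold-fraktur
print(match_keyword("fal\u017fe", BOOLEAN))              # None
```

Attributes holding free text or URLs would never go through such a function; their values are used exactly as written.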

NSoiffer commented 6 days ago

Sorry, yes I meant values -- it was late at night...

In https://w3c.github.io/mathml/#fund_attval, we define what the syntax is for attribute values. One of the values is "string" which currently says

string | an arbitrary, nonempty and finite, string of characters

We could add something like "With the exception of intent values, string values are ASCII-case insensitive when being matched."

That assumes that we want this for all non-intent string values. If not, then I think every instance needs to say whether it is ASCII-case insensitive or not. A bit ugly, but that is what core does.

polx commented 6 days ago

IDs are also attributes that are not to be considered case-insensitive.

dginev commented 6 days ago

A general proclamation that all matches are case-insensitive sounds dangerous to me. The HTML spec mentions each case where that rule applies, examples:

There are at least 140 references to ASCII case-insensitive match in the HTML spec.

It is likely a bit painful, but I suggest we identify which algorithms in the MathML Full text permit this treatment and specify per-algorithm. Without a globally applicable rule.

davidcarlisle commented 6 days ago

@NSoiffer I think it needs to be per type so basically:

lengths, colors, mathml-boolean and mathvariants.

id, alttext, etc. can't be case-insensitive, and <mfenced open="A"> shouldn't use an a (not that A is that common a delimiter either), but also the attributes inherited from HTML like onclick=... cannot be case-insensitive as they hold JavaScript code.

NSoiffer commented 5 days ago

I was being hopeful that we didn't need to say it everywhere, but I'm convinced we do. That's a chunk of work, but maybe there is an emacs-macro someone has already written to do it :-)

davidcarlisle commented 5 days ago

I don't think we need to say this in so many places, just in the definition of the types such as boolean and length-percentage, we already specify the type of all attributes

NSoiffer commented 5 days ago

We don't (currently) have a type "boolean", at least not in the main text. Types are defined here. I only see it defined for the schema.

Take a look at the table attributes: there are a lot that take on string values and it seems that every one of them needs to say they are ASCII-case insensitive. Most elements don't have that many attributes, but most elements do have attribute values that take on string values that should be ASCII-case insensitive.

Am I misunderstanding which should be ASCII-case insensitive?

davidcarlisle commented 5 days ago

@NSoiffer sorry I shouldn't answer before coffee.

I had in mind that there were some "custom enumerations" like mathvariant that didn't use a defined type, but I'd forgotten about all the table ones. Also, we seem to mark all the booleans as "true" | "false" (copied over from MathML3); I think for those we should globally edit to reference a defined boolean type at the place you give.

It might also be possible to define "enumeration" in that chapter 2 types list, defined as an ascii case-insensitive list and state there that all attributes given as a | list of possible strings are ascii case-insensitive.

NSoiffer commented 2 days ago

With David's PR, the boolean values now use an ASCII case-insensitive match. He claims (I haven't verified) that Full is now in sync with Core wrt attribute value case sensitivity. The open question remains: do we want to make all the other "enumeration" types, such as the table attrs mentioned above, case-sensitive or case-insensitive?

The easy path is to leave things as they are now, which is what V3 says in "2.1.5.3 Color Valued Attributes"

Note that the color name keywords are not case-sensitive, unlike most keywords in MathML attribute values, for compatibility with CSS and HTML.

That statement is not in a great place for people to find...

MathML 4 is currently silent on case-sensitivity except for (now) boolean values and for concept names. The easy, compatible path is to add a line somewhere saying that, unless otherwise specified, all attribute values are case-sensitive.

The breaking change (which we have sort of done with boolean values to be compatible with HTML) is to track down all the instances where they should change to be an ASCII case-insensitive match, and make the changes there.

Thoughts?

dginev commented 2 days ago

I think lengths could also be case-insensitive for CSS+HTML compatibility.

As David suggests the enumerable types are also good candidates, since normalized matching won't introduce any new side-effects. We could even try to generalize there, with a statement like "all matches on 'enumerated' attribute values are ascii case-insensitive" ? As long as we define that category clearly...