Thoughts on <note> and other string data

cmyr commented 3 years ago

Was just looking a bit into ufonormalizer's current behaviour, and ran into a funny one, which is the modification of the whitespace in the <note> tag.

To me, the <note> field contains an arbitrary string, which is user data, and ufonormalizer should not modify that string. For instance, if I post some snippet of python code into the note field of a glyph in my font editor, I would expect whitespace to be preserved; currently ufonormalizer will modify it.

Is there a rationale here that I'm missing? for the purposes of normalization, it should be fine to just never touch the contents of the <note> tag; this will still produce consistent output. (ignoring moving the tag, with its unmodified contents, which would be expected).

If it is desirable to modify the contents of <note>, should the normalizer also be modifying the contents of any <string> values in any <lib> sections? These also contain arbitrary string data, and maybe it makes sense to be consistent?

benkiel commented 3 years ago

Colin, you'll see the rational here: https://github.com/unified-font-object/ufoNormalizer/pull/72. Let me know if that answers things or not.

benkiel commented 3 years ago

to add to this, the style (as I understand it) is to have the value indented between tags, that issue fixed a bug in the normalizer that was adding extra tabs. Possible that it had an unintended consequence, though the tests seemed good for it. Have a look and let me know

cmyr commented 3 years ago

I saw that, and unless I'm missing something I don't think it answers my question, which is whether normalizer should be touching the contents of <note> at all; that seems to be working from the assumption that this behaviour is desired, and just trying to make it consistent?

Is there a specific style that you're referring to? my naive interpretation would be that indentation refers only to elements, and not to their contents.

As a (non-authoritative) example, if I run the following string through the first online XML normalizer I could find (https://www.freeformatter.com/xml-formatter.html):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>adjustmentTimestamp</key>
<date>2021-06-26T03:14:38Z</date>
<key>coolString</key>
<string>This is a string. it consists of several
lines,
     spaces
    tab
fun
</string>
</dict>
</plist>

I get,

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
   <dict>
      <key>adjustmentTimestamp</key>
      <date>2021-06-26T03:14:38Z</date>
      <key>coolString</key>
      <string>This is a string. it consists of several
lines,
     spaces
    tab
fun</string>
   </dict>
</plist>

that is to say, it indents the elements, but does not modify the contents of the <string> element in any way.

edit: it does modify the contents of <string>, in that it will apparently strip and leading and trailing whitespace. 🤷

One way to think about this: (I think that) XML should losslessly roundtrip. As things stand, a tool opening a UFO file would need to know whether or not the file had been normalized, in order to attempt to determine what the original content was (and may not be able to determine this anyway?)

benkiel commented 3 years ago

Going through the code and tests, here are my thoughts (note: I did not write any of the ufonormalizer code, so this is all conjecture).

Looking at the tests, it seems that the normalizer has always transformed the note tag by placing the contents indented between an opening <note> and a closing </note> tag, much like how the <dict> tag is handled. My assumption for this behavior is that the note is intended as a multiline element, unlike most <string> elements, and that the <note> was going to contain multiline text, so the formatting matches what one would expect in a docstring, it's nicer/easier to read.

I see from your attached issue that the thing you are putting python code in the <note> and the changing of the indents then breaks the code if one doesn't know to dedent the note before running the code, and —yes— there is no way for one to know if the normalizer has been run on a ufo.

While I agree with the intent of the original behavior, it is much easer to scan/read that way, I do agree that it likely shouldn't change the meaning of things that really need indents to stay as they are.

The "best" way I can think of doing this would be to keep the behavior of having <note> and <\note> on their own lines, and then not changing any whitespace. This would mean that existing normalized UFOs wouldn't change when normalized. Does that seem like a reasonable solution to this?

benkiel commented 3 years ago

(and if you find it reasonable, could you make PR for this?)

cmyr commented 3 years ago

The fact that this would not cause any 'reformatting' of already normalized ufos is something I hadn't considered, and is very encouraging; it means if we make this change it shouldn't annoy anyone already using the tool.

I'll take a look at this next week, thank you for being so responsive!

unified-font-object / ufoNormalizer

Thoughts on <note> and other string data #84