source-foundry / ufofmt

A fast, flexible UFO source file formatter based on the Rust Norad library
Apache License 2.0
7 stars 1 forks source link

Support custom single / double quote formatting in write files #19

Closed chrissimpkins closed 3 years ago

chrissimpkins commented 3 years ago

norad's default is double quote formatting of the XML declaration attribute definitions in all UFO XML / plist files.

This is another source of multi-file diffs across UFO serializer tools. Some format the attribute definitions with single quotes.

The most flexible, useful formatting approach may be one where the user is permitted to define what they want?

This is not currently supported in the norad lib / its plist lib dependency TMK.

Related #18

chrissimpkins commented 3 years ago

This seems to be limited to the XML declaration that is found as the first line in every XML / plist file:

<?xml version="1.0" encoding="UTF-8"?>

ufoLib/ufoLib2/defcon use single quotes

norad/ufonormalizer use double quotes

Double quotes appear to be used in all other areas of the XML attributes across the UFO serializers that we've analyzed to date.

cmyr commented 3 years ago

Looked into this, and at the very least it's annoying; would require some upstream changes to the various plist serialization libraries.

chrissimpkins commented 3 years ago

I've been giving it some thought on this side outside of the norad library. We could do a second-pass walk of the UFO dir and do a "quick" byte value replacement on those exact double quote positions in the first line so that they become single quotes. The XML declaration line is fixed ASCII with no space indentation, always has double quotes coming out of norad. The length of the Vec<u8> wouldn't change. This would avoid the need to allocate a UTF-8 checked String for every XML file and seems to be the fastest approach (to my naive mind) if we can't do it at the norad UFO serializer level. Ideally we would just replace the first line of text in every file with a string literal because we know the exact string that we want but I can't figure out an approach that does that without reading every line of the file in and allocating a new String / Vec<u8> per line. Maybe I am over thinking it and just tossing regex with a simple single char replacement x expected number of replacements at it is fast enough. IIUC it drops down to memchr for sub-4 char, non-regex searches, supports skipping the UTF8 checks via bytes module (and allows you to offload all of the Unicode data in the regex library), and is very fast.

The re-read and re-write of the entire UFO dir XML file structure is terribly inefficient. A possible backup plan if we want to support this feature and it isn't appropriate/can't be supported in the norad serialization approach. This will be useful to those who want to get back to ufoLib/ufoLib2 style formatting and I suppose those who need this might need to suffer the performance consequences?

cmyr commented 3 years ago

That's a good point, about substitution, and if we wanted this I would be inclined to put it in norad; we can do it before we write, and should be reasonably efficient.

If we wanted to be evil geniuses about this we could even have a custom Writer type that inspects all bytes written to it, and if it finds the sequence <?xml version="1.0" encoding="UTF-8"?> replaces it with <?xml version='1.0' encoding='UTF-8'?>, this would be transparent and very low overhead.

chrissimpkins commented 3 years ago

Here's a rough implementation to see how the performance hit looks with a second set of UFO XML path r/w's for the single quote replacements: https://github.com/source-foundry/ufofmt/pull/25

Best of three times:

without single quote replacements

/usr/bin/time -p target/release/ufofmt /Users/chris/Desktop/ufofmt-testing/*.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Bold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Condensed.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedBold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedLight.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedSemiBold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Light.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Regular.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-SemiBold.ufo
real         0.31
user         0.30
sys          0.54

with single quote replacements

/usr/bin/time -p target/release/ufofmt --singlequotes /Users/chris/Desktop/ufofmt-testing/*.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Bold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Condensed.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedBold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedLight.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-CondensedSemiBold.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Light.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-Regular.ufo
[OK] /Users/chris/Desktop/ufofmt-testing/NotoSansThai-SemiBold.ufo
real         0.44
user         0.32
sys          0.84

Not terrible... :)

Not sure where to get started on the norad side with something like you mentioned in https://github.com/source-foundry/ufofmt/issues/19#issuecomment-890050322. Any suggestions?

chrissimpkins commented 3 years ago

Not sure where to get started on the norad side with something like you mentioned in #19 (comment). Any suggestions?

For plist files: It looks like maybe we could switch the plist::to_file_xml (e.g. https://github.com/linebender/norad/blob/3a8b895a2995a012178b0c5f9f7fa15d3ecceb1a/src/font.rs#L274) calls to plist::to_writer_xml and do this with a writer at that stage?

chrissimpkins commented 3 years ago

For glif files, it looks like the target may be https://github.com/linebender/norad/blob/3a8b895a2995a012178b0c5f9f7fa15d3ecceb1a/src/glyph/serialize.rs#L30. This uses the quick-xml crate.

It is not possible to toggle the single/double quote type at the quick-xml level. From docs:

Does not escape any of its inputs. Always uses double quotes to wrap the attribute values.

chrissimpkins commented 3 years ago

Added in #25