unified-font-object / ufoLib

A low-level UFO reader and writer.
Other
37 stars 19 forks source link

Validation optional #153

Closed benkiel closed 6 years ago

benkiel commented 6 years ago

Add the option to turn on/off validation. Switch to lxml.

anthrotype commented 6 years ago

wow that was fast! ๐Ÿ˜ƒ sorry guys, I just came back from holidays and am still going through the backlog of emails so didn't get the change to review this.

So basically now we are happy to have a required dependency on lxml (a native extension module)? The times are a-changin' ๐ŸŽ‰

benkiel commented 6 years ago

@anthrotype Yes, we're fine with the dependency โ€” did you want to contribute the lxml plist bit to this?

anthrotype commented 6 years ago

I don't see lxml specified in the setup.py "install_requires" though... you only added it to the requirements.txt file, which is only used internally here for creating the test environment. If one installs ufoLib from pypi in a fresh environement where lxml is not already installed, this will now raise ImportError, if it truly is required. We need to fix this ASAP. I can do that.

benkiel commented 6 years ago

@anthrotype ah, crud, sorry โ€” that's my bad for missing it. Yes, if you can fix, that would be great, much appreciated

anthrotype commented 6 years ago

can you remind me again why in the end we decided to have lxml as hard dependency as opposed to use the ElementTree API (supported by both lxml and built-in ElementTree library) and try to use lxml when it's available else fall back to the built-in ElementTree library if not?

I have mixed feelings because, on the one had, I'm happy when we add good third-party dependencies (usually better than our artisanal "reinvented" wheel); on the other hand, this shouldn't be done too lightly, especially when this is compiled extension, and one which, unlike the others (compreffor, pyclipper, etc.) we don't control directly.

anthrotype commented 6 years ago

https://travis-ci.org/robofab-developers/fontParts/jobs/399648345#L492

benkiel commented 6 years ago

@anthrotype We could do that, not opposed. We wanted to make lxml the thing for speed, but understand wanting a graceful fallback.

anthrotype commented 6 years ago

but maybe you or @typesupply decided to require lxml because you're depending on some lxml-specific features that are not available with the built-in elementtree library? we need to test this it not the case

anthrotype commented 6 years ago

or if it is, then fine, we can add lxml as required dependency.

the problem is, we need to decide now, otherwise everybody who does pip install fontmake or fontParts or anything will get ImportError

typesupply commented 6 years ago

It's been a really long time since I wrote this... but, I didn't think I was relying on any LXML specific features. Falling back to ET when LXML isn't available is fine with me.

anthrotype commented 6 years ago

we are using pretty_print=True option in etree.to_string function, which only works for lxml, not for elementtree library, whose writer can't automatically indent.. I guess we need to use something like this as fallback: https://github.com/fonttools/fonttools/blob/8d7774a3e81df1afc6dedb1272e43c7c3aea100e/Lib/fontTools/designspaceLib/__init__.py#L91-L105

typesupply commented 6 years ago

Ah, yeah. I forgot about that. It's so weird that ET doesn't have this.

anthrotype commented 6 years ago

and also, the encoding="unicode" argument to etree.tostring is a lxml thing. Actually it's a python3 thing https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.tostring

anthrotype commented 6 years ago

i'm going to encode as a UTF-8 bytes string and decode it as Unicode string before returning it from writeGlyphToString

anthrotype commented 6 years ago

two more problems

1) ElementTree's tostring always seems to sort attributes alphabetically, even if we use OrderedDict; lxml uses the order we give it. (in the diff below, the minus lines are lxml's, plus are ElementTree's)

 - <glyph name="a" format="2">
 + <glyph format="2" name="a">

2) When it's a simple element (without children), ElementTree always adds an extra space before the ending />, lxml doesn't:

 - <unicode hex="0062"/>
 + <unicode hex="0062" />
anthrotype commented 6 years ago

yeah, ElementTree's _serialize_xml functoin calls sorted on the attributes items: https://github.com/python/cpython/blob/831c29721dcb1b768c6315a4b8a4059c4c97ee8b/Lib/xml/etree/ElementTree.py#L928

:cry:

benkiel commented 6 years ago

Perhaps that's enough to say "Alright, sorry, lxml is a requirement"?

anthrotype commented 6 years ago

it was worth a try. I really wouldn't like to implement our own elementtree xml serializer.. Let's require lxml then.

benkiel commented 6 years ago

@anthrotype Agree. If there is pushback on the lxml requirement, we can revisit. Thanks for all the work, really appreciated!

anthrotype commented 6 years ago

just pushed v2.2.1.

Oh, when tagging a new release, I like to "edit" the github release notes so that they will be displayed as formatted markdown, instead of as a monospaced code block that is used by default when pushing a git tag. I basically copy and paste the same text of the git tag. Probably there are tools to automate this, but I'm lazy.

benkiel commented 6 years ago

@anthrotype got it. and, thanks again; know we just did that one quick โ€” it had been sitting for a while with no comments on the other thread so before the code got too stale it was merged. Thanks again for your running down ETree and everything โ€” it is really appreciated!

chrissimpkins commented 6 years ago

Thoughts about a major version increment for these changes? It sounds like the mandatory use of lxml writes may have the potential for large source text diffs c/w the last release for some users (if I understood Cosimo's comment above correctly) and the elimination of default source validation on UFO reads through the main i/o reading class is a significant change in expected behavior. This has the potential to lead to breakage in projects that depend on the previous default behaviors.

benkiel commented 6 years ago

@chrissimpkins Fair point. There shouldn't (fingers crossed) be too many diff changes, we didn't change the test cases and made lxml conform to the tests, though I suppose there will still be some. Also open to changing read to validate by default if that seems better.

anthrotype commented 6 years ago

I also think it would be better to bump major version for this. I'll add the plistlib module from ufoLib2 and then go to 3.0, if you guys are also OK with it.

benkiel commented 6 years ago

@anthrotype Agree. What do you think about the default being to validate on read?

anthrotype commented 6 years ago

maybe that could be safer, as long as it's easy to opt-out. @typemytype seems to prefer validate=False here https://github.com/typesupply/defcon/pull/191/files

anthrotype commented 6 years ago

in the defcon PR I linked, the defaults are rather the opposite of what @benkiel was suggesting:

https://github.com/typesupply/defcon/pull/191/files#diff-c6a63107dc3c355ab30047131285dc06R452

read is False, write is True.. Not sure what's best.

typesupply commented 6 years ago

My 2ยข is that validation should be on by default in all cases. That preserves backwards compatibility, for what that is worth. I also didn't engineer defcon against malicious input so I prefer that turning validation off be something that one has to knowingly do.

anthrotype commented 6 years ago

makes sense. Ok, let's flip the default to be validate=True in ufoLib, and probably defcon too. @benkiel would you work on that? I'll work on integrating the new plist module.

typemytype commented 6 years ago

I was just following ufoLib in defcon, will change it to read and write to be validated by default in defcon.

benkiel commented 6 years ago

@anthrotype done in #155. This now means that validation is on by default for both read and write (it was on for write before)