unified-font-object / ufoNormalizer

A tool that will normalize the XML and other data inside of a UFO.
Other
51 stars 19 forks source link

Trouble normalizing the indention of new lines in Glyph Notes #71

Closed colinmford closed 4 years ago

colinmford commented 4 years ago

Hi,

We've been noticing that if a glyph has line returns in a note, the Normalizer has some trouble formatting it.

For instance, here is how a note started out after it was added, and its first Normalization:

<note>
    Line1
    Line2
</note>

and with each round of normalization, all the lines after the first get indented an extra level

Once:

<note>
    Line1
            Line2
</note>

Twice:

<note>
    Line1
                    Line2
</note>

Thrice:

<note>
    Line1
                            Line2
</note>

etc.

colinmford commented 4 years ago

I think I tracked it down to XMLWriter.text() and XMLWriter.raw() https://github.com/unified-font-object/ufoNormalizer/blob/5b67c7aa760ef66748badc5f208a7d1aab9a0c69/src/ufonormalizer.py#L1180-L1199 https://github.com/unified-font-object/ufoNormalizer/blob/5b67c7aa760ef66748badc5f208a7d1aab9a0c69/src/ufonormalizer.py#L1170-L1174

if the text starts out as something like this

<note>
    Line1
    Line2
    Line3
</note>

it enters the XMLWriter.text() like this

"\n\tLine1\n\tLine2\n\tLine3\n"

it's immediately stripped on line 1181, making it

"Line1\n\tLine2\n\tLine3"

then, it's split into lines on line 1184, making the lines

("Line1", "\tLine2", "\tLine3")

At the end of the function, on line 1199, each line goes to XMLWriter.raw() and gets re-indented on line 1172. Making the self._lines list this:

["\tLine1", "\t\tLine2", "\t\tLine3"]

So, in #72 I use re.split() to split the text on this regex: r"[\n|\r|\r\n]\t*" instead of using .splitlines() on line 1184. It results in a nice clean list:

("Line1", "Line2", "Line3")

If there's a better way I would be open to trying again, but I think we can agree that the current behavior is not correct.

alerque commented 4 years ago

Why is the case of one tab being treated specially here? Why is whitespace internal to a free-form content string being munged at all? If this is to normalize indentation shouldn't all strings be parsed to find the shortest whitespace prefix first and then replace all occurrences of that whitespace string at the start of each line with the desired normalized indent?

colinmford commented 4 years ago

@alerque This test appears to making sure that white space isn't removed here, and I didn't want to contradict it.

If the decision is made that this is no longer expected behavior, then I would be totally ok with removing all whitespace prefixes. As a first-time contributor I didn't want to make that assumption though.

https://github.com/unified-font-object/ufoNormalizer/blob/5b67c7aa760ef66748badc5f208a7d1aab9a0c69/tests/test_ufonormalizer.py#L831-L837

colinmford commented 4 years ago

@alerque Ah, ok I think I take your meaning.

textwrap.dedent() does what you're talking about, and fixes my issue, but it seems to interfere with the test I posted above... (there's 3 spaces* at the start of Line1, and 4 at the start of Line3, so it leaves Line3 with an indent of 1 space, instead of all 4 in the test):

textwrap.dedent("   Line1  \t\n\n    Line3\t  ")
>>> 'Line1  \t\n\n Line3\t  '

Maybe that test needs to be reconsidered? Or we might need a custom function based on dedent.

colinmford commented 4 years ago

Since @moyogo wrote that test (according to Blame), maybe he can weigh in on what that test is specifically testing for?

moyogo commented 4 years ago

This was checking the behaviour of the ufoNormalizer at that time. If the behaviour changes, so should the test.

colinmford commented 4 years ago

@moyogo I don't really want to change other Normalizer behavior so I would like to keep all tests as-is as much as possible, but if you think that test is outdated we should discuss how it should change... but to rephrase my question:

  1. Is it possibly a mistake that in https://github.com/unified-font-object/ufoNormalizer/blob/5b67c7aa760ef66748badc5f208a7d1aab9a0c69/tests/test_ufonormalizer.py#L832 the first line has 3 spaces and the other has 4? i.e. were you intending to test for a 4-space indention vs tabs, and one of the spaces accidentally was removed?
  2. Is it the intention of this test that all whitespace prefixing all lines beyond the first line be kept intact?

I would really like to come up with an answer to this issue that works for everyone!

moyogo commented 4 years ago

@colinmford The behaviour of the normalizer needs to change since it changes notes that have already been normalized. I think the difference in the number of space was supposed to replicate the case where a user adds 3 spaces at the beginning of their note.

benkiel commented 4 years ago

My opinion is that it should keep those three spaces, as they may have semantic meaning to the designer, so I would do the least invasive thing —that you did first— or managing the extra tabs, I think that's the right solution to this, as it's a narrow, specific issue. Does that make sense @moyogo?

benkiel commented 4 years ago

@colinmford I would just test for the case of a line with a user added tab, so something like:

<note>
        Line1
            Line2
        Line3
</note>

The normalizer should keep the tab of Line2

colinmford commented 4 years ago

Hi @benkiel @moyogo

I updated the method and now all tests are passing (including the additional test case you mentioned @benkiel, and a couple other related ones).

More info in #72.

Let me know if you think it's suitable!

colinmford commented 4 years ago

Hi @benkiel @moyogo please don't forget to review #72, this is a really annoying issue I would like to fix (or is there someone else that should review it?)