Transparently transforming encoding of lf inside strings can lead to subtle bugs

request-info[bot] commented 4 years ago

This issue has either a default title or empty body. We would appreciate it if you could provide more information. Note: I am not a very intelligent bot, I can only react to new comments. Please add a comment for me if you update the body or title.

Ducasse commented 4 years ago

running this test on mac, linux on travis is passing

testCharLf

self assert: Character lf asString equals: (String streamContents: [:s | s lf ])

running this test on mac, linux on travis is failing

testCharLf2 "this test is special because it embedds a string that contains a lf"

self assert: Character lf asString equals: "(String streamContents: [:s | s lf ])" '

'

S

PharoProject commented 4 years ago

No, new lines in source code are CRs, not LFs, this has always been the case.

' ' first = Character cr

=> true

On 8 Dec 2019, at 10:30, StéphaneDucasse notifications@github.com wrote:

running this test on mac, linux on travis is passing

testCharLf

self assert: Character lf asString equals: (String streamContents: [:s | s lf ])

running this test on mac, linux on travis is failing

testCharLf2 "this test is special because it embedds a string that contains a lf"

self assert: Character lf asString equals: "(String streamContents: [:s | s lf ])" '

'

S

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

Ducasse commented 4 years ago

Please read what I wrote. Of course the editor/compiler encodes Enter as Character cr but this is clearly not what I meant.

Ducasse commented 4 years ago

as the second test is showing it if we programmatically have a lf in a string because for example you want to test that you generated correctly lf in the resulting string then only on travis (and not locally on my machine) this test is failing.

Ducasse commented 4 years ago

The second test is not produced with an enter key in the text editor but with (String streamContents: [:s | s lf ]) printString.

Ducasse commented 4 years ago

So do you see now the difference between the two tests? a Character lf produced but not stored in a string is handled well - first test. and if we convert by printing the string then it is not working. When you inspect that string it contains lf!

Ducasse commented 4 years ago

testCharLf2 "this test is special because it embedds a string that contains a lf"

| buggyButcontainingLF | 
"(String streamContents: [:s | s lf ]) to produce the string buggyButcontainingLF below"
buggyButcontainingLF := '

'. self assert: (buggyButcontainingLF at: 1) equals: Character lf. self assert: Character lf asString equals: ' '

https://travis-ci.com/Ducasse/Descent/jobs/264819527

Travis CI - Test and Deploy with Confidence
Travis CI enables your team to test and ship your apps with confidence. Easily sync your projects with Travis CI and you'll be testing your code in minutes.

Ducasse commented 4 years ago

On my machine this test passes perfectly but it fails on travis. May be this is an unzip problem but this is scary and strange.

svenvc commented 4 years ago

it is still not 100% clear to me how you produce the problematic string.

you say 'do x to produce the string', but where ? in a browser on your machine ? how does is end up in travis CI ? via committed code ? Tonel code ?

If you enter an LF in source code, you get a 'method source contains linefeed' code critic, because this is a bad idea.

I assume that Tonel turns all line endings into CR when exporting.

See TonelWriter>>#splitMethodSource:into:

at the end it says

source upToEnd withLineEndings: self newLine

which basically does a conversion (but to a platform dependent one, CR on mac).

I can't immediately find what Tonel does when reading, but by then you already lost your LF

Now, the critic rule is there for a reason, it is a bad idea to depend on specific line endings in literal strings.

Ducasse commented 4 years ago

When I edited the code while writing my test I just printed the expression and use the result. Explain me then how I can test a pretty printer that produces lf and not cr. Yes I can use Character lf asString and concatenate little strings. I can also recode the pretty printer manually. But to me this is bug! the end of line inside strings should not changed magically. I should be able to code towards a specific platform. Then there is still a problem because my "bad" code is working on my machine when I load my code on Mac OS X but it is not working on Mac OS X on travis.

Ducasse commented 4 years ago

And here I'm not talking about code of the method but encoding inside strings inside a method. So to me encoding inside strings should not be modified.

Ducasse commented 4 years ago

Guille pablo I would like to discuss with you next time we see each others.

Ducasse commented 4 years ago

Normally I loaded these tests (because I'm working on another project) so I do not see the difference but I'm trying again with a fresjh 8

Ducasse commented 4 years ago

But I was wrong. Apparently testLf2 is wrong on my machine and I prefer this. So indeed Tonel is probably changing the code and to me this is not good.

svenvc commented 4 years ago

On 8 Dec 2019, at 12:41, StéphaneDucasse notifications@github.com wrote:

Explain me then how I can test a pretty printer that produces lf and not cr.

You failed to mention this.

Have a look at STONWriter>>#newLine: and its senders in STON tests.

It is perfectly possible to generate textual output that uses different line end conventions and still write tests for them. I do this for NeoCSV, NeoJSON and others.

I don't see the issue: you should not depend on newlines being preserved inside strings in source code - formatters can/will change them, Tonel does, maybe it happens in other places. And one day, we will probably have to replace CR in the whole image anyway (because it is no longer the default, even on macOS where it came from).

So basically, from everything you wrote, I see no issue. Though I admit is might be surprising, but there is the code critic rule to help you.

svenvc commented 4 years ago

BTW, here is another point:

You can use #lines to parse any string into lines, regardless of the line end convention used.

So basically you could do something like

String lf join: (ZnClient>>#execute) sourceCode lines.

this gives you LF source code instead of CR, regardless of what was used originally.

HTH

Ducasse commented 4 years ago

I understand your point. In general this is better to rely on explicit behavior versus implicit encoding inside a string. I changed my tests already to handle this. What was super strange is that I loaded my code and my tests where green, then I modified them to get explicit and to build hypotheses on why it was breaking on travis. Now I do not get why my tests did not break the first time. I will reload an old version to check. Because loading the specific tests I wrote to identify the problem break (hopefully) on my machine.

Ducasse commented 4 years ago

I checked and loading an old version is breaking too so this is good. Now while I understand that explicit is better. I would prefer that the code encodings and in particular the string encodings is not changed during fileout. I prefer garbage in, garbage out. But may be I wrong. For pillar we were discussing with damien that you may want to validate the output of windows encodings but running on a unix system.

Ducasse commented 4 years ago

I read

testConvertingNewLines | input result output | input := '''line ending with CR', String cr, 'line ending with LF', String lf, 'line ending with CRLF', String crlf, ''''. output := 'line ending with CR', String crlf, 'line ending with LF', String crlf, 'line ending with CRLF', String crlf. result := (STON reader on: input readStream) newLine: String crlf; convertNewLines: true; next. self assert: result equals: output. output := 'line ending with CR', String cr, 'line ending with LF', String cr, 'line ending with CRLF', String cr. result := (STON reader on: input readStream) newLine: String cr; convertNewLines: true; next. self assert: result equals: output

and yes I was doing the same once I understood the problem.

Ducasse commented 4 years ago

String lf join: (ZnClient>>#execute) sourceCode lines. is nice now since I'm explicitly using lf in my pretty printer, I'm just emitting lf

Ducasse commented 4 years ago

splitMethodSource: aMethodDefinition into: aBlock | keywords source declaration |

keywords := aMethodDefinition selector keywords.
source := aMethodDefinition source readStream.
"Skip spaces"
(source peek isSeparator) ifTrue: [ self skipSeparators: source ].
"Skip comments"
(source peek = $") ifTrue: [ self skipComment: source ]. 
"Parse declaration"
declaration := String new writeStream.
[ (self selectorIsComplete: keywords in: declaration originalContents) not 
    or: [ ':+-/\*~<>=@,%|&?!' includes: declaration contents trimRight last ] ]
whileTrue: [ 
    "stop infinite loop if no match was found"
    source atEnd ifTrue: [ TonelWriteError signal: 'Cannot find selector in source for ', aMethodDefinition asString ].
    "get separators"
    [ source atEnd not and: [ source peek isSeparator ] ]
        whileTrue: [ declaration nextPut: source next ].
    "take next word until we find a separator or a dot"
    [ source atEnd not and: [ source peek ~= $. and: [ source peek isSeparator not ] ] ]
        whileTrue: [ declaration nextPut: source next ] ].
aBlock 
    value: (declaration contents trimLeft withLineEndings: self newLine)
    value: (source upToEnd withLineEndings: self newLine)

I do not see where strings are converted too. But now I understand that this is an corner case and I prefer that. I should have reloaded my code first.

Sometimes I would really like to have a mode where we can see invisible character in strings and in the code. It would make things simpler to understand.

Ducasse commented 4 years ago

Anyway thanks for your time.

kasperosterbye commented 4 years ago

There is even a rule in the code smell detector which will warn against line feed inside the code (RBMethodSourceContainsLinefeedsRule). From the above discussion it would seem to be obsolete.

Ducasse commented 4 years ago

I do not think. This is not because there is a rule that we should not think that it is strange to change string (and not source code) encodings.

pharo-project / pharo

Transparently transforming encoding of lf inside strings can lead to subtle bugs #5334