scalameta / scalafmt

Code formatter for Scala
http://scalameta.org/scalafmt
Apache License 2.0
1.44k stars 276 forks source link

Scalafmt accepts only space as whitespace in comments, trailing, and pipe #1418

Closed LeeTibbert closed 4 years ago

LeeTibbert commented 5 years ago

From examining the regex patterns used by scalafmt, approx May 1, 2019, it appears that scalafmt accepts only spaces/blanks as whitespace in three places. The first is comments before the asterisk, the second is in whitespace before pipe characters in stripMargin blocks and the third is in trailing whitespace.

I believe that accepting all horizontal whitespace is a more current behavior. That would mean that tabs, Unicode whitespace, and a couple of other whitespace characters would be accepted. I have supplied a regex containing what I consider a good candidate list privately.

After this fix trailing tabs and leading Ogham space marks will all be stripped/replaced.

Please contact me if you need more details.

tanishiking commented 5 years ago

@LeeTibbert Thank you for reporting! To make the problem clearer, could you follow the ISSUE_TEMPLATE please :) ?

ekrich commented 4 years ago

@LeeTibbert Did you want to follow up on this issue as requested above?

kitbellew commented 4 years ago

@LeeTibbert you mentioned that you supplied a regex but not here. does yours match the change in #1911?

LeeTibbert commented 4 years ago

@kitbellew Thank you for asking my thoughts. I have been taking a well deserved overnight of rest from having my brain scrambled by other code and just come to this.

I do not know if the issue is moot or not, seeings as the underlying PR has been accepted. That PR has the virtue of being done and not waiting for me to get even a few centimeters down my urgent pile. I also do not know if there is a better place to discuss my concerns. Please advise.

Since the issue might be moot, I will limit my reply here to the top two issues. Commenting on regex expressions is like commenting on the beauty or lack thereof of someone else's dog: Best not done if you want to remain invited to polite society.

1) I found the top of the parabola of my previous work. Each of the three Patterns has similar issues, which I can discuss if invited. This is the form I recommended for one:

val leadingAsteriskSpace = Pattern.compile("^\\s*\\*", Pattern.MULTILINE)

At the time, I did not add UNICODE because I was trying to be conservative & minimal change in a Proof of Concept environment. The regex here is way simpler than the regex in the PR. It is easier to comprehend and probably also faster at runtime. I have not tried with the UNICODE flag on, but would expect it to work and have only the concern of no experience. I found some of the Unicode white space characters tricky, "\s" (or, depending on context, its double backslash variant) seemed to work best for me.

kitbellew commented 4 years ago

@LeeTibbert consider yourself invited.

The reason the other regex is more complex is that for MULTILINE, \s matches newline which is undesirable in this case.

LeeTibbert commented 4 years ago

I do not know if support for Scala Native is still a goal for Scalafmt. I know, but have not tried, that recent version(s) of Scalafmt support Graal. (BTW, I love the scalafmt plugin, saves me bunches of frustration and eases my work flow).

If eventual SN support is a goal, the positive lookahead operator "(?=" is not supported in Scala Native and will never be, for most values of never. Positive lookahead is also expensive at runtime when a simple match against a known character is possible.

I think SN support is "shovel ready", probably needs build support. It would be a pity to introduce known breakage.

LeeTibbert commented 4 years ago

OK, I will dive in and spend some time. I suggest that we focus on one regex at a time, even though they are related. re:

The reason the other regex is more complex is that for MULTILINE, \s matches 
newline which is undesirable in this case.

I'll study my test cases and the test cases from the underlying PR.

LeeTibbert commented 4 years ago

Java 8 & Java 14 define horizontal whitespace as

'[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]' // note the blank. Not all regex engines have '\h' but the JVM does.

The Pattern.UNICODE_CHARACTER_CLASS pretty much hoses SN, so you could probably go with '\h'. At the very least, you could use the explicit value above. Given that it wired into Java for many versions, it is unlikely to change, even though the Unicode people like to drop new characters in. (There is a \p(Zmumble) which is safer for Unicode. I think but do not know, that Java supports it.)

kitbellew commented 4 years ago

Java 8 & Java 14 define horizontal whitespace as

'[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]' // note the blank. Not all regex engines have '\h' but the JVM does.

@poslegm can we assume we are Java 8 or above? (\h is definitely cleaner than what we used.) In Java 7, \h is not supported: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#jcc

The Pattern.UNICODE_CHARACTER_CLASS pretty much hoses SN, so you could probably go with '\h'.

What's "SN"?

LeeTibbert commented 4 years ago

The horizontal whitespace idiom above should allow leadingAsteriskSpace to become (untested) s"^${horizontalWhitespaceClass}*\\*)

It should drop into trailingSpace and something similar should work for compileStripMarginPattern.

Unless there is some filtering upstream, I think compileStripMarginPattern has a bug where a user could specify a character which would break the regex.

LeeTibbert commented 4 years ago

SN is my being lazy and not typing out Scala Native (or scalanative, or other variants in wide use).

LeeTibbert commented 4 years ago

The reason that I suggest removing the positive lookahead "(?=)" and simplifying (making dead stupid) the three regexen is that some poor, probably overworked and in a rush, soul is going to have to read this on the fly in the future. With git blame around, the original devos may have to rack their brains to understand the regex. I can speak from sad experience it is humbling to realize on the spot that one is not as clever as one was a year or five ago because one can not suss one's own regex ;-)

ekrich commented 4 years ago

@kitbellew As a background, I ported lightbend/config to Scala to support scalafmt on encouragement from @olafurpg . It is now ekrich/sconfig. Then a ton of other work was done by @LeeTibbert , @olafurpg , @densh and myself. scalafmt was to be a SN killer app and stress test. This was largely before Graal became a significant thing.

LeeTibbert commented 4 years ago

I corrected a typo in the suggested regex I posted above. The typo made Markdown mangle it so that is was obviously incorrect and just plain stupid. I like to earn my stupidity for myself and not have Markdown hand it to me.

LeeTibbert commented 4 years ago

If there were any hope of Scalafmt supporting Scala Native, I could probably take a look at adding "\h" and its dual "\H" to ScalaNative. That support is in the priority queue, but far back, looking for an identified user (or, more likely & rightly, complainant).

kitbellew commented 4 years ago

The reason that I suggest removing the positive lookahead "(?=)" and simplifying (making dead stupid)

if we include the asterisk in the pattern, we'd have to put it back in the replacement string. positive lookahead is not very common, of course, but it's easy to look up in the java documentation.

i am not sure why the pattern doesn't want two consecutive asterisks, but that's what it was before.

as to \h: if SN supports [^\S\r\n] then we might as well keep it. the regex is not as short as \h but i would actually say that for \h i would have to look up documentation while [\S\r\n] are probably as close to common knowledge as regex can get.

i can't comment on scalafmt supporting SN (that, and alignment, are outside of my expertise).

ekrich commented 4 years ago

Talking to @olafurpg previously, he was waiting for a true 0.4.0 production release but it was still in the plans. It would be a really big win for Scala Native to support scalafmt as a native application. @LeeTibbert 's regex work was added for the 0.4.0 release and the performance was greatly increased.

poslegm commented 4 years ago

@poslegm can we assume we are Java 8 or above?

It's not documented but scalafmt hasn't CI checks for older version of Java. So, I can't even say that it works with them at all

ekrich commented 4 years ago

I would say Java 8 is the minimum since it is required for Scala 2.12 and above.

LeeTibbert commented 4 years ago

positive lookahead is, I believe, more than expensive at runtime. With regex it is exceedingly easy to burn CPU cycles. A simple match against an asterisk is fast. I would have to study the scalafmt code but I think it did put the asterisk (or pipe, in that section) back.

The "\h" is way simpler than a bunch of negations and double negations "^\S". Anybody trying to read this stuff is going to have a regex translator open in the browser anyway.

I am not trying to be a pain: Your house, your choice. I mean that in a supportive way.

LeeTibbert commented 4 years ago

The Pattern.UNICODE_CHARACTER_CLASS makes ScalaNative support difficult/hard.
For now, I suggest letting the base PR rest. There is a certain virtue to done.

When support for Scala Native comes up for discussion (It would be great to push it over the goal line.) I do some private design studies. Given the known set of characters, I think the Pattern.UNICODE_CHARACTER_CLASS can be dropped and the match done against the explicit set of characters. "\h" might be sensitive to the flag so it might not be a good candidate.

Let me know if I should spend the time. I have not built Scalafmt on Scala Native for almost a year, it might not still build.

kitbellew commented 4 years ago

@LeeTibbert frankly, i'd prefer to fix it now; if you believe unicode is not a good idea, let's remove it now, before anybody starts using it.

perhaps, i'd even ask you to submit a pr yourself, so that whatever you describe (including the lookahead optimization) can be expressed as code directly.

LeeTibbert commented 4 years ago

@kitbellew You are right, code on the page will make for easier discussion & review. Thank you for paying attention to and responding kindly to my concerns.

LeeTibbert commented 4 years ago

For future reference, I found the repository that had the regexen I had recommended at the time. I had sent that URL to a scalafmt devo as part of a long series of correspondence. My use of "I supplied" in the base topic was misleading English. The passage of time has shown that I could have been clearer to devos picking up the issue. I plead only the End of Sprint fatigue of the original work.

Supplying the entire repository gave more context than could be given in an issue: real working shovel ready code.