Dual string interpolator not processing strings with escaped characters

rminehan commented 8 months ago

Reproduction steps

Scala version: 2.13 onwards (including scala 3) for Java 8.

scala> val s"Hello\t$name" = "Hello\tJames"
scala.MatchError: Hello James (of class java.lang.String)
  ... 32 elided

scala> val s"Hello\n$name" = "Hello\nJames"
scala.MatchError: Hello
James (of class java.lang.String)
  ... 32 elided

Problem

In both cases, the expected behavior is that name would be bound to "James", but instead a MatchError is thrown.

See also original SO post where this was raised. This happens when other escaped characters are involved like "\", "\"", "\r".

Background

This dual interpolation feature was introduced into scala 2.13 by PR 7387.

It added an s.unapply method for destructuring strings based on interpolation patterns. The underlying implementation for that is the glob method which I analyze below.

object s {
  ...
  def unapplySeq(s: String) = StringContext.glob(parts, s)
}

object StringContext {
  ...
  def glob(patternChunks: Seq[String], input: String): Option[Seq[String]] = ...
}

Investigation

I have done some preliminary investigation and I believe the root cause is that the parts produced in the examples above are representing an escaped character as two separate characters, e.g. a tab character becomes a backslash and a "t". But the glob function introduced in the PR

For example code like this:

val s"$a\t$b" = "0\t1"

leads to a StringContext being constructed with parts:

Seq(
  "", // empty string for the text prior to "$a"
  "\\t", // string of length 2, a backslash, then a 't'
  "", // empty strong for the text after "$b"
)

The glob method is effectively called with this data:

glob(Seq("", "\\t", ""), "0\t1")
// returns None

Because it returns None, the s.unapplySeq returns None and the pattern match fails leading to a MatchError.

I verified this using a debugger and setting a breakpoint inside the glob method.

When glob is passed the expected data, a Some is returned as expected:

glob(Seq("", "\t", ""), "0\t1")
// Option[Seq[String]] = Some(value = ArraySeq("0", "1"))

So I think the "bug" is either:

StringContext parts aren't being produced correctly with regards to escaped characters
StringContext parts are being produced correctly with regards to escaped characters, but glob incorrectly assumes they'll be a single character

Other notes

The behavior appears to be consistent for any scala version from 2.13+ and the java version isn't relevant.

On my SO post, I asked if this feature has an official name. That would be helpful going forward for documenting this and doing searches. For now I'm calling it the "dual string interpolator".

som-snytt commented 8 months ago

Another view from REPL

scala> val s"Hello\t$name" = "Hello\tJames" // print

val name: String = ("Hello\tJames": @scala.unchecked) match {
  case scala.StringContext.apply("Hello\\t", "").s.unapplySeq((name @ _)) => name
}

That much is per spec.

The expansion of parts is shown there in triple-quotes.

The extractor calls itself "the simple string matcher". Maybe that is better than "dual". Or perhaps it is a "duel" to see which is simpler.

Perhaps it was an inadvertent omission.

I neglected to mention for the public record that it is clearly the "standard" interpolator.

The dual is the standard extirpolator.

Edit: the jokey "extirpolator" sounded funnier at midnight. Perhaps we are stuck with "extrapolator" as a canonical label.

SethTisue commented 8 months ago

@rminehan thanks for the thorough report. re: naming, one approach is to just call it "thes interpolator" 🤷 At some level it's just a method and not a "feature"; string interpolation is the feature, individual interpolators are methods.

som-snytt commented 8 months ago

The obvious follow-up will be the raw"" extrapolator, when library evolution permits.

The less obvious follow-up would be the f"" extrapolator for typesafe scanf.

rminehan commented 8 months ago

Hi @som-snytt and @SethTisue , thanks for the replies.

@som-snytt

That much is per spec.

The expansion of parts is shown there in triple-quotes.

I want to clarify I have understood you correctly. You're saying that the formal spec dictates that a code snippet:

id"text0{ pat1 }text1 … { patn }textn"

is transformed to:

StringContext("""text0""", …, """textn""").id(pat1, …, patn)

hence in your example:

 val s"Hello\t$name" = "Hello\tJames"

the s"Hello\t$name" leads to a context:

StringContext.apply("""Hello\t""", """""")
// equivalent to
StringContext.apply("Hello\\t", "")

So the tab character is effectively split into two runtime characters because it is represented as two separate characters in the original source code.

So these are roughly equivalent programs:

// Original code
val s"Hello\t$name" = "Hello\tJames"

// Equivalent expanded program
val interpolator = scala.StringContext("""Hello\t""", """""").s

val name = "Hello\tJames" match {
  case interpolator(x) => x
}

// scala.MatchError: Hello James (of class java.lang.String)

If my understanding is correct, then that would mean that the "bug" is in the implementation of the glob method. It should be handling cases where there's backslashes in strings from patternChunks.

But do you think that when the original spec was written, it had taken into consideration escaped characters? I worry that it was not considered.

It feels more natural to me that the handling of backslash characters would be done when the string context is created, such that the parts in the context capture escaped characters as a single runtime character. If extractors are made for other interpolators (e.g. f""), then that implementation will have to handle the same problem.

But there might be other cases I'm not aware of, e.g. impacts on raw"".

If you guys think that the root cause of the issue is in the glob method, then I can look into fixing it. That's a pretty self-contained task and it's easy to test and verify.

som-snytt commented 8 months ago

Thanks, I already submitted a PR, see "May be fixed by", above. I did not intend to steal work from you. You can contribute further by reviewing whether the fix makes sense to you.

I added escape processing before the call to glob, which may be used by clients which don't want escapes processed.

How to handle escapes is up to the interpolator, which can have arbitrary behavior.

rminehan commented 8 months ago

Hi @som-snytt , sorry I didn't see your message earlier, thanks for much for fixing it!

scala / bug