Odd behaviour with heredoc last character

KyleKotowick commented 2 years ago

Terraform Version

Terraform v1.1.7
on linux_amd64

Terraform Configuration Files

Note: Although Terraform is running on Linux, this config file was created on Windows, so each new line in the file is \r\n.

locals {
  test_heredoc = <<EOT
content
EOT

  // Get the last character of the heredoc string
  last_heredoc_character = substr(local.test_heredoc, -1, -1)

  // We use these for comparison
  newline         = "\n"
  carriage_return = "\r"
}

// This gives the expected output
output "carriage_return_json" {
  value = jsonencode("\r")
}
// This gives the expected output
output "carriage_return_length" {
  value = length(local.carriage_return)
}
// This gives the expected output
output "newline_json" {
  value = jsonencode("\n")
}
// This gives the expected output
output "newline_length" {
  value = length(local.newline)
}
// This makes no sense at all
output "last_heredoc_character" {
  value = jsonencode(local.last_heredoc_character)
}
// This gives the expected output, but it doesn't make sense when we look at the `last_heredoc_character` output
output "last_heredoc_character_length" {
  value = length(local.last_heredoc_character)
}

Debug Output

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

carriage_return_json = "\"\\r\""
carriage_return_length = 1
newline_json = "\"\\n\""
newline_length = 1
last_heredoc_character = "\"\\r\\n\""
last_heredoc_character_length = 1

Expected Behavior

The last_heredoc_character should be a single character

Actual Behavior

The last_heredoc_character has a length of 1, but is actually 2 characters (\r\n)?

Steps to Reproduce

Create the config file on a Windows computer, or use an editor that supports CRLF.
Copy the file to a Linux computer.
terraform init
terraform apply

Additional Context

I'm baffled by this. Somehow, selecting the last character of the heredoc returns a string with a length of 1, but which jsonencodes into a string of two characters. And somehow, a newline (\n) has a length of 1 and a carriage return (\r) has a length of 1, but the combination of both of them (\r\n) also has a length of 1?

Note: Although Terraform is running on Linux, the config file was created on Windows (which is why it contains \rs).

@apparentlymart This seems like the kind of bizarre behaviour that you enjoy hunting down.

KyleKotowick commented 2 years ago

Here's a simpler example that shows some oddness:

output "cr_length" {
  value = length("\r")
}
output "lf_length" {
  value = length("\n")
}
output "crlf_length" {
  value = length("\r\n")
}

cr_length = 1
lf_length = 1
crlf_length = 1

I believe this goes against Terraform's documented behaviour, which states that it uses UTF-8 encoding for everything, and UTF-8 encoding considers \r and \n to be two separate characters (U+000D and U+000A). By this logic, the value of crlf_length should be 2 instead of 1.
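
For comparison with the expectation above, here is a minimal Python sketch that counts "\r\n" by code point and by UTF-8 byte. Python is used only for illustration: its len counts code points, so this models the expected value of 2, not what Terraform's length reported.

```python
# Count "\r\n" two ways: by Unicode code point and by UTF-8 byte.
crlf = "\r\n"

# Each element of a Python str is one code point.
code_points = [f"U+{ord(ch):04X}" for ch in crlf]
utf8_bytes = crlf.encode("utf-8")

print(code_points)      # ['U+000D', 'U+000A']
print(len(crlf))        # 2 code points
print(len(utf8_bytes))  # 2 bytes
```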

apparentlymart commented 2 years ago

Hi @KyleKotowick!

I think what you have here is an example of how, when it comes to Unicode, nothing is as simple as it first appears. The Terraform language defines a string as a sequence of what Unicode Standard Annex #29 calls Grapheme Clusters, which the specification introduces as follows:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

I think what you've encountered here is rule GB3 from that specification: CR followed by LF is counted by Unicode as a single grapheme cluster. I would agree that this seems a little debatable, since I expect most users don't perceive a "line break" as a character at all, but I assume the rationale here is that those two code points together produce only a single newline and therefore if you do consider "start a new line" as being a character then CR+LF together represent that character.
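
Rule GB3 can be sketched in a few lines. The Python below is a deliberately simplified illustration, not Terraform's implementation: the function name split_clusters_gb3_only is hypothetical, and it applies only GB3, treating every other code point as its own cluster, which the full UAX #29 algorithm does not do (combining marks, emoji sequences and so on are ignored here).

```python
def split_clusters_gb3_only(text):
    """Split text into clusters, keeping CR+LF together (rule GB3 only).

    Assumption: every other code point forms its own cluster. This is a
    simplification of UAX #29, not a full grapheme cluster segmenter.
    """
    clusters = []
    i = 0
    while i < len(text):
        if text[i] == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            # GB3: do not break between CR and LF.
            clusters.append("\r\n")
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters


print(len(split_clusters_gb3_only("\r")))    # 1
print(len(split_clusters_gb3_only("\n")))    # 1
print(len(split_clusters_gb3_only("\r\n")))  # 1
```

Under this model the three counts match the cr_length, lf_length and crlf_length outputs reported earlier in the thread.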

Since we're software engineers rather than linguists we typically defer to Unicode for the finer details of how to define these concepts, and so the Terraform language uses exactly the grapheme cluster segmentation algorithm from that specification, including this interesting rule GB3. Therefore I think what you observed here is the Terraform language behaving as designed, but as usual with Unicode there's some considerable additional subtlety beyond what we might expect from a straightforward mental model of text encodings.

The documentation section you linked to is intended to describe how Terraform parses the source code, rather than how Terraform manipulates string values at runtime. I don't think the documentation gets into a lot of detail about the subtleties of UAX 29 because we've typically assumed that those details are not important in most situations. However, you have indeed found a situation here where these subtleties are important, and it looks like the length function is already documented for strings by reference to UAX 29. The substr function doesn't echo that reference in its own documentation, but it's designed to be consistent with all of the other string manipulations Terraform supports; they should all agree on the definition of "character", using the UAX 29 definition.
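
To see how a cluster-based substr and length would agree with each other, here is a rough Python model of the original report. It is a sketch under stated assumptions, not Terraform's code: the helper clusters is hypothetical and keeps only CR+LF together, and the heredoc value "content\r\n" is inferred from the reported output rather than taken from Terraform itself.

```python
import json


def clusters(text):
    # Hypothetical helper: keeps "\r\n" together, and treats every other
    # code point as its own cluster (a simplification of UAX #29).
    out, i = [], 0
    while i < len(text):
        step = 2 if text[i:i + 2] == "\r\n" else 1
        out.append(text[i:i + step])
        i += step
    return out


# Assumed heredoc value when the config file uses CRLF line endings.
heredoc = "content\r\n"

# Model of substr(..., -1, -1): take the last cluster, not the last code point.
last = clusters(heredoc)[-1]

print(json.dumps(last))     # "\r\n"
print(len(clusters(last)))  # 1 cluster
print(len(last))            # 2 code points

# Normalising line endings first leaves a single-code-point last cluster.
normalized_last = clusters(heredoc.replace("\r\n", "\n"))[-1]
print(json.dumps(normalized_last))  # "\n"
```

In this model the last "character" is one cluster that still encodes as two code points, which is the combination seen in the last_heredoc_character and last_heredoc_character_length outputs.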

KyleKotowick commented 2 years ago

Thanks @apparentlymart, this clarifies the issue.

