Closed KyleKotowick closed 2 years ago
Here's a simpler example that shows some oddness:
output "cr_length" {
  value = length("\r")
}
output "lf_length" {
  value = length("\n")
}
output "crlf_length" {
  value = length("\r\n")
}

cr_length = 1
lf_length = 1
crlf_length = 1
I believe this goes against Terraform's documented behaviour, which states that it uses UTF-8 encoding for everything, and UTF-8 encoding considers \r and \n to be two separate characters (U+000D and U+000A). The value of crlf_length should be 2 instead of 1 by this logic.
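To make the expectation above concrete, here is a small Python sketch (not Terraform code) that counts code points and UTF-8 bytes for the three strings. The helper name `describe` is made up for this illustration. It shows the code-point view of the strings, which is the model the question assumes.

```python
def describe(text):
    """Return (code point count, code point labels, UTF-8 byte count)."""
    labels = [f"U+{ord(ch):04X}" for ch in text]
    return len(text), labels, len(text.encode("utf-8"))


for name, text in {"cr": "\r", "lf": "\n", "crlf": "\r\n"}.items():
    print(name, describe(text))
# crlf is reported as 2 code points (U+000D, U+000A) and 2 UTF-8 bytes
```

Under a pure code-point model, "\r\n" has length 2, which is the result the question expected from Terraform.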
Hi @KyleKotowick!
I think what you have here is an example of how when it comes to Unicode nothing is as simple as it first appears. The Terraform language's definition of strings is a sequence of what the Unicode Standard Annex #29 calls Grapheme Clusters, which they introduce in the specification as follows:
It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
I think what you've encountered here is rule GB3 from that specification: CR followed by LF is counted by Unicode as a single grapheme cluster. I would agree that this seems a little debatable, since I expect most users don't perceive a "line break" as a character at all, but I assume the rationale here is that those two code points together produce only a single newline and therefore if you do consider "start a new line" as being a character then CR+LF together represent that character.
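As a rough illustration of rule GB3 only, the following Python sketch splits a string into clusters, keeping CR followed by LF together and treating every other code point as its own cluster. That is a deliberate simplification: the full UAX #29 algorithm has many more rules (combining marks, emoji sequences, Hangul, and so on), and this is not how Terraform itself is implemented. The function name `gb3_clusters` is hypothetical.

```python
def gb3_clusters(text):
    """Split text into clusters, applying only rule GB3 (CR followed by LF stays together).

    Simplification: every other code point becomes its own cluster,
    which is NOT the complete UAX #29 segmentation algorithm.
    """
    clusters = []
    i = 0
    while i < len(text):
        if text[i] == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            clusters.append("\r\n")  # GB3: do not break between CR and LF
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters


for sample in ("\r", "\n", "\r\n", "\n\r"):
    print(repr(sample), len(gb3_clusters(sample)))
```

Under this rule "\r", "\n", and "\r\n" each count as one cluster, while the reversed pair "\n\r" counts as two, because GB3 only joins CR followed by LF.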
Since we're software engineers rather than linguists we typically defer to Unicode for the finer details of how to define these concepts, and so the Terraform language uses exactly the grapheme cluster segmentation algorithm from that specification, including this interesting rule GB3. Therefore I think what you observed here is the Terraform language behaving as designed, but as usual with Unicode there's some considerable additional subtlety beyond what we might expect from a straightforward mental model of text encodings.
The documentation section you linked to is intended to describe how Terraform parses the source code, rather than how Terraform manipulates string values at runtime. I don't think the documentation gets into a lot of detail about the subtleties of UAX 29 because we've typically assumed that those details are not important in most situations. However, you have indeed found a situation here where these subtleties are important, and it looks like the length function is already documented for strings by reference to UAX 29. The substr function doesn't echo that reference in its own documentation, but it's designed to be consistent with all of the other string manipulations Terraform supports; they should all agree on the definition of "character", using the UAX 29 definition.
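The consistency described here can be sketched in Python. The helpers `clusters` and `cluster_substr` are hypothetical stand-ins that loosely mimic a cluster-based length and substr. They apply only the CR+LF rule rather than full UAX #29, and the sample heredoc body is invented, since the original configuration is not shown in this thread.

```python
import json


def clusters(text):
    """Split text into clusters using only the CR+LF rule (simplified, not full UAX #29)."""
    result = []
    i = 0
    while i < len(text):
        step = 2 if text[i:i + 2] == "\r\n" else 1
        result.append(text[i:i + step])
        i += step
    return result


def cluster_substr(text, offset, length):
    """Hypothetical substring that counts offset and length in clusters, not code points."""
    return "".join(clusters(text)[offset:offset + length])


# Assumed sample: a heredoc body saved with Windows (CRLF) line endings.
heredoc = "hello\r\n"
last = cluster_substr(heredoc, len(clusters(heredoc)) - 1, 1)

print(len(clusters(last)))  # cluster-based length of the last "character"
print(json.dumps(last))     # JSON escaping exposes both code points
```

With both operations counting clusters, the last "character" of the sample is the whole "\r\n" pair: its cluster length is 1, yet JSON encoding writes out two escape sequences, which matches the behaviour reported in the issue.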
Thanks @apparentlymart, this clarifies the issue.
Terraform Version
Terraform Configuration Files
Note: Although Terraform is running on Linux, this config file was created on Windows, so each new line in the file is \r\n.
Debug Output
Expected Behavior
last_heredoc_character should be a single character
Actual Behavior
last_heredoc_character has a length of 1, but is actually 2 characters (\r\n)?
Steps to Reproduce
CRLF.
terraform init
terraform apply
Additional Context
I'm baffled by this. Somehow, selecting the last character of the heredoc returns a string with length of 1, but which jsonencodes into a string of two characters. And somehow, a newline (\n) has a length of 1, and a carriage return (\r) has a length of one, but the combination of both of them (\r\n) also has a length of 1?
Note: Although Terraform is running on Linux, the config file was created on Windows (which is why it contains \r characters).
@apparentlymart This seems like the kind of bizarre behaviour that you enjoy hunting down.