Add support for "raw" strings, which do not escape their contents.

jakebiesinger-onduo commented 4 years ago

Current Terraform Version

Terraform v0.12.24

Use-cases

Some resources tend to be heavy on regular expressions. For example, GCP's Stackdriver Monitoring metrics include the ability to pull out labels based on regex matches. Since HCL doesn't support any kind of "raw" string, these patterns can get pretty unwieldy.

For example, to match the string finished with status code: 200, we need the pattern finished with status code:\s(\d{3}). Expressing \d in HCL requires us to escape the backslash for HCL, and then escape it again for the resource:

resource "google_logging_metric" "batch_finished_status" {
  project = local.project
  name = "process-batch-events-finish"
  filter = <<-EOT
    resource.type="cloud_function"
    resource.labels.function_name="batch-thing"
    textPayload:"finished with status code:"
  EOT
  metric_descriptor {
    metric_kind = "DELTA"
    value_type = "INT64"

    labels {
      description = "Status code"
      key = "status_code"
      value_type = "STRING"
    }
  }

  label_extractors = {
    status_code = "REGEXP_EXTRACT(textPayload, \"finished with status code:\\\\s(\\\\d{3})\")"
  }
}

More complicated patterns can be even worse. For example, to match the string [SUCCESS_RATIO] 32.5%, we also have to escape the [ characters, as you'd expect in a regex. But since this is HCL, we have to escape the escapes as well, yielding a nearly-illegible entry:

  label_extractors = {
    ratio = "REGEXP_EXTRACT(textPayload, \"\\\\[SUCCESS_RATIO\\\\]\\\\s(\\\\d+\\\\.\\\\d+)\")"
  }

Attempted Solutions

For monitoring, our current workaround is to build all metrics in the UI and then import them into terraform. Not all resources have that option, and we're still left with code that's hard to decipher.
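One partial workaround that stays inside HCL is a heredoc. Heredoc bodies do not interpret backslash escapes, so only the escaping required by the resource itself remains. The following is a minimal sketch rather than a recommended pattern; the local name `ratio_extractor` is hypothetical.

```hcl
locals {
  # Inside a heredoc, backslashes and double quotes are taken literally,
  # so this line is exactly the string the resource receives.
  # chomp() removes the trailing newline that a heredoc always adds.
  ratio_extractor = chomp(<<-EOT
    REGEXP_EXTRACT(textPayload, "\\[SUCCESS_RATIO\\]\\s(\\d+\\.\\d+)")
  EOT
  )
}

# Hypothetical usage inside the google_logging_metric resource:
#   label_extractors = {
#     ratio = local.ratio_extractor
#   }
```

This removes the HCL layer of escaping only. Template sequences (`${` and `%{`) are still interpreted in a heredoc, so it is not a true raw string.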

Proposal

It would be nice to indicate to Terraform that a string should be treated as "raw", as you can in Python and in many other languages. For example, in Python, you can prefix a string with the r character to turn off all escaping within the string, meaning r"hello\tworld" will not expand \t into a tab character.

In HCL, it would be nice to have similar support. Then, my strings could become the more reasonable:

  label_extractors = {
    ratio = r"REGEXP_EXTRACT(textPayload, \"\\[SUCCESS_RATIO\\]\\s(\\d+\\.\\d+)\")"
  }

That internal " presents a problem, but the Python folks allow raw strings to still escape quotes. From their docs:

Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw literal cannot end in a single backslash (since the backslash would escape the following quote character).

References

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

pkolyvas commented 4 years ago

@jakebiesinger-onduo This is a cool initial proposal. We've certainly seen many cases where escaping requirements catch engineers out; the examples you've provided are, well, harrowing.

While the Terraform Team isn't likely to work on this in the near term, we'd certainly welcome a technical proposal to discuss working towards an eventual community PR.

mnatan commented 4 years ago

This is how we make it more manageable: [screenshot: Screen Shot 2020-08-07 at 18 15 32]. The jsonencode function escapes strings for you.
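The screenshot is not reproduced here, so the following is only a guess at the shape of the jsonencode workaround, applied to the SUCCESS_RATIO example from the original post. The local names are hypothetical.

```hcl
locals {
  # The regex with only HCL's own escaping applied (each backslash doubled once).
  # Resulting value: \[SUCCESS_RATIO\]\s(\d+\.\d+)
  ratio_regex = "\\[SUCCESS_RATIO\\]\\s(\\d+\\.\\d+)"

  # jsonencode() wraps the value in double quotes and doubles each backslash,
  # which supplies the second layer of escaping that the filter expects.
  ratio_extractor = "REGEXP_EXTRACT(textPayload, ${jsonencode(local.ratio_regex)})"
}

# Hypothetical usage inside the resource:
#   label_extractors = {
#     ratio = local.ratio_extractor
#   }
```

Under these assumptions, `local.ratio_extractor` evaluates to the same string as the hand-escaped version in the original post. One caveat: jsonencode also rewrites `<`, `>`, and `&` as Unicode escape sequences, so it may not suit patterns that contain those characters.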

mwarkentin commented 4 years ago

I'm struggling with this while trying to configure an AWS Glue Data Table for Athena queries against ALB access logs (docs).

For reference, the regex I'm trying to recreate is pretty crazy: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"

As far as I can tell, it needs to have the escaped chars (\" and \s). I've tried about 10 different combinations of inline strings, loading the regex string from a file, and using an <<EOT string without luck. I've gotten close with the \" characters working as expected but \s was becoming \\s.

My latest plan looked right:

~ "input.regex"          = <<~EOT
                      - ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) "([^ ]*) ([^ ]*) (- |[^ ]*)" "([^"]*)" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) "([^"]*)" "([^"]*)" "([^"]*)" ([-.0-9]*) ([^ ]*) "([^"]*)" "([^"]*)" "([^ ]*)" "([^\s]+?)" "([^\s]+)" "([^ ]*)" "([^ ]*)"
                      + ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"

But when I looked at the CREATE TABLE statement in lambda it had actually been applied as:

'input.regex'='([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" \\\"([^\\\"]*)\\\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" ([-.0-9]*) ([^ ]*) \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" \\\"([^ ]*)\\\" \\\"([^\\s]+?)\\\" \\\"([^\\s]+)\\\" \\\"([^ ]*)\\\" \\\"([^ ]*)\\\"\n')

Here's the slice of HCL which generated the above:

parameters = {
        "serialization.format" = 1
        "input.regex"          = <<EOT
([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^\s]+?)\" \"([^\s]+)\" \"([^ ]*)\" \"([^ ]*)\"
EOT
      }

Not really sure where else to go; the raw string feature is looking really appealing right now.
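A likely explanation for the extra backslashes is that a heredoc body is already literal: HCL does not process backslash escapes there, so `\"` is stored as a backslash followed by a quote, and the heredoc also appends a trailing newline. The sketch below uses a shortened fragment of the regex to illustrate this; it assumes the goal is for the stored value to contain bare quotes, and the local names are hypothetical.

```hcl
locals {
  # Heredoc body is literal: this value contains the backslash before each quote.
  with_backslashes = chomp(<<-EOT
    \"([^\s]+?)\"
  EOT
  )

  # Writing the quotes directly gives the value without the extra backslashes.
  literal = chomp(<<-EOT
    "([^\s]+?)"
  EOT
  )

  # The same value as `literal`, written as an ordinary quoted string.
  quoted = "\"([^\\s]+?)\""
}

output "heredoc_matches_quoted" {
  # true: the heredoc form and the quoted form describe the same string.
  value = local.literal == local.quoted
}
```

chomp() is used here to drop the trailing newline, which is visible as `\n` at the end of the applied value above.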

mwarkentin commented 4 years ago

I think I have managed to get this working, and it seems to be the case that the \"([^\\s]+?)\" should actually be \"([^s]+?)\".

Not sure why the docs needed \s; maybe it needed to be escaped for some reason when being entered through the console.

In any case, the final regex I used was ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) "([^ ]*) ([^ ]*) (- |[^ ]*)" "([^"]*)" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) "([^"]*)" "([^"]*)" "([^"]*)" ([-.0-9]*) ([^ ]*) "([^"]*)" "([^"]*)" "([^ ]*)" "([^s]+?)" "([^s]+)" "([^ ]*)" "([^ ]*)" being loaded in via file()

When rendering out the athena CREATE TABLE query from the table generated by the above, it became:

WITH SERDEPROPERTIES ( 
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^ ]*)\" \"([^s]+?)\" \"([^s]+)\" \"([^ ]*)\" \"([^ ]*)\"')

DavidGamba commented 4 years ago

[^\s] means anything except white space. [^s] means anything except the s character. They are not the same.
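The difference can be checked with Terraform's built-in regex function, for example in `terraform console`. This is a small illustration with a made-up sample string, not taken from the ALB log format.

```hcl
locals {
  sample = "pass word"

  # [^\s]+ matches a run of non-whitespace characters: "pass"
  non_whitespace = regex("[^\\s]+", local.sample)

  # [^s]+ matches a run of characters other than the letter s: "pa"
  non_s = regex("[^s]+", local.sample)
}
```

With `[^s]`, any field that contains the letter s is cut short or fails to match, so the two patterns are not interchangeable for log parsing.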

bcanvural commented 1 year ago

Up. This is a fundamental need. I would consider such a feature high priority, as it greatly affects developer experience.

bkalcho commented 1 year ago

+1

m3adow commented 4 months ago

Working with REGEXP_EXTRACT again and it's disgusting: "process_name" = "REGEXP_EXTRACT(textPayload, \"\\\\((.*?)\\\\)\")" Using jsonencode is a crutch, but not a nice solution.
I'd love to have this feature implemented.

hashicorp / terraform