hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Feature Request: Support Athena Tables #12129

Open · jhole89 opened this issue 4 years ago

jhole89 commented 4 years ago


Description

New resource for AWS Athena Table

New or Affected Resource(s)

aws_athena_table

Potential Terraform Configuration

resource "aws_athena_table" "myTable" {
  name   = "users"
  bucket = "${aws_s3_bucket.myBucket.bucket}"
  database = "${aws_athena_database.myDatabase.name}"
  external = true
  comment = "my table comment"
  row_format = "SERDE"
  serde_properties {
    name = "org.apache.hadoop.hive.serde2.RegexSerDe"
  }
  stored_as = "TEXTFILE"
  table_properties {
  ...
  }
  columns {
    name = "my_string"
    type = "string"
  }

  columns {
    name = "my_double"
    type = "double"
  }

  columns {
    name    = "my_date"
    type    = "date"
    comment = ""
  }

  columns {
    name    = "my_bigint"
    type    = "bigint"
    comment = ""
  }

  columns {
    name    = "my_struct"
    type    = "struct<my_nested_string:string>"
    comment = ""
  }

  partitions {
    name    = "my_date"
    type    = "date"
    comment = ""
  }
}


raphaelfavier commented 3 years ago

Supporting tables would be awesome.

However, our use case is to use Terraform to store the code that creates our tables. Athena generates DDL like this:

From a CSV file:

CREATE EXTERNAL TABLE `locations`(
  `facility` string, 
  `latitude` float, 
  `longitude` float, 
  `city` string, 
  `postcode` string, 
  `totalbays` int)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxx/xxxx/xxxmetadata'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1615471970')

From an Avro file:

CREATE EXTERNAL TABLE `queen`(
  `transactionid` int COMMENT 'from deserializer', 
  `licenseplate` string COMMENT 'from deserializer', 
  `sitecode` string COMMENT 'from deserializer', 
  `facility` string COMMENT 'from deserializer', 
  `subscriptionmodel` string COMMENT 'from deserializer', 
  `entry_dt_utc` bigint COMMENT 'from deserializer', 
  `entry_dt_local` bigint COMMENT 'from deserializer', 
  `exit_dt_utc` bigint COMMENT 'from deserializer', 
  `exit_dt_local` bigint COMMENT 'from deserializer', 
  `paymentamount` float COMMENT 'from deserializer', 
  `reductionamount` float COMMENT 'from deserializer', 
  `invoiceamount` float COMMENT 'from deserializer', 
  `citypasstransactionamount` float COMMENT 'from deserializer', 
  `paymethod` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://xxxx/xxxx/xxxxx'
TBLPROPERTIES (
  'transient_lastDdlTime'='1615400679')

We'd love to be able to use that DDL directly in Terraform; it would save us writing out each file's entire schema.
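
To illustrate, here is roughly what the CSV table above looks like when hand-translated to the existing aws_glue_catalog_table resource, which is exactly the schema spelling-out we would like to avoid. A sketch only: the S3 location is a placeholder, and ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' is assumed to map to LazySimpleSerDe with a field.delim parameter.

```hcl
# Sketch: the "locations" CSV DDL above expressed as a Glue catalog table.
# The S3 location is a placeholder; LazySimpleSerDe with field.delim is the
# assumed equivalent of ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
resource "aws_glue_catalog_table" "locations" {
  database_name = "default"
  name          = "locations"
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "EXTERNAL"           = "TRUE"
    "has_encrypted_data" = "false"
  }

  storage_descriptor {
    location      = "s3://xxxx/xxxx/xxxmetadata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }

    columns {
      name = "facility"
      type = "string"
    }
    columns {
      name = "latitude"
      type = "float"
    }
    columns {
      name = "longitude"
      type = "float"
    }
    columns {
      name = "city"
      type = "string"
    }
    columns {
      name = "postcode"
      type = "string"
    }
    columns {
      name = "totalbays"
      type = "int"
    }
  }
}
```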

ahaffar commented 3 years ago

@zizipoil @jhole89 Can't we use aws_glue_catalog_table and use it in Athena?

jkrnak commented 3 years ago

> Can't we use aws_glue_catalog_table and use it in Athena?

@ahaffar That would not support partition projection: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html

Use case: https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-partition-projection

tlvince commented 2 years ago

@jkrnak @ahaffar, I was able to create a Glue table with partition projection using the parameters argument ("projection.enabled" = "true"). It showed up as a partitioned table in Athena and queried fine. Perhaps all that's needed is an update to the docs?

Example for CloudWatch logs:

```hcl
resource "aws_glue_catalog_table" "cloudwatch_logs" {
  database_name = "default"
  name          = "cloudwatch-example"
  owner         = "hadoop"

  parameters = {
    "EXTERNAL"                          = "TRUE"
    "projection.enabled"                = "true"
    "projection.log_date.format"        = "yyyy/MM/dd"
    "projection.log_date.interval"      = "1"
    "projection.log_date.interval.unit" = "DAYS"
    "projection.log_date.range"         = "2019/09/10,NOW"
    "projection.log_date.type"          = "date"
    "skip.header.line.count"            = "2"
    "storage.location.template"         = "s3://cloudwatch-logs/by-date/$${log_date}/"
  }

  retention  = 0
  table_type = "EXTERNAL_TABLE"

  partition_keys {
    name = "log_date"
    type = "string"
  }

  storage_descriptor {
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    location      = "s3://cloudwatch-logs/by-date"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    columns {
      name = "date"
      type = "date"
    }
    columns {
      name = "time"
      type = "string"
    }
    columns {
      name = "location"
      type = "string"
    }
    columns {
      name = "bytes"
      type = "bigint"
    }
    columns {
      name = "request_ip"
      type = "string"
    }
    columns {
      name = "method"
      type = "string"
    }
    columns {
      name = "host"
      type = "string"
    }
    columns {
      name = "uri"
      type = "string"
    }
    columns {
      name = "status"
      type = "int"
    }
    columns {
      name = "referrer"
      type = "string"
    }
    columns {
      name = "user_agent"
      type = "string"
    }
    columns {
      name = "query_string"
      type = "string"
    }
    columns {
      name = "cookie"
      type = "string"
    }
    columns {
      name = "result_type"
      type = "string"
    }
    columns {
      name = "request_id"
      type = "string"
    }
    columns {
      name = "host_header"
      type = "string"
    }
    columns {
      name = "request_protocol"
      type = "string"
    }
    columns {
      name = "request_bytes"
      type = "bigint"
    }
    columns {
      name = "time_taken"
      type = "float"
    }
    columns {
      name = "xforwarded_for"
      type = "string"
    }
    columns {
      name = "ssl_protocol"
      type = "string"
    }
    columns {
      name = "ssl_cipher"
      type = "string"
    }
    columns {
      name = "response_result_type"
      type = "string"
    }
    columns {
      name = "http_version"
      type = "string"
    }
    columns {
      name = "fle_status"
      type = "string"
    }
    columns {
      name = "fle_encrypted_fields"
      type = "int"
    }
    columns {
      name = "c_port"
      type = "int"
    }
    columns {
      name = "time_to_first_byte"
      type = "float"
    }
    columns {
      name = "x_edge_detailed_result_type"
      type = "string"
    }
    columns {
      name = "sc_content_type"
      type = "string"
    }
    columns {
      name = "sc_content_len"
      type = "bigint"
    }
    columns {
      name = "sc_range_start"
      type = "bigint"
    }
    columns {
      name = "sc_range_end"
      type = "bigint"
    }

    ser_de_info {
      parameters = {
        "field.delim"          = "\t"
        "serialization.format" = "\t"
      }
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
    }
  }
}
```
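
For reference, with partition projection enabled like this, Athena computes the partition values at query time, so new partitions never need to be registered with MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION, or a Glue crawler.
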
ashgit301 commented 2 years ago

Any update on this feature? Is it in progress?

baolsen commented 1 year ago

Unfortunately this approach doesn't work for all use cases.

E.g. WAF logs, which have nested JSON data.

Create table statement: https://docs.aws.amazon.com/athena/latest/ug/waf-logs.html

When I convert that into Terraform using the aws_glue_catalog_table resource and apply it, I get errors like:

Error: creating Glue Catalog Table (waf_logs_tf): ValidationException: 8 validation errors detected:

- Value ' array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > ' at 'table.storageDescriptor.columns.7.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' array < struct < rulegroupid: string, terminatingrule: struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > >, nonterminatingmatchingrules: array < struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > > >, excludedrules: string > > ' at 'table.storageDescriptor.columns.10.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' array < struct < ratebasedruleid: string, limitkey: string, maxrateallowed: int > > ' at 'table.storageDescriptor.columns.11.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' array < struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > >, captcharesponse: struct < responsecode: string, solvetimestamp: string > > > ' at 'table.storageDescriptor.columns.12.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' array < struct < name: string, value: string > > ' at 'table.storageDescriptor.columns.13.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' struct < clientip: string, country: string, headers: array < struct < name: string, value: string > >, uri: string, args: string, httpversion: string, httpmethod: string, requestid: string > ' at 'table.storageDescriptor.columns.15.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' array < struct < name: string > > ' at 'table.storageDescriptor.columns.16.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
- Value ' struct < responsecode: string, solvetimestamp: string, failureReason: string > ' at 'table.storageDescriptor.columns.17.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*

Perhaps a separate issue?

stewartcampbell commented 1 year ago

I have this working for WAF v2:

resource "aws_glue_catalog_table" "waf_log" {
  name          = "projected_partition"
  database_name = aws_athena_database.waf_log.id
  table_type    = "EXTERNAL_TABLE"
  parameters = {
    "EXTERNAL"                  = "TRUE"
    "has_encrypted_data"        = "false"
    "projection.day.digits"     = "2"
    "projection.day.range"      = "01,31"
    "projection.day.type"       = "integer"
    "projection.enabled"        = "true"
    "projection.hour.digits"    = "2"
    "projection.hour.range"     = "00,23"
    "projection.hour.type"      = "integer"
    "projection.month.digits"   = "2"
    "projection.month.range"    = "01,12"
    "projection.month.type"     = "integer"
    "projection.year.digits"    = "4"
    "projection.year.range"     = "2021,2042"
    "projection.year.type"      = "integer"
    "storage.location.template" = "s3://${aws_s3_bucket.waf_log_bucket.id}/$${year}/$${month}/$${day}/$${hour}"
  }
  partition_keys {
    name = "year"
    type = "int"
  }
  partition_keys {
    name = "month"
    type = "int"
  }
  partition_keys {
    name = "day"
    type = "int"
  }
  partition_keys {
    name = "hour"
    type = "int"
  }
  storage_descriptor {
    location      = "s3://${aws_s3_bucket.waf_log_bucket.id}"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }
    columns {
      name = "timestamp"
      type = "bigint"
    }
    columns {
      name = "formatversion"
      type = "int"
    }
    columns {
      name = "webaclid"
      type = "string"
    }
    columns {
      name = "terminatingruleid"
      type = "string"
    }
    columns {
      name = "terminatingruletype"
      type = "string"
    }
    columns {
      name = "action"
      type = "string"
    }
    columns {
      name = "terminatingrulematchdetails"
      type = "array<struct<conditiontype:string,location:string,matcheddata:array<string>>>"
    }
    columns {
      name = "httpsourcename"
      type = "string"
    }
    columns {
      name = "httpsourceid"
      type = "string"
    }
    columns {
      name = "rulegrouplist"
      type = "array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:string>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>>,excludedrules:array<struct<ruleid:string,exclusiontype:string>>>>"
    }
    columns {
      name = "ratebasedrulelist"
      type = "array<struct<ratebasedruleid:string,ratebasedrulename:string,limitkey:string,maxrateallowed:int,limitvalue:string>>"
    }
    columns {
      name = "nonterminatingmatchingrules"
      type = "array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>,captcharesponse:struct<responsecode:string,solvetimestamp:bigint>>>"
    }
    columns {
      name = "requestheadersinserted"
      type = "string"
    }
    columns {
      name = "responsecodesent"
      type = "string"
    }
    columns {
      name = "httprequest"
      type = "struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string>"
    }
    columns {
      name = "labels"
      type = "array<struct<name:string>>"
    }
    columns {
      name = "captcharesponse"
      type = "struct<responsecode:string,solvetimestamp:bigint,failurereason:string>"
    }
  }
}
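
One detail worth noting in the config above: the storage.location.template value escapes the projection placeholders as $${year}, $${month}, and so on, because a literal ${...} would otherwise be interpreted by Terraform as one of its own interpolations.
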
baolsen commented 1 year ago

Thanks @stewartcampbell for the comment, super helpful. You encouraged me to dig a little deeper into my issue... and I got it to work.

It turned out that my issue was the Terraform heredocs I was using to define the complex column types: the newlines they introduce fail Glue's type validation (the pattern in the errors above permits spaces and tabs but not newlines), so each type has to be collapsed onto a single line.

I also noticed some small differences between the schema AWS has published and the schema you provided, for example the sensitivitylevel field in rulematchdetails (there may be more).

Below is my solution, based on the current AWS spec, in case it is useful to someone.

resource "aws_glue_catalog_table" "waf_logs" {
  database_name = "default"
  name          = "waf_logs"

  parameters = {
    "EXTERNAL" = "TRUE"
    # "has_encrypted_data"            = "false"
    "projection.enabled"            = "true",
    "projection.date.type"          = "date",
    "projection.date.range"         = "2023/01/01,NOW",
    "projection.date.format"        = "yyyy/MM/dd",
    "projection.date.interval"      = "1",
    "projection.date.interval.unit" = "DAYS",
    "storage.location.template"     = "s3://${local.mybucket}/$${date}/"
  }

  table_type = "EXTERNAL_TABLE"

  partition_keys {
    name = "date"
    type = "string"
  }

  storage_descriptor {
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    location = "s3://${local.mybucket}/"

    columns {
      name = "timestamp"
      type = "bigint"
    }

    columns {
      name = "formatversion"
      type = "int"
    }

    columns {
      name = "webaclid"
      type = "string"
    }

    columns {
      name = "terminatingruleid"
      type = "string"
    }

    columns {
      name = "terminatingruletype"
      type = "string"
    }

    columns {
      name = "action"
      type = "string"
    }

    columns {
      name = "terminatingrulematchdetails"
      type = replace(replace(<<EOF
        array <
          struct <
            conditiontype: string,
            sensitivitylevel: string,
            location: string,
            matcheddata: array < string >
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "httpsourcename"
      type = "string"
    }

    columns {
      name = "httpsourceid"
      type = "string"
    }

    columns {
      name = "rulegrouplist"
      type = replace(replace(<<EOF
        array <
          struct <
            rulegroupid: string,
            terminatingrule: struct <
              ruleid: string,
              action: string,
              rulematchdetails: array <
                struct<
                  conditiontype: string,
                  sensitivitylevel: string,
                  location: string,
                  matcheddata: array < string >
                >
              >
            >,
            nonterminatingmatchingrules: array <
              struct<
                ruleid: string,
                action: string,
                rulematchdetails: array <
                  struct <
                    conditiontype: string,
                    sensitivitylevel: string,
                    location: string,
                    matcheddata: array < string >
                  >
                >
              >
            >,
            excludedrules: string
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "ratebasedrulelist"
      type = replace(replace(<<EOF
        array <
          struct <
            ratebasedruleid: string,
            limitkey: string,
            maxrateallowed: int
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "nonterminatingmatchingrules"
      type = replace(replace(<<EOF
        array <
          struct <
            ruleid: string,
            action: string,
            rulematchdetails: array <
              struct <
                conditiontype: string,
                sensitivitylevel: string,
                location: string,
                matcheddata: array < string >
              >
            >,
            captcharesponse: struct <
              responsecode: string,
              solvetimestamp: string
            >
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "requestheadersinserted"
      type = replace(replace(<<EOF
        array <
          struct <
            name: string,
            value: string
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "responsecodesent"
      type = "string"
    }

    columns {
      name = "httprequest"
      type = replace(replace(<<EOF
        struct <
          clientip: string,
          country: string,
          headers: array <
            struct <
              name: string,
              value: string
            >
          >,
          uri: string,
          args: string,
          httpversion: string,
          httpmethod: string,
          requestid: string
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "labels"
      type = replace(replace(<<EOF
        array <
          struct <
            name: string
          >
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }

    columns {
      name = "captcharesponse"
      type = replace(replace(<<EOF
        struct <
          responsecode: string,
          solvetimestamp: string,
          failureReason: string
        >
      EOF
      , "\n", ""), "/\\s+/", "")
    }
  }
}
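
For anyone puzzled by the nested replace() calls: the inner call strips the heredoc's newlines (the characters Glue rejects), and the outer call, using Terraform's /regex/ form of replace(), strips all remaining whitespace, collapsing each type onto a single line. A minimal illustration, with an arbitrary output name:

```hcl
# Hypothetical illustration of the whitespace-collapsing trick used above.
output "collapsed_type" {
  value = replace(replace(<<-EOF
    array <
      struct < name: string >
    >
  EOF
  , "\n", ""), "/\\s+/", "")
  # => "array<struct<name:string>>"
}
```
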
pedrobento988 commented 1 year ago

This resource would also be useful for declaring Iceberg tables. Related StackOverflow posts:

https://stackoverflow.com/questions/75383898/issues-when-i-try-to-configure-an-aws-athena-iceberg-table-using-terraform?noredirect=1&lq=1
https://stackoverflow.com/questions/75581933/how-to-deploy-iceberg-tables-to-aws-through-terraform

breathingdust commented 8 months ago

Hi all 👋 I wanted to comment here given the significant interest in this feature. AWS does not offer a direct way to create and manage an Athena table through the AWS Go SDK, the library through which the provider interacts with AWS. As a result, adding this feature to the provider would require translating Terraform schema into DDL and vice versa, and issuing those operations against a native SQL/JDBC/ODBC driver. This is not something the provider is designed to do, nor is it something the maintainer team would feel comfortable supporting.

For that reason, we consider this feature request blocked upstream, and will revisit it if that functionality is released by AWS. However, as others on this thread have demonstrated, Athena table creation can be accomplished with the aws_glue_catalog_table resource. We understand this is not exactly the answer you are looking for, but hope you understand our reasoning.

WarFox commented 6 months ago

One way to do this is to use a local-exec provisioner with a null_resource to run the DDL. Example here: https://github.com/WarFox/terraform-iceberg
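
A minimal sketch of that approach for an Iceberg table: all names, paths, and the DDL itself are placeholders, the AWS CLI is assumed to be available wherever Terraform runs, and the primary workgroup is assumed to have a query result location configured.

```hcl
locals {
  # Hypothetical DDL; database, table, and bucket names are placeholders.
  create_iceberg_ddl = <<-EOT
    CREATE TABLE IF NOT EXISTS my_database.my_iceberg_table (
      id   bigint,
      name string)
    LOCATION 's3://my-data-bucket/iceberg/'
    TBLPROPERTIES ('table_type'='ICEBERG')
  EOT
}

resource "null_resource" "create_iceberg_table" {
  # Re-run the DDL whenever it changes.
  triggers = {
    ddl = local.create_iceberg_ddl
  }

  provisioner "local-exec" {
    command = <<-EOT
      aws athena start-query-execution \
        --query-string "${local.create_iceberg_ddl}" \
        --work-group primary
    EOT
  }
}
```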