jhole89 opened 4 years ago
Supporting tables would be awesome. Our use case would be to use Terraform to store the code that creates our tables. Athena generates DDL like this:
From a CSV file:
```sql
CREATE EXTERNAL TABLE `locations`(
  `facility` string,
  `latitude` float,
  `longitude` float,
  `city` string,
  `postcode` string,
  `totalbays` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxx/xxxx/xxxmetadata'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1615471970')
```
From an Avro file:
```sql
CREATE EXTERNAL TABLE `queen`(
  `transactionid` int COMMENT 'from deserializer',
  `licenseplate` string COMMENT 'from deserializer',
  `sitecode` string COMMENT 'from deserializer',
  `facility` string COMMENT 'from deserializer',
  `subscriptionmodel` string COMMENT 'from deserializer',
  `entry_dt_utc` bigint COMMENT 'from deserializer',
  `entry_dt_local` bigint COMMENT 'from deserializer',
  `exit_dt_utc` bigint COMMENT 'from deserializer',
  `exit_dt_local` bigint COMMENT 'from deserializer',
  `paymentamount` float COMMENT 'from deserializer',
  `reductionamount` float COMMENT 'from deserializer',
  `invoiceamount` float COMMENT 'from deserializer',
  `citypasstransactionamount` float COMMENT 'from deserializer',
  `paymethod` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://xxxx/xxxx/xxxxx'
TBLPROPERTIES (
  'transient_lastDdlTime'='1615400679')
```
We'd love to be able to use this DDL directly in Terraform; that would save us from writing out the entire schema description for each file.
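Until a native resource exists, the CSV table above can be approximated with the existing `aws_glue_catalog_table` resource. A sketch, with the database name as a placeholder; the choice of `LazySimpleSerDe` with a comma delimiter mirrors `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','`:

```hcl
# Sketch: the `locations` CSV table above expressed as a Glue catalog table.
# "my_database" is a placeholder; the S3 path is copied from the DDL.
resource "aws_glue_catalog_table" "locations" {
  name          = "locations"
  database_name = "my_database"
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "EXTERNAL" = "TRUE"
  }

  storage_descriptor {
    location      = "s3://xxxx/xxxx/xxxmetadata"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    # ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    ser_de_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }

    columns {
      name = "facility"
      type = "string"
    }
    columns {
      name = "latitude"
      type = "float"
    }
    # ... remaining columns (longitude, city, postcode, totalbays) follow the same pattern
  }
}
```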
@zizipoil @jhole89 can't we use `aws_glue_catalog_table` and use it in Athena?
> can't we use `aws_glue_catalog_table` and use it in Athena?
@ahaffar that would not support partition projection: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html Use case: https://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html#create-cloudtrail-table-partition-projection
@jkrnak @ahaffar, I was able to create a Glue table with partition projection using the `parameters` argument (`"projection.enabled" = "true"`). It showed as a partitioned table in Athena and queried fine. Perhaps all that's needed is an update to the docs?
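For anyone looking for a minimal version of that: partition projection is configured entirely through the `parameters` map. A sketch, assuming date-partitioned JSON logs; the table name, database, bucket, and date range are placeholders:

```hcl
# Sketch: minimal Glue table with Athena partition projection enabled.
# All names, paths, and the date range are placeholders.
resource "aws_glue_catalog_table" "projected" {
  name          = "projected_table"
  database_name = "my_database"
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "projection.enabled"        = "true"
    "projection.dt.type"        = "date"
    "projection.dt.range"       = "2020/01/01,NOW"
    "projection.dt.format"      = "yyyy/MM/dd"
    # $$ escapes Terraform interpolation so Athena sees ${dt}
    "storage.location.template" = "s3://my-bucket/logs/$${dt}/"
  }

  partition_keys {
    name = "dt"
    type = "string"
  }

  storage_descriptor {
    location      = "s3://my-bucket/logs/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    columns {
      name = "message"
      type = "string"
    }
  }
}
```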
Any update on this feature? Is it in progress?
Unfortunately this approach doesn't work for all use cases, e.g. WAF logs, which have nested JSON data.
Create table statement: https://docs.aws.amazon.com/athena/latest/ug/waf-logs.html
When converting it into Terraform using the `aws_glue_catalog_table` resource and applying, I get errors like:
```
Error: creating Glue Catalog Table (waf_logs_tf): ValidationException: 8 validation errors detected:
Value ' array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > ' at 'table.storageDescriptor.columns.7.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' array < struct < rulegroupid: string, terminatingrule: struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > >, nonterminatingmatchingrules: array < struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > > > >, excludedrules: string > > ' at 'table.storageDescriptor.columns.10.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' array < struct < ratebasedruleid: string, limitkey: string, maxrateallowed: int > > ' at 'table.storageDescriptor.columns.11.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' array < struct < ruleid: string, action: string, rulematchdetails: array < struct < conditiontype: string, sensitivitylevel: string, location: string, matcheddata: array < string > > >, captcharesponse: struct < responsecode: string, solvetimestamp: string > > > ' at 'table.storageDescriptor.columns.12.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' array < struct < name: string, value: string > > ' at 'table.storageDescriptor.columns.13.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' struct < clientip: string, country: string, headers: array < struct < name: string, value: string > >, uri: string, args: string, httpversion: string, httpmethod: string, requestid: string > ' at 'table.storageDescriptor.columns.15.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' array < struct < name: string > > ' at 'table.storageDescriptor.columns.16.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*;
Value ' struct < responsecode: string, solvetimestamp: string, failureReason: string > ' at 'table.storageDescriptor.columns.17.member.type' failed to satisfy constraint: Member must satisfy regular expression pattern: [\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*
```
Perhaps a separate issue?
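For context on that validation error: the Glue API pattern `[\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF\t]*` permits spaces and tabs but not newlines, so a multi-line heredoc type string fails validation. One workaround (a sketch, assuming a heredoc-defined column type; the local name is hypothetical) is to strip the whitespace before using it:

```hcl
# Sketch: collapse a readable multi-line type definition into the single-line
# form the Glue API accepts (its validation regex rejects newlines).
locals {
  matchdetails_type = replace(replace(<<EOF
array <
  struct <
    conditiontype: string,
    location: string,
    matcheddata: array < string >
  >
>
EOF
  , "\n", ""), "/\\s+/", "")
}
```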
I have this working for WAF v2:
```hcl
resource "aws_glue_catalog_table" "waf_log" {
  name          = "projected_partition"
  database_name = aws_athena_database.waf_log.id
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "EXTERNAL"                  = "TRUE"
    "has_encrypted_data"        = "false"
    "projection.day.digits"     = "2"
    "projection.day.range"      = "01,31"
    "projection.day.type"       = "integer"
    "projection.enabled"        = "true"
    "projection.hour.digits"    = "2"
    "projection.hour.range"     = "00,23"
    "projection.hour.type"      = "integer"
    "projection.month.digits"   = "2"
    "projection.month.range"    = "01,12"
    "projection.month.type"     = "integer"
    "projection.year.digits"    = "4"
    "projection.year.range"     = "2021,2042"
    "projection.year.type"      = "integer"
    "storage.location.template" = "s3://${aws_s3_bucket.waf_log_bucket.id}/$${year}/$${month}/$${day}/$${hour}"
  }

  partition_keys {
    name = "year"
    type = "int"
  }
  partition_keys {
    name = "month"
    type = "int"
  }
  partition_keys {
    name = "day"
    type = "int"
  }
  partition_keys {
    name = "hour"
    type = "int"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.waf_log_bucket.id}"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    columns {
      name = "timestamp"
      type = "bigint"
    }
    columns {
      name = "formatversion"
      type = "int"
    }
    columns {
      name = "webaclid"
      type = "string"
    }
    columns {
      name = "terminatingruleid"
      type = "string"
    }
    columns {
      name = "terminatingruletype"
      type = "string"
    }
    columns {
      name = "action"
      type = "string"
    }
    columns {
      name = "terminatingrulematchdetails"
      type = "array<struct<conditiontype:string,location:string,matcheddata:array<string>>>"
    }
    columns {
      name = "httpsourcename"
      type = "string"
    }
    columns {
      name = "httpsourceid"
      type = "string"
    }
    columns {
      name = "rulegrouplist"
      type = "array<struct<rulegroupid:string,terminatingrule:struct<ruleid:string,action:string,rulematchdetails:string>,nonterminatingmatchingrules:array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>>>,excludedrules:array<struct<ruleid:string,exclusiontype:string>>>>"
    }
    columns {
      name = "ratebasedrulelist"
      type = "array<struct<ratebasedruleid:string,ratebasedrulename:string,limitkey:string,maxrateallowed:int,limitvalue:string>>"
    }
    columns {
      name = "nonterminatingmatchingrules"
      type = "array<struct<ruleid:string,action:string,rulematchdetails:array<struct<conditiontype:string,location:string,matcheddata:array<string>>>,captcharesponse:struct<responsecode:string,solvetimestamp:bigint>>>"
    }
    columns {
      name = "requestheadersinserted"
      type = "string"
    }
    columns {
      name = "responsecodesent"
      type = "string"
    }
    columns {
      name = "httprequest"
      type = "struct<clientip:string,country:string,headers:array<struct<name:string,value:string>>,uri:string,args:string,httpversion:string,httpmethod:string,requestid:string>"
    }
    columns {
      name = "labels"
      type = "array<struct<name:string>>"
    }
    columns {
      name = "captcharesponse"
      type = "struct<responsecode:string,solvetimestamp:bigint,failurereason:string>"
    }
  }
}
```
Thanks @stewartcampbell for the comment, super helpful. You encouraged me to dig a little deeper into my issue... and I got it to work.
It turned out that my issue was the whitespace introduced by using Terraform heredocs to define the complex column types.
I noticed some small differences between the schema that AWS has published and the schema you provided. Examples (there may be more):

- In `terminatingrulematchdetails`, the `sensitivityLevel` field is missing from your schema.
- `rulegrouplist.rulematchdetails` is a complex type in AWS; in your schema it is a `string`.

Below is my solution, based on the current AWS spec, in case it is useful to someone.
```hcl
resource "aws_glue_catalog_table" "waf_logs" {
  database_name = "default"
  name          = "waf_logs"
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "EXTERNAL" = "TRUE"
    # "has_encrypted_data" = "false"
    "projection.enabled"            = "true"
    "projection.date.type"          = "date"
    "projection.date.range"         = "2023/01/01,NOW"
    "projection.date.format"        = "yyyy/MM/dd"
    "projection.date.interval"      = "1"
    "projection.date.interval.unit" = "DAYS"
    "storage.location.template"     = "s3://${local.mybucket}/$${date}/"
  }

  partition_keys {
    name = "date"
    type = "string"
  }

  storage_descriptor {
    location      = "s3://${local.mybucket}/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"
    }

    columns {
      name = "timestamp"
      type = "bigint"
    }
    columns {
      name = "formatversion"
      type = "int"
    }
    columns {
      name = "webaclid"
      type = "string"
    }
    columns {
      name = "terminatingruleid"
      type = "string"
    }
    columns {
      name = "terminatingruletype"
      type = "string"
    }
    columns {
      name = "action"
      type = "string"
    }
    columns {
      name = "terminatingrulematchdetails"
      type = replace(replace(<<EOF
array <
  struct <
    conditiontype: string,
    sensitivitylevel: string,
    location: string,
    matcheddata: array < string >
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "httpsourcename"
      type = "string"
    }
    columns {
      name = "httpsourceid"
      type = "string"
    }
    columns {
      name = "rulegrouplist"
      type = replace(replace(<<EOF
array <
  struct <
    rulegroupid: string,
    terminatingrule: struct <
      ruleid: string,
      action: string,
      rulematchdetails: array <
        struct <
          conditiontype: string,
          sensitivitylevel: string,
          location: string,
          matcheddata: array < string >
        >
      >
    >,
    nonterminatingmatchingrules: array <
      struct <
        ruleid: string,
        action: string,
        rulematchdetails: array <
          struct <
            conditiontype: string,
            sensitivitylevel: string,
            location: string,
            matcheddata: array < string >
          >
        >
      >
    >,
    excludedrules: string
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "ratebasedrulelist"
      type = replace(replace(<<EOF
array <
  struct <
    ratebasedruleid: string,
    limitkey: string,
    maxrateallowed: int
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "nonterminatingmatchingrules"
      type = replace(replace(<<EOF
array <
  struct <
    ruleid: string,
    action: string,
    rulematchdetails: array <
      struct <
        conditiontype: string,
        sensitivitylevel: string,
        location: string,
        matcheddata: array < string >
      >
    >,
    captcharesponse: struct <
      responsecode: string,
      solvetimestamp: string
    >
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "requestheadersinserted"
      type = replace(replace(<<EOF
array <
  struct <
    name: string,
    value: string
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "responsecodesent"
      type = "string"
    }
    columns {
      name = "httprequest"
      type = replace(replace(<<EOF
struct <
  clientip: string,
  country: string,
  headers: array <
    struct <
      name: string,
      value: string
    >
  >,
  uri: string,
  args: string,
  httpversion: string,
  httpmethod: string,
  requestid: string
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "labels"
      type = replace(replace(<<EOF
array <
  struct <
    name: string
  >
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
    columns {
      name = "captcharesponse"
      type = replace(replace(<<EOF
struct <
  responsecode: string,
  solvetimestamp: string,
  failureReason: string
>
EOF
      , "\n", ""), "/\\s+/", "")
    }
  }
}
```
This resource would also be useful for declaring Iceberg tables. Related Stack Overflow posts: https://stackoverflow.com/questions/75383898/issues-when-i-try-to-configure-an-aws-athena-iceberg-table-using-terraform?noredirect=1&lq=1 https://stackoverflow.com/questions/75581933/how-to-deploy-iceberg-tables-to-aws-through-terraform
Hi all 👋 I wanted to comment here given the significant interest in this feature. AWS does not offer a direct way to create and manage an Athena table through the AWS Go SDK, the library through which the provider interacts with AWS. As a result, adding the feature to the provider would require translating the Terraform schema into DDL and vice versa, and issuing the operations against a native SQL/JDBC/ODBC driver. This is not something this provider is designed to do, nor is it something the maintainer team would feel comfortable supporting.
For that reason, we consider this feature request blocked upstream, and will revisit it if that functionality is released by AWS. However, as others on this thread have demonstrated, Athena table creation can be accomplished by using the `aws_glue_catalog_table` resource. We understand this is not exactly the answer you are looking for, but hope you understand our reasoning.
One way to do this is using the `local-exec` provisioner and a `null_resource` to run DDL. Example here: https://github.com/WarFox/terraform-iceberg
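A sketch of that approach; the DDL local, workgroup, and results bucket are placeholders, and it assumes the AWS CLI is installed and credentialed wherever `terraform apply` runs:

```hcl
# Sketch: run Athena DDL from Terraform via null_resource + local-exec.
# local.create_table_ddl is a hypothetical local holding the CREATE TABLE statement.
resource "null_resource" "create_iceberg_table" {
  # Re-run whenever the DDL changes
  triggers = {
    ddl = local.create_table_ddl
  }

  provisioner "local-exec" {
    command = <<EOF
aws athena start-query-execution \
  --query-string "${local.create_table_ddl}" \
  --work-group primary \
  --result-configuration "OutputLocation=s3://my-athena-results/"
EOF
  }
}
```

Note that `local-exec` only fires the query asynchronously; Terraform will not track the table's state or detect drift, which is the core limitation of this workaround.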
Description: New resource for AWS Athena Table
References: #1486