hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0
9.81k stars 9.16k forks source link

[Bug]: Querying a "aws_glue_catalog_table" resource from Athena results in wrong result #31940

Open lpossamai opened 1 year ago

lpossamai commented 1 year ago

Terraform Core Version

1.5.0

AWS Provider Version

v5.3.0

Affected Resource(s)

aws_glue_catalog_table

Expected Behavior

When querying the table from the AWS Athena console, it should return data. But instead, it returns the following:

#   datehour
1   2023/06/01/00
2   2023/06/01/00
3   2023/06/01/02
4   2023/06/01/02
5   2023/06/01/07
6   2023/06/01/02
7   2023/06/01/02
8   2023/06/02/05
9   2023/06/01/02
10  2023/06/01/02

If I get the Table DDL, and create a secondary table directly from the AWS Console, I'm able to retrieve the data successfully. Which brings me to the conclusion this is a bug.

This screenshot shows the output from the table created by Terraform: image

This screenshot shows the output from the table created directly from the AWS console, but using the "Generate DDL" from the TF table: image

Actual Behavior

It is returning wrong data when querying the table from the AWS Athena console:

#   datehour
1   2023/06/01/00
2   2023/06/01/00
3   2023/06/01/02
4   2023/06/01/02
5   2023/06/01/07
6   2023/06/01/02
7   2023/06/01/02
8   2023/06/02/05
9   2023/06/01/02
10  2023/06/01/02

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

This configuration block creates an Athena table that pulls data from an existing S3 bucket.

resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
  count         = terraform.workspace == "test" ? 1 : 0
  name          = "sms_events"
  database_name = aws_glue_catalog_database.aws_glue_catalog_database[count.index].name
  owner = "hadoop"

  table_type = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://${module.kinesis_delivery_stream_s3_bucket[count.index].s3_bucket_id}/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    compressed    = false
    number_of_buckets = -1

    ser_de_info {
      serialization_library = "org.apache.hive.hcatalog.data.JsonSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    skewed_info {
      skewed_column_names = []
      skewed_column_values = []
      skewed_column_value_location_maps = {}
    }

    columns {
      name = "event_type"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "event_timestamp"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "arrival_timestamp"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "event_version"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "application"
      type = "struct<app_id:string,sdk:string>"
      comment = "from deserializer"
    }

    columns {
      name = "client"
      type = "struct<client_id:string>"
      comment = "from deserializer"
    }

    columns {
      name = "device"
      type = "struct<platform:string>"
      comment = "from deserializer"
    }

    columns {
      name = "session"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "attributes"
      type = "struct<sender_request_id:string,campaign_activity_id:string,origination_phone_number:string,destination_phone_number:string,record_status:string,iso_country_code:string,treatment_id:bigint,number_of_message_parts:bigint,message_id:string,message_type:string,campaign_id:string>"
      comment = "from deserializer"
    }

    columns {
      name = "metrics"
      type = "struct<price_in_millicents_usd:string>"
      comment = "from deserializer"
    }

    columns {
      name = "awsaccountid"
      type = "bigint"
      comment = "from deserializer"
    }
  }

  partition_keys {
    name = "datehour"
    type = "string"
  }

  parameters = {
    EXTERNAL = "TRUE"
    "projection.datehour.type"          = "date"
    "projection.datehour.range"         = "2022/02/01/00,NOW"
    "projection.datehour.format"        = "yyyy/MM/dd/HH"
    "projection.datehour.interval"      = "1"
    "projection.datehour.interval.unit" = "HOURS"
    "projection.enabled"                = "true"
    "storage.location.template"         = "s3://${module.kinesis_delivery_stream_s3_bucket[count.index].s3_bucket_id}/$${datehour}/"
  }
}

Steps to Reproduce

  1. Create a new S3 bucket in the following format (to simulate Kinesis data injection for SMS Events): s3://s3bucketname/2023/06/14/00/pinpoint-firehose-http-delivery-stream-test-6-2023-06-14-00-01-35-ab06043f-162b-4942-a470-524da1097bb5.gz
  2. Create a new aws_glue_catalog_database
resource "aws_glue_catalog_database" "example" {
  name  = "example"
  create_table_default_permission {
    permissions = ["SELECT"]

    principal {
      data_lake_principal_identifier = "IAM_ALLOWED_PRINCIPALS"
    }
  }
}
  1. Create a new aws_glue_catalog_table
resource "aws_glue_catalog_table" "example" {
  name          = "sms_events"
  database_name = aws_glue_catalog_database.example.name
  owner = "hadoop"

  table_type = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://s3bucketname/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    compressed    = false
    number_of_buckets = -1

    ser_de_info {
      serialization_library = "org.apache.hive.hcatalog.data.JsonSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    skewed_info {
      skewed_column_names = []
      skewed_column_values = []
      skewed_column_value_location_maps = {}
    }

    columns {
      name = "event_type"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "event_timestamp"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "arrival_timestamp"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "event_version"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "application"
      type = "struct<app_id:string,sdk:string>"
      comment = "from deserializer"
    }

    columns {
      name = "client"
      type = "struct<client_id:string>"
      comment = "from deserializer"
    }

    columns {
      name = "device"
      type = "struct<platform:string>"
      comment = "from deserializer"
    }

    columns {
      name = "session"
      type = "string"
      comment = "from deserializer"
    }

    columns {
      name = "attributes"
      type = "struct<sender_request_id:string,campaign_activity_id:string,origination_phone_number:string,destination_phone_number:string,record_status:string,iso_country_code:string,treatment_id:bigint,number_of_message_parts:bigint,message_id:string,message_type:string,campaign_id:string>"
      comment = "from deserializer"
    }

    columns {
      name = "metrics"
      type = "struct<price_in_millicents_usd:string>"
      comment = "from deserializer"
    }

    columns {
      name = "awsaccountid"
      type = "bigint"
      comment = "from deserializer"
    }
  }

  partition_keys {
    name = "datehour"
    type = "string"
  }

  parameters = {
    EXTERNAL = "TRUE"
    "projection.datehour.type"          = "date"
    "projection.datehour.range"         = "2022/02/01/00,NOW"
    "projection.datehour.format"        = "yyyy/MM/dd/HH"
    "projection.datehour.interval"      = "1"
    "projection.datehour.interval.unit" = "HOURS"
    "projection.enabled"                = "true"
    "storage.location.template"         = "s3://s3bucketname/$${datehour}/"
  }
}
  1. Go to the AWS Athena console, and query that new table: select * from sms_events limit 10;

The pinpoint-firehose-http-delivery-stream-test-6-2023-06-14-00-01-35-ab06043f-162b-4942-a470-524da1097bb5.gz (json) file should look like this:

{
    "application": {
        "app_id": "d9dc",
        "sdk": {}
    },
    "application": {
        "app_id": "d9dc",
        "sdk": {}
    },
    "arrival_timestamp": 1686614411026,
    "arrival_timestamp": 1686614407595,
    "attributes": {
        "feedback": "delivered",
        "tags": null
    },
    "attributes": {
        "feedback": "received",
        "tags": null
    },
    "awsAccountId": "666",
    "awsAccountId": "666",
    "client": {
        "client_id": "3823ad9b"
    },
    "client": {
        "client_id": "f488aa0a"
    },
    "device": {
        "platform": {}
    },
    "device": {
        "platform": {}
    },
    "event_timestamp": 1686614407497,
    "event_timestamp": 1686614407497,
    "event_type": "_email.delivered",
    "event_type": "_email.send",
    "event_version": "3.1",
    "event_version": "3.1",
    "facets": {
        "email_channel": {
            "mail_event": {
                "delivery": {
                    "event_timestamp": 1686614411026,
                    "processing_time_millis": 3529,
                    "recipients": [
                        "example@example.com"
                    ],
                    "reporting_mta": "b232-5.smtp-out.ap-southeast-2.amazonses.com",
                    "smtp_response": "250 2.0.0 OK  1686614410 lg8-20020a170902fb8800b001a1bfe84f8dsi7493700plb.611 - gsmtp"
                },
                "mail": {
                    "common_headers": {
                        "date": "Tue, 13 Jun 2023 00:00:07 +0000",
                        "from": "MJRAU <noreply@example.com>",
                        "subject": "Your job 1008 with MJRAU is approaching",
                        "to": [
                            "example@example.com"
                        ]
                    },
                    "destination": [
                        "example@example.com"
                    ],
                    "from_address": "MJRAU<noreply@example.com>",
                    "headers": [
                        {
                            "name": "Date",
                            "value": "Tue, 13 Jun 2023 00:00:07 +0000"
                        },
                        {
                            "name": "From",
                            "value": "MJRAU <noreply@example.com>"
                        },
                        {
                            "name": "To",
                            "value": "example@example.com"
                        },
                        {
                            "name": "Subject",
                            "value": "Your job 1008 with MJRAU is approaching"
                        },
                        {
                            "name": "MIME-Version",
                            "value": "1.0"
                        },
                        {
                            "name": "Content-Type",
                            "value": "text/html; charset=UTF-8"
                        },
                        {
                            "name": "Content-Transfer-Encoding",
                            "value": "quoted-printable"
                        }
                    ],
                    "headers_truncated": false,
                    "message_id": "02080000c69ufks3",
                    "message_send_timestamp": 1686614407497
                }
            }
        }
    },
    "facets": {
        "email_channel": {
            "mail_event": {
                "mail": {
                    "common_headers": {
                        "from": "MJRAU <noreply@example.com>",
                        "to": [
                            "example@example.com"
                        ]
                    },
                    "destination": [
                        "example@example.com"
                    ],
                    "from_address": "MJRAU<noreply@example.com>",
                    "headers": [
                        {
                            "name": "From",
                            "value": "MJRAU <noreply@example.com>"
                        },
                        {
                            "name": "To",
                            "value": "example@example.com"
                        },
                        {
                            "name": "MIME-Version",
                            "value": "1.0"
                        }
                    ],
                    "headers_truncated": false,
                    "message_id": "02080000c69ufks3",
                    "message_send_timestamp": 1686614407497
                },
                "send": {}
            }
        }
    },
    "session": {},
    "session": {}
}
}
}

Note that if I manually create the table from the AWS Athena console, and then import the table into Terraform, everything works as expected.

Once the table has been imported to TF, the terraform plan output looks like this:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # aws_glue_catalog_table.aws_glue_catalog_table[0] will be updated in-place
  ~ resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
        id            = "666:example:sms_events"
        name          = "sms_events"
      ~ parameters    = {
          - "transient_lastDdlTime"             = "1686715211" -> null
            # (8 unchanged elements hidden)
        }
        # (6 unchanged attributes hidden)

        # (2 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

No

github-actions[bot] commented 1 year ago

Community Note

Voting for Prioritization

Volunteering to Work on This Issue