Hi @rshamunov!
Looking at your code I noticed that you are not using the correct format that the Terraform Registry describes.
Try something like this (note the source_uri_prefix and source_uris arguments: both use the bucket name, but follow the Terraform Registry format):
resource "google_bigquery_dataset" "bq_ds_19177" {
dataset_id = "bq_ds_19177"
friendly_name = "BQ DS 19177"
description = "something"
location = "EU"
default_table_expiration_ms = 3600000
labels = {
env = "something"
}
}
resource "google_storage_bucket" "bucket_19177" {
name = "bucket-19177"
location = "US"
}
resource "google_bigquery_table" "bq_table_19177" {
dataset_id = google_bigquery_dataset.bq_ds_19177.dataset_id
project = "my-project"
table_id = "bq-table-19177"
deletion_protection = false
labels = {
l1 = "hello"
}
external_data_configuration {
autodetect = false
source_format = "CSV"
source_uris = [
"gs://${google_storage_bucket.bucket_19177.name}/*"
]
max_bad_records = 0
ignore_unknown_values = false
csv_options {
field_delimiter = ","
skip_leading_rows = 1
quote = "\""
}
hive_partitioning_options {
mode = "CUSTOM"
source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/{key1:STRING}"
require_partition_filter = false
}
schema = <<EOF
[
{
"name": "Id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "Name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "Weight",
"type": "INTEGER",
"mode": "REQUIRED"
}
]
EOF
}
}
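As a side note, with mode = "CUSTOM" the {key1:STRING} prefix only declares the partition key and its type; the objects themselves still need to live under matching key1=<value>/ paths. A hypothetical companion object for this example (object name and CSV content are made up for illustration):

```hcl
# Hypothetical: the CUSTOM hive partitioning prefix above expects data at
# paths like gs://bucket-19177/key1=<value>/...
resource "google_storage_bucket_object" "sample_csv_19177" {
  bucket  = google_storage_bucket.bucket_19177.name
  name    = "key1=demo/data.csv"
  content = "Id,Name,Weight\n1,apple,150\n"
}
```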
Hello @ggtisc. Here is the TF configuration with variable names replaced by the actual values in my test project:
resource "google_bigquery_table" "test-hived-table" {
dataset_id = "example_dataset"
project = "sh-unmanaged-tests"
table_id = "test-hived-table-tf"
deletion_protection = false
labels = {
l1 = "hello_again"
}
external_data_configuration {
autodetect = false
source_format = "CSV"
source_uris = [
"gs://sh-unmanaged-tests-bucket/example/*"
]
max_bad_records = 0
ignore_unknown_values = false
csv_options {
field_delimiter = ","
skip_leading_rows = 1
quote = "\""
}
hive_partitioning_options {
mode = "CUSTOM"
source_uri_prefix = "gs://sh-unmanaged-tests-bucket/example/{day:DATE}"
require_partition_filter = false
}
schema = <<EOF
[
{
"name": "Id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "Name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "Weight",
"type": "INTEGER",
"mode": "REQUIRED"
}
]
EOF
}
}
provider "google" {
default_labels = {
l1 = "hello_again"
}
}
Both source_uris and source_uri_prefix seem correct. I don't create the table with Terraform in the script, but it should not matter. The table is created successfully, and I can query the underlying CSV data. The issue only happens when the table labels are updated (an in-place update).
Are you trying to use an example folder inside your bucket (sh-unmanaged-tests-bucket)?
@ggtisc yes. I create a bucket sh-unmanaged-tests-bucket, a folder example/day=2024-01-30, and a CSV document inside of it. Please see my script to reproduce. If you run it in GCP Cloud Shell, you end up with the mentioned error.
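For reference, the layout my script creates is roughly this, expressed in Terraform terms (the object name and CSV content here are illustrative, not the exact files from the script):

```hcl
# Sketch of the repro layout: the day=YYYY-MM-DD path segment matches the
# {day:DATE} source_uri_prefix used by the external table.
resource "google_storage_bucket" "tests_bucket" {
  name     = "sh-unmanaged-tests-bucket"
  location = "US" # assumption: the bucket location is not stated in the thread
}

resource "google_storage_bucket_object" "partition_csv" {
  bucket  = google_storage_bucket.tests_bucket.name
  name    = "example/day=2024-01-30/data.csv"
  content = "Id,Name,Weight\n1,sample,100\n"
}
```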
@rshamunov I saw it but it seems more like a naming error.
Example:
```hcl
source_uris = [
  "gs://${google_storage_bucket.bucket_19177.name}/example/*"
]

source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/example/{key1:STRING}"
```
@ggtisc, could you explain what exactly the "terraform registry format" is in this case? Do you want me to replace the actual bucket name with a Terraform variable? Why should it be different?
source_uris:

```
gs://${google_storage_bucket.bucket_19177.name}/example/*
gs://sh-unmanaged-tests-bucket/example/*
```

source_uri_prefix:

```
source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/example/{key1:STRING}"
source_uri_prefix = "gs://sh-unmanaged-tests-bucket/example/{day:DATE}"
```
Did you try to create the table, change a table label, and apply the configuration again?
Yes @rshamunov, and after testing many times everything is okay, without errors.
The Terraform Registry is the official documentation site for Terraform resources. Did you open the link to check the correct format? You need to open the link and follow the correct format in order to avoid this kind of error.
I also gave you an example of how these resources work correctly.
I've applied the configuration almost exactly as you provided it (changing only the bucket name and project name):
resource "google_bigquery_dataset" "bq_ds_19177" {
dataset_id = "bq_ds_19177"
friendly_name = "BQ DS 19177"
description = "something"
location = "EU"
default_table_expiration_ms = 3600000
project = "sh-unmanaged-tests"
labels = {
env = "something"
}
}
resource "google_storage_bucket" "bucket_19177" {
name = "bucket-19177-777"
location = "US"
}
resource "google_bigquery_table" "bq_table_19177" {
dataset_id = google_bigquery_dataset.bq_ds_19177.dataset_id
project = "sh-unmanaged-tests"
table_id = "bq-table-19177"
deletion_protection = false
labels = {
l1 = "hello_again"
}
external_data_configuration {
autodetect = false
source_format = "CSV"
source_uris = [
"gs://${google_storage_bucket.bucket_19177.name}/*"
]
max_bad_records = 0
ignore_unknown_values = false
csv_options {
field_delimiter = ","
skip_leading_rows = 1
quote = "\""
}
hive_partitioning_options {
mode = "CUSTOM"
source_uri_prefix = "gs://${google_storage_bucket.bucket_19177.name}/{key1:STRING}"
require_partition_filter = false
}
schema = <<EOF
[
{
"name": "Id",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "Name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "Weight",
"type": "INTEGER",
"mode": "REQUIRED"
}
]
EOF
}
}
As expected, it applied successfully. I changed the label l1 (from "hello" to "hello_again") and applied it again. It failed with the same error:
```
google_bigquery_table.bq_table_19177: Modifying... [id=projects/sh-unmanaged-tests/datasets/bq_ds_19177/tables/bq-table-19177]
╷
│ Error: googleapi: Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration. If the schema is present on both, the schemas must be the same., invalid
│
│   with google_bigquery_table.bq_table_19177,
│   on bq.tf line 19, in resource "google_bigquery_table" "bq_table_19177":
│   19: resource "google_bigquery_table" "bq_table_19177" {
│
```
Confirmed issue!
When google_bigquery_table.labels changes and terraform apply is run, it returns the shared error.
Thanks a lot for the script. Was able to repro. I saw that schema was present in both Table and ExternalDataConfiguration in the Update API request, but the value was different, thereby failing the validation. I forwarded the internal version of this issue to the API team to get more insight.
Hi @rshamunov, here is the guidance from our API team:
> So what happened here is that the table creation was done with schema specified on ExternalDataConfiguration, but as part of the table creation, BQ will store the schema in the main Table message. This causes the future update to fail.
>
> We recommend just using the schema in Table instead - this is the recommended path and avoids the previous problem.
Could you try specifying google_bigquery_table.schema instead of google_bigquery_table.external_data_configuration.schema and see if it resolves your issue?
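For clarity, that change might look something like this (a sketch based on your earlier config, not something I've run end to end):

```hcl
# Sketch: schema moved from external_data_configuration to the top level.
resource "google_bigquery_table" "test-hived-table" {
  dataset_id          = "example_dataset"
  project             = "sh-unmanaged-tests"
  table_id            = "test-hived-table-tf"
  deletion_protection = false

  # Top-level schema instead of external_data_configuration.schema.
  schema = <<EOF
[
  {"name": "Id",     "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "Name",   "type": "STRING",  "mode": "NULLABLE"},
  {"name": "Weight", "type": "INTEGER", "mode": "REQUIRED"}
]
EOF

  external_data_configuration {
    autodetect    = false
    source_format = "CSV"
    source_uris   = ["gs://sh-unmanaged-tests-bucket/example/*"]

    hive_partitioning_options {
      mode              = "CUSTOM"
      source_uri_prefix = "gs://sh-unmanaged-tests-bucket/example/{day:DATE}"
    }
  }
}
```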
I tried defining the schema outside of external_data_configuration, in google_bigquery_table.schema. But this brought another issue: every terraform apply command re-creates the table, despite there being no actual changes:
```
terraform apply

google_bigquery_table.test-hived-table: Refreshing state... [id=projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # google_bigquery_table.test-hived-table must be replaced
-/+ resource "google_bigquery_table" "test-hived-table" {
      ~ creation_time            = 1725528080087 -> (known after apply)
      ~ etag                     = "JVoggigE1FbxTTSsLimbJw==" -> (known after apply)
      ~ expiration_time          = 0 -> (known after apply)
      ~ id                       = "projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf" -> (known after apply)
      ~ last_modified_time       = 1725528080152 -> (known after apply)
      ~ location                 = "US" -> (known after apply)
      ~ num_bytes                = 0 -> (known after apply)
      ~ num_long_term_bytes      = 0 -> (known after apply)
      ~ num_rows                 = 0 -> (known after apply)
      - require_partition_filter = false -> null
      - resource_tags            = {} -> null
      ~ schema                   = jsonencode(
          ~ [
                # (2 unchanged elements hidden)
                {
                    mode = "REQUIRED"
                    name = "Weight"
                    type = "INTEGER"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "day"
                  - type = "DATE"
                },
            ] # forces replacement
        )
      ~ self_link                = "https://bigquery.googleapis.com/bigquery/v2/projects/sh-unmanaged-tests/datasets/example_dataset/tables/test-hived-table-tf" -> (known after apply)
      ~ type                     = "EXTERNAL" -> (known after apply)
        # (8 unchanged attributes hidden)

      ~ external_data_configuration {
          + schema = (known after apply)
            # (6 unchanged attributes hidden)

            # (2 unchanged blocks hidden)
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```
Moreover, the documentation for google_bigquery_table suggests using google_bigquery_table.external_data_configuration.schema:

> If you use external_data_configuration [documented below](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_table#nested_external_data_configuration) and do not set external_data_configuration.connection_id, schemas must be specified with external_data_configuration.schema. Otherwise, schemas must be specified with this top-level field.
Thanks for the detailed follow-up. I believe the permadiff issue is the same as in https://github.com/hashicorp/terraform-provider-google/issues/12465. The provider-side fix is in discussion but unfortunately won't be available soon. Meanwhile, the other issue has some workarounds that may unblock you for the time being.
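One pattern that often comes up for permadiffs like this (an assumption on my part; check the linked issue for the exact suggestions there) is ignoring the schema drift via lifecycle:

```hcl
# Sketch only: keep the rest of the table configuration as-is and tell
# Terraform to ignore the server-side schema change (the hive partition
# column "day" that BigQuery adds), so it stops planning a replacement.
resource "google_bigquery_table" "test-hived-table" {
  dataset_id          = "example_dataset"
  project             = "sh-unmanaged-tests"
  table_id            = "test-hived-table-tf"
  deletion_protection = false

  # ... schema and external_data_configuration as shown earlier ...

  lifecycle {
    ignore_changes = [schema]
  }
}
```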
For the resource documentation, I will check with our API team to see whether it is still accurate.
I received some guidance from the API team. The documentation is correct. In terms of the two issues seen here:

1. "Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration...." when the schema is defined at the external data config level, and
2. terraform apply attempting to re-create the table when the schema is defined at the table level,

I will look into potential fixes for both on the Terraform resource implementation side.
The PR that closed this GitHub issue should fix the following problem when released in a new google provider version (the version after the next, estimated to be v6.5.0):

"Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration...." when the schema is defined at the external data config level

Once you upgrade the provider to that version, please continue to specify the schema inside external_data_configuration as before. Feel free to update this issue if the problem persists or you see other issues.
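For reference, pinning to a release that includes the fix might look like this once it's out (the version constraint below is based on the estimate above, so treat it as a placeholder):

```hcl
# Placeholder version constraint: confirm which release actually ships
# the fix in the provider changelog before pinning.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 6.5.0"
    }
  }
}
```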
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Terraform Version & Provider Version(s)
Terraform v1.9.4 on linux_amd64
Affected Resource(s)
google_bigquery_table
Terraform Configuration
Debug Output
https://gist.github.com/rshamunov/8814cff7f2c6733061d92ca45ab36979#file-output-log
Expected Behavior
terraform apply completes an in-place update of the external table
Actual Behavior
Error: googleapi: Error 400: When defining a table with an ExternalDataConfiguration, a schema must be present on either the Table or the ExternalDataConfiguration. If the schema is present on both, the schemas must be the same., invalid
Steps to reproduce
script to reproduce
Important Factoids
No response
References
No response
b/362528330