goccy / bigquery-emulator

BigQuery emulator server implemented in Go
MIT License

Error querying a nested struct using UNNEST #308

Open. KevinRizzoTO opened this issue 7 months ago

KevinRizzoTO commented 7 months ago

What happened?

I've been using this project locally to mock data from the Google Analytics BigQuery export schema. I've noticed one issue when using the following query:

bq --api http://0.0.0.0:9050 --project_id=test-project query "SELECT e.value.int_value FROM test-project.analytics_0000001.events_20240424 AS tp, UNNEST(tp.event_params) as e"
BigQuery error in query operation: Error processing job 'test-project:bqjob_r49c9de07d8bfce5_0000018f1ad943c3_1': failed to scan rows: failed to convert struct from array &{[0xc001a80e80 0xc001a80ec0 0xc001a80f00 0xc001a80f40]}

The problem seems to stem from how the event_params column is set up: it's an array of nested structs, outlined below in the Python example script.

What did you expect to happen?

Expected the query to return a value instead of an error.

How can we reproduce it (as minimally and precisely as possible)?

Use the following Python script to reproduce (assuming you have a project called test-project and a dataset called analytics_0000001 already set up):

from google.cloud import bigquery
from google.api_core.client_options import ClientOptions

client = bigquery.Client(
    project="test-project",
    client_options=ClientOptions(api_endpoint="http://localhost:9050"),
)

table_id = "test-project.analytics_0000001.events_1"
table = bigquery.Table(table_id)
table.schema = [
    bigquery.SchemaField(
        "event_params",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("key", "STRING", mode="NULLABLE"),
            bigquery.SchemaField(
                "value",
                "RECORD",
                mode="NULLABLE",
                fields=[
                    bigquery.SchemaField(
                        "string_value", "STRING", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "int_value", "INTEGER", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "float_value", "FLOAT", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "double_value", "FLOAT", mode="NULLABLE"
                    ),
                ],
            ),
        ],
    ),
]
client.create_table(table, exists_ok=True)

# insert test row
rows_to_insert = [
    {
        "event_params": [
            {
                "key": "param1",
                "value": {"string_value": "value1"},
            },
            {
                "key": "param2",
                "value": {"int_value": 123},
            },
        ],
    }
]
errors = client.insert_rows_json(table_id, rows_to_insert)
assert not errors, errors  # surface any per-row insert errors

# query for row using unnest
query = f"""
SELECT
    event_param.key AS param_key,
    event_param.value.string_value AS param_value_string,
    event_param.value.int_value AS param_value_int,
FROM
    `{table_id}`,
    UNNEST(event_params) AS event_param
"""
query_job = client.query(query)
results = query_job.result() # error occurs here
for row in results:
    print(row)

Anything else we need to know?

Switching the query above to the following no longer produces an error:

# query for row using unnest
query = f"""
SELECT
    event_param.key AS param_key,
    event_param.value AS param_value
FROM
    `{table_id}`,
    UNNEST(event_params) AS event_param
"""
query_job = client.query(query)
results = query_job.result() # no error this time
for row in results:
    print(row)

Something about selecting the extra nested field (event_param.value.string_value) triggers the error.
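
As a stopgap, the nested fields can be unpacked client-side after the struct-level query above. A sketch of that workaround (assuming the google-cloud-bigquery client, which returns RECORD values as Python dicts):

# sketch of a client-side workaround: select the whole value struct
# (which the emulator handles) and unpack the nested fields in Python
for row in results:
    value = row["param_value"] or {}
    print(row["param_key"], value.get("string_value"), value.get("int_value"))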

KevinRizzoTO commented 6 months ago

I managed to trace the error log to the go-zetasqlite library. For some reason, the struct in the event_param.value record is being stored as an array, but converting that array back into a struct was never supported when scanning rows. I've put together a fork with a change here that unblocks my current workflow. I'm not sure whether this is the correct fix, or whether the real issue is how this server passes data to the zetasqlite driver.
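
To picture the mismatch: the struct's fields seem to be stored positionally, as an array of values without names, so rebuilding the struct means zipping the schema's field names back onto the positions. A minimal sketch of that idea (purely illustrative; the names and shapes below are hypothetical, not go-zetasqlite's actual representation):

# purely illustrative: a STRUCT stored positionally (values only, no
# field names) can be rebuilt by zipping the schema's field names
# back onto the positions; go-zetasqlite's real representation differs
field_names = ["string_value", "int_value", "float_value", "double_value"]
stored = ["value1", None, None, None]  # hypothetical positional form
rebuilt = dict(zip(field_names, stored))
print(rebuilt)  # {'string_value': 'value1', 'int_value': None, ...}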