googleapis / python-bigquery

Apache License 2.0

Default expiration date is added to a created table with a `TimePartitioning` even when `expiration_ms` is explicitly `None` #741

Open Luttik opened 3 years ago

Luttik commented 3 years ago

Environment details

Steps to reproduce / Code example

When running the following code, a table is created whose Partition Expiration is silently set to the default expiration defined on the dataset.

from google.cloud.bigquery import (
    Table,
    SchemaField,
    TimePartitioning,
    TimePartitioningType,
)

table = Table(
    "project.dataset.test_table",
    schema=[
        SchemaField("test", "STRING", "NULLABLE"),
        SchemaField("date", "TIMESTAMP"),
    ]
)
table.time_partitioning = TimePartitioning(
    TimePartitioningType.DAY, field="date", expiration_ms=None,
)

# bq being an already configured instance of google.cloud.bigquery.Client
bq.create_table(table)

To start, it is good to note that the way to fully remove an expiration date (not merely reset it to the default) from a partition via standard SQL is as follows:

ALTER TABLE ropo.orders 
SET OPTIONS (partition_expiration_days=NULL)

So even though the code appears to explicitly set `expiration_ms` to "no expiration date", the table's partitions will still be silently removed after some time x (depending on the dataset). The field `expiration_ms` only accepts `None` and positive integers, so when the dataset defines a default, there is no way to create a table with a `TimePartitioning` but without any partition expiration at all. Values like `0`, `null`, and `NULL` all throw an error.

I would advise changing the behavior as follows:

Introduce a sentinel constant, e.g. named DEFAULT_PARTITION_EXPIRATION_FROM_DATASET, and make it the default value of `expiration_ms`, so that the current behavior is clear from the function definition. Don't make the sentinel's value None, so that None remains available to explicitly turn the expiration functionality off.
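A rough sketch of what that proposal could look like. The sentinel name and the simplified class below are illustrative only, not the actual google-cloud-bigquery implementation:

```python
# Hypothetical sketch of the proposed sentinel-default pattern.
# DEFAULT_PARTITION_EXPIRATION_FROM_DATASET is an illustrative name,
# not part of google-cloud-bigquery.
DEFAULT_PARTITION_EXPIRATION_FROM_DATASET = object()  # unique sentinel


class TimePartitioningSketch:
    def __init__(self, type_, field=None,
                 expiration_ms=DEFAULT_PARTITION_EXPIRATION_FROM_DATASET):
        self._properties = {"type": type_}
        if field is not None:
            self._properties["field"] = field
        if expiration_ms is DEFAULT_PARTITION_EXPIRATION_FROM_DATASET:
            # Omit the key entirely: the backend applies the dataset default.
            pass
        else:
            # An explicit None would serialize to JSON null ("no expiration");
            # a positive integer sets a concrete expiration in milliseconds.
            self._properties["expirationMs"] = expiration_ms


inherit_default = TimePartitioningSketch("DAY", field="date")
no_expiration = TimePartitioningSketch("DAY", field="date", expiration_ms=None)
```

With a sentinel default, omitting the argument and passing `None` become distinguishable, which is exactly the distinction the current API cannot express.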

Luttik commented 3 years ago

Whoops, don't mind the draft.

plamut commented 3 years ago

@Luttik Thanks for the report, I was able to reproduce the behavior.

Passing in None is interpreted as not specifying expiration at all, but the code should distinguish an explicit None from omitting the expiration period altogether (in which case the default is used). I'll check why this is the case.

plamut commented 3 years ago

Update: Partition expiration time is sent through the expirationMs field in the JSON request body, but the backend seems to treat explicit null values the same as if the field is omitted altogether. In both cases the new table inherits expiration time from the dataset, unfortunately.

Now, the client library should still make the distinction between None and "omitted", but the backend will likely need an update as well.
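For illustration, the two request bodies differ on the wire (field names follow the BigQuery REST representation of a table's `timePartitioning`), even though, per the above, the backend currently treats them the same:

```python
import json

# Omitting expirationMs vs. sending an explicit JSON null -- the client can
# express the difference on the wire, but the backend currently ignores it.
omitted = {"timePartitioning": {"type": "DAY", "field": "date"}}
explicit_null = {"timePartitioning": {"type": "DAY", "field": "date",
                                      "expirationMs": None}}

print(json.dumps(omitted))
print(json.dumps(explicit_null))
```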

@Luttik As a workaround, you could set the expiration time to an insanely high value. I tried with sys.maxsize and the backend accepted it, meaning that the table will expire in ~292 million years. I suppose that's good enough for practical purposes? :laughing:

>>> sys.maxsize / 1000 / 86400 / 365 / 1e6
292.471208677536

@shollyman How difficult would it be to change this behavior? And how likely would that be to break users' apps?

plamut commented 3 years ago

Hmm... I might have typed too soon. Setting the partition expiration date actually seems to work as expected (even if None), but it's the table expiration date that is inherited from the dataset on creation, even if table.expires is explicitly set to None by the client.

Setting table.expires to datetime.max should be a decent workaround for now, though.

import datetime

table = Table(
    "project.dataset.test_table",
    schema=[...],
)
table.expires = datetime.datetime.max  # <--- workaround!
table.time_partitioning = TimePartitioning(
    TimePartitioningType.DAY, field="date", expiration_ms=None,
)

tswast commented 3 years ago

There is a backend bug 181299401 for the ability to explicitly unset table expiration at creation time. Setting to max datetime does look like a decent workaround.

tswast commented 3 years ago

Related public issue: https://issuetracker.google.com/180993856

meredithslota commented 2 years ago

@tswast The related internal issue b/181299401 was closed as a "Won't Fix" with an identified workaround. Will that workaround work here or should we reopen the internal issue?

tswast commented 2 years ago

@meredithslota I was rather disappointed with the workaround, which was to call update_table after calling create_table. I think it's worth opening a new internal issue specifically for this feature.

tswast commented 2 years ago

Likely this will require coordination between the backend and client teams, as it sounds like the backend cannot detect an explicit "null" on any API except "patch" (i.e. `update_table`).

meredithslota commented 2 years ago

I re-opened the same internal issue but converted it to a feature request.

stephaniewang526 commented 2 years ago

Created b/230739872 to track the backend work necessary to enable this new feature.