googleapis / python-bigquery


extract_table() yields 403 error on GCS bucket with retention policy #1913

Closed · kansuke-at-trimble closed this 2 months ago

kansuke-at-trimble commented 2 months ago


Hello,

I have an issue with the BigQuery client's extract_table() method, and I'm having a hard time understanding what is going on. We have this code:

import os

from google.cloud.bigquery import job as bq_job

# `client` is a google.cloud.bigquery.Client; `dataset`, `tmp_table_ref`
# and `log` are defined elsewhere in our code.
try:
    output_uri = os.path.join(dataset.uri, "*.parquet")

    extract_job_config = bq_job.ExtractJobConfig()
    extract_job_config.destination_format = bq_job.DestinationFormat.PARQUET
    extract_job_config.compression = bq_job.Compression.SNAPPY
    extract_job = client.extract_table(
        tmp_table_ref, output_uri, job_config=extract_job_config
    )
    extract_job.result()

    log.info(
        "Exported %s.%s to %s",
        tmp_table_ref.dataset_id,
        tmp_table_ref.table_id,
        output_uri,
    )
finally:
    client.delete_table(tmp_table_ref)

and when it's executed, the line extract_job.result() raises the following error (sensitive information has been redacted):

google.api_core.exceptions.Forbidden: 403 Access Denied: BigQuery BigQuery: Permission denied while writing data. 
Object 'bucket-a/train_dataset/000000000124.parquet' is subject to bucket's retention policy or object retention and cannot be deleted or overwritten until 2024-07-07T21:11:56.139071-07:00
Please make sure gs://bucket-a/train_dataset/000000000124.parquet is accessible via appropriate IAM roles, e.g. Storage Object Viewer or Storage Object Creator.

The bucket bucket-a has a retention policy enabled, and the service account executing this task has both the Storage Object Viewer and Storage Object Creator roles. I can also see that bucket-a/train_dataset contains parquet files, presumably created by the extract job. The part I don't understand is that, according to the error message, extract_job.result() apparently tries to delete or overwrite an object, but I cannot find the code that actually does that.
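For reference, here is a minimal sketch (not part of our pipeline code, and assuming the redacted bucket name from the error above) of how the bucket's retention settings can be checked with the google-cloud-storage client:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucket-a")

# retention_period is in seconds; None means the bucket has no retention policy.
print("retention period (s):", bucket.retention_period)
print("policy locked:", bucket.retention_policy_locked)
print("effective time:", bucket.retention_policy_effective_time)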

I would appreciate any help and comments on this issue. Thanks in advance!

Linchin commented 2 months ago

Thank you @kansuke-at-trimble for raising the issue! From what I can see, this seems to be working as intended. I suppose 'bucket-a/train_dataset/000000000124.parquet' already exists, and the retention policy means that

objects in the bucket can only be deleted or replaced once their age is greater than the retention period[1].

So basically the file cannot be overwritten during the retention period. A way to bypass it is to use a unique file name, for example one containing a UUID or the timestamp at which it is generated, as in the sketch below.
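Something along these lines (a rough sketch only; dataset.uri, client, tmp_table_ref and extract_job_config are the names from your snippet, and the run-id prefix is just an example):

import os
import uuid
from datetime import datetime, timezone

# Write each export under a unique prefix so the extract job never needs to
# overwrite an object that is still within the bucket's retention period.
run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ") + "-" + uuid.uuid4().hex[:8]
output_uri = os.path.join(dataset.uri, run_id, "*.parquet")

extract_job = client.extract_table(
    tmp_table_ref, output_uri, job_config=extract_job_config
)
extract_job.result()  # wait for the backend to finish the export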

Also, the client doesn't do the extract job itself - it sends the command to the backend, and the backend does the work. That's why there's no code in the client library doing the extraction. Hope this answers your question, and please let us know if you have any further questions :)
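To illustrate (a sketch using the same names as your snippet):

# extract_table() only *submits* the extract job to the BigQuery backend;
# it returns an ExtractJob handle almost immediately.
extract_job = client.extract_table(
    tmp_table_ref, output_uri, job_config=extract_job_config
)
print(extract_job.job_id, extract_job.state)  # e.g. PENDING or RUNNING

# result() does no extraction itself: it polls the backend until the job
# finishes and re-raises any error the backend reports (like the 403 above).
extract_job.result()
print(extract_job.state)  # DONE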

kansuke-at-trimble commented 2 months ago

Hi @Linchin, thanks for the help! It turns out that client.extract_table had already created the parquet files, and the extract_job.result() call tried to do the exact same thing, which triggered the retention policy error. I removed extract_job.result() since it seems to be unnecessary, and now the job runs correctly. Thanks again; since it's solved on my end, I am closing this issue.