googleapis / google-cloud-python

Google Cloud Client Library for Python
https://googleapis.github.io/google-cloud-python/
Apache License 2.0

BigQuery: load_table_from_dataframe fails on datetime64 column used for partitioning, saying it's an INTEGER #9206

Closed simonvanderveldt closed 5 years ago

simonvanderveldt commented 5 years ago

Versions:

google-cloud-bigquery==1.19.0

We were initially using 1.18.0 but I noticed #9044 was included in 1.19.0 so we tried that as well, but it made no difference.

We're using a pandas dataframe read from parquet, example data would be

id status created_at execution_date
0 1 NEW 2018-09-12 09:24:31.291 2019-05-10 17:40:00
1 2 NEW 2018-09-12 09:26:45.890 2019-05-10 17:40:00
[628 rows x 4 columns]

df.dtypes shows:

id object
status object
created_at datetime64[ns]
dataplatform_execution_date datetime64[ns]
dtype: object

When trying to load this into BigQuery using load_table_from_dataframe() and setting the job config's time_partitioning to bigquery.table.TimePartitioning(field="execution_date") we get the following error:

The field specified for time partitioning can only be of type TIMESTAMP or DATE. The type found is: INTEGER.

Which doesn't really make sense, since the field is clearly a datetime64. The job config shown in the console looks correct (i.e. it's set to partition by day and it's using the correct field).

edit: It seems the cause of this is that DataFrame columns of type datetime64 are being converted to type INTEGER instead of DATE (or TIMESTAMP? I'm not sure which one would be the correct type in BigQuery).

edit2: Could it be that this mapping is wrong and it should be DATE instead of DATETIME? https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L55
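
The call is roughly the following (a sketch with dummy data; the dataset and table names are placeholders):

import pandas as pd
from google.cloud import bigquery

df = pd.DataFrame(
    {
        "id": ["1", "2"],
        "status": ["NEW", "NEW"],
        "created_at": pd.to_datetime(["2018-09-12 09:24:31.291", "2018-09-12 09:26:45.890"]),
        "execution_date": pd.to_datetime(["2019-05-10 17:40:00", "2019-05-10 17:40:00"]),
    }
)

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # placeholder names
job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
)
load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()
# fails with: The field specified for time partitioning can only be of type
# TIMESTAMP or DATE. The type found is: INTEGER.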

HemangChothani commented 5 years ago

@simonvanderveldt Would you mind providing more details and, if possible, a reproducible code sample? Any other info would also be really helpful for investigating the cause, thanks!

simonvanderveldt commented 5 years ago

@HemangChothani Sure! I'll take some time tomorrow to create a reproducible case.

plamut commented 5 years ago

Thanks for the report @simonvanderveldt, I was able to reproduce the issue.

It can be fixed by providing an explicit schema:

job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
    schema=[
        bigquery.SchemaField(name="id", field_type="STRING"),
        bigquery.SchemaField(name="status", field_type="STRING"),
        bigquery.SchemaField(name="created_at", field_type="TIMESTAMP"),
        bigquery.SchemaField(name="execution_date", field_type="TIMESTAMP"),
    ]
)
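
and then loading as usual (a sketch; df, client and table_ref are whatever you already have):

load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to finish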

FWIW, autodetecting the schema for new tables has recently been scheduled for deprecation, since it turned out that autodetection is unreliable in too many cases. One can see the pending deprecation warning by switching on these warnings at the top of the script (they are disabled by default in Python < 3.7):

import warnings
warnings.simplefilter("always", category=PendingDeprecationWarning) 
warnings.simplefilter("always", category=DeprecationWarning) 

This reveals the following in the output, just prior to the exception traceback:

/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'id'.
  warnings.warn(u"Unable to determine type of column '{}'.".format(column))
/home/peter/workspace/google-cloud-python/reproduce/reproduce_9206.py:52: PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.
  load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)

plamut commented 5 years ago

@simonvanderveldt Just checking, did an explicit schema solve the issue for you? Or is it necessary to investigate this further? Thanks.

simonvanderveldt commented 5 years ago

@plamut sorry, super busy with other stuff that's why I didn't get to this yet.

Using an explicit schema won't solve the issue for us since we're using load_table_from_dataframe() in a generic component that is used by our users. The users define the schema for the input data themselves so the component needs to be generic.

So we don't want to (and can't) provide a schema in the code. We're using Parquet files as input, which already contain all the required schema information, and after reading the Parquet file that schema carries over to the dataframe as well. Since a dataframe already has a schema, which kind of "autodetecting schema for new tables" do you mean exactly? The mapping from dataframe/pandas/numpy types to BigQuery types?

P.S. We've switched the problematic (datetime) columns to datetime64[ns, UTC] after looking at https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L54, which does get correctly mapped to TIMESTAMP instead of INTEGER.
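
Roughly what that conversion looks like (a sketch, assuming the timestamps are naive UTC and using the column names from the example at the top):

df["created_at"] = df["created_at"].dt.tz_localize("UTC")
df["execution_date"] = df["execution_date"].dt.tz_localize("UTC")
# df.dtypes now shows datetime64[ns, UTC] for both columns,
# which gets mapped to TIMESTAMP instead of INTEGER.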

tswast commented 5 years ago

@simonvanderveldt Are there any object dtype columns (e.g. for strings or NULL values) in the DataFrame you're providing? If so, the mapping you linked to (https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L55) does not get used. Instead, the pandas to_parquet method is used, which we have less control over (which is why we have deprecated that path).

@plamut I wonder if we can avoid to_parquet altogether when pyarrow is installed? This is the problematic line: https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75//bigquery/google/cloud/bigquery/_pandas_helpers.py#L276 Rather than returning None when any column's type can't be determined, maybe we could explicitly set the SchemaField's type to None and fall back to pyarrow's type conversions when we convert from pandas Series to pyarrow Array? https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L187
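
Roughly what I have in mind, as a sketch (the helper name and the partial mapping are made up for illustration, not actual library code):

import pyarrow as pa

def infer_bigquery_type(series):
    """Fall back to pyarrow's type detection for a single pandas Series."""
    arrow_type = pa.Array.from_pandas(series).type
    if pa.types.is_string(arrow_type):
        return "STRING"
    if pa.types.is_timestamp(arrow_type):
        return "TIMESTAMP"
    if pa.types.is_date(arrow_type):
        return "DATE"
    if pa.types.is_integer(arrow_type):
        return "INTEGER"
    if pa.types.is_floating(arrow_type):
        return "FLOAT"
    if pa.types.is_boolean(arrow_type):
        return "BOOLEAN"
    return None  # still unknown, leave the SchemaField type unset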

simonvanderveldt commented 5 years ago

@tswast Yeah, we have two columns that are of dtype object (a list of all columns and their dtypes is in the description at the top), so it seems like you're right and that might be what's happening.

plamut commented 5 years ago

@tswast I am experimenting with that, but there are quite a few moving parts.

I did notice that pyarrow.Table types are correct (in a limited test case), and it might work to use that to fill in the missing types (if we send null values, the backend complains).

Update: Seems like this could work - pyarrow correctly detects strings and dates (did not test other types yet), and this info can be used to set a correct field_type on every SchemaField() instance.

The devil is probably in the detail, though, depending on how reliable pyarrow's detection is. But it seems it would still be an improvement over the current state.
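
For illustration, the kind of check I ran (a sketch, not the eventual patch):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "id": ["1", "2"],  # object dtype in pandas
        "execution_date": pd.to_datetime(["2019-05-10 17:40:00", "2019-05-10 17:40:00"]),
    }
)

print(pa.Table.from_pandas(df, preserve_index=False).schema)
# id: string
# execution_date: timestamp[ns]
# -> these arrow types can be translated to STRING / TIMESTAMP to fill in the
#    SchemaField types that the dtype-based detection could not determine.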