@simonvanderveldt Would you mind providing more details and, if possible, a reproducible code sample? Any other info would also be really helpful for investigating the cause, thanks!
@HemangChothani Sure! I'll take some time tomorrow to create a reproducible case.
Thanks for the report @simonvanderveldt, I was able to reproduce the issue.
It can be fixed by providing an explicit schema:
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.table.TimePartitioning(field="execution_date"),
    schema=[
        bigquery.SchemaField(name="id", field_type="STRING"),
        bigquery.SchemaField(name="status", field_type="STRING"),
        bigquery.SchemaField(name="created_at", field_type="TIMESTAMP"),
        bigquery.SchemaField(name="execution_date", field_type="TIMESTAMP"),
    ],
)
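For completeness, here is a minimal sketch of how that config could be passed to the load call; the client setup, the dataframe contents, and the table ID below are made-up placeholders, not taken from the original report:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination table, only for illustration.
table_ref = bigquery.TableReference.from_string("my-project.my_dataset.my_table")

# A tiny dataframe that mimics the columns from the report.
df = pd.DataFrame(
    {
        "id": ["a", "b"],
        "status": ["queued", "done"],
        "created_at": pd.to_datetime(["2019-09-01", "2019-09-02"], utc=True),
        "execution_date": pd.to_datetime(["2019-09-01", "2019-09-02"], utc=True),
    }
)

# job_config is the explicit-schema config from the snippet above.
load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to finish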
FWIW, autodetecting schema for new tables has recently been scheduled for deprecation, since it turned out that autodetection is unreliable in too many cases. One can see the pending deprecation warning by switching on these warnings at the top of the script (they are disabled in Python < 3.7):
import warnings
warnings.simplefilter("always", category=PendingDeprecationWarning)
warnings.simplefilter("always", category=DeprecationWarning)
This reveals the following in the output, just prior to the exception traceback:
/home/peter/workspace/google-cloud-python/bigquery/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'id'.
warnings.warn(u"Unable to determine type of column '{}'.".format(column))
/home/peter/workspace/google-cloud-python/reproduce/reproduce_9206.py:52: PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.
load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
@simonvanderveldt Just checking, did an explicit schema solve the issue for you? Or is it necessary to investigate this further? Thanks.
@plamut Sorry, I've been super busy with other stuff, that's why I didn't get to this yet.
Using an explicit schema won't solve the issue for us, since we're calling load_table_from_dataframe() from a generic component that is used by our users. The users define the schema for the input data themselves, so the component needs to stay generic and we don't want to (and can't) provide a schema in the code. We're using parquet files as input, which already contain all the required schema information, and that information carries over to the dataframe when the parquet file is read. Since a dataframe already has a schema, which kind of "autodetecting schema for new tables" do you mean exactly? The mapping from dataframe/pandas/numpy types to BigQuery types?
P.S. We've switched the problematic (datetime) columns to datetime64[ns, UTC] after looking at https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L54 which does get correctly mapped to TIMESTAMP instead of integer.
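For reference, this is the kind of dtype switch described above, sketched with a made-up column rather than the actual production data:

import pandas as pd

df = pd.DataFrame({"execution_date": pd.to_datetime(["2019-09-01", "2019-09-02"])})
print(df.dtypes)  # execution_date    datetime64[ns]

# Localizing the naive datetimes to UTC changes the dtype to datetime64[ns, UTC],
# which the linked mapping translates to TIMESTAMP instead of an integer type.
df["execution_date"] = df["execution_date"].dt.tz_localize("UTC")
print(df.dtypes)  # execution_date    datetime64[ns, UTC]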
@simonvanderveldt Are there any object dtype columns (e.g. for strings or NULL values) in the DataFrame you're providing? If so, the mapping you linked to (https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L55) does not get used. Instead, the pandas to_parquet method is used, which we have less control over (which is why that path has been deprecated).
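A quick way to spot such columns is a generic pandas check (a sketch only, the column names are invented):

import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", None],  # strings with a NULL end up as dtype "object"
        "created_at": pd.to_datetime(["2019-09-01", "2019-09-02"]),
    }
)
print(list(df.select_dtypes(include=["object"]).columns))  # ['id']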
@plamut I wonder if we can avoid to_parquet altogether if pyarrow is installed? This is the problematic line:
https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75//bigquery/google/cloud/bigquery/_pandas_helpers.py#L276
Rather than return None when any column's type can't be determined, maybe we explicitly set the SchemaField's type to None and fall back to pyarrow's type conversions when we convert from pandas Series to pyarrow Array?
https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L187
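To illustrate the idea, pyarrow can infer a per-column type on its own when converting a Series; this is a standalone sketch, not the actual _pandas_helpers code:

import pandas as pd
import pyarrow as pa

series = pd.Series(["queued", "done", None], dtype="object")

# Even though the pandas dtype is just "object", pyarrow inspects the values
# and infers a concrete Arrow type, which could then be mapped to a BigQuery type.
arrow_array = pa.Array.from_pandas(series)
print(arrow_array.type)  # string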
@tswast Yeah, we have two columns that are of dtype object (a list of all columns and their dtypes is in the description at the top), so it seems like you're right and that might be what's happening.
@tswast I am experimenting with that, but there are quite a few moving parts.
I did notice that pyarrow.Table types are correct (in a limited test case), and it might work to use that to fill in the missing types (if we send null values, the backend complains).
Update: Seems like this could work - pyarrow correctly detects strings and dates (did not test other types yet), and this info can be used to set a correct field_type on every SchemaField() instance.
The devil is probably in the details, though, depending on how reliable pyarrow's detection is. But it seems that it would still be an improvement over the current state.
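Roughly the approach being described, sketched outside the library; the arrow-to-BigQuery lookup below is a deliberately simplified assumption, not the real mapping used by _pandas_helpers.py:

import pandas as pd
import pyarrow as pa
from google.cloud import bigquery

# Simplified stand-in for a mapping from Arrow type names to BigQuery types.
ARROW_TO_BQ = {"string": "STRING", "date32[day]": "DATE", "timestamp[us, tz=UTC]": "TIMESTAMP"}

df = pd.DataFrame({"id": ["a", "b"], "status": [None, "done"]})
arrow_table = pa.Table.from_pandas(df)

schema = [
    bigquery.SchemaField(name=field.name, field_type=ARROW_TO_BQ.get(str(field.type), "STRING"))
    for field in arrow_table.schema
    if field.name in df.columns  # skip any index column pyarrow adds
]
print(schema)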
Versions:
We were initially using 1.18.0 but I noticed #9044 was included in 1.19.0 so we tried that as well, but it made no difference.
We're using a pandas dataframe read from parquet, example data would be:
[example data omitted]
df.dtypes shows:
[dtype listing omitted]
When trying to load this into BigQuery using load_table_from_dataframe() and setting the job config's time_partitioning to bigquery.table.TimePartitioning(field="execution_date") we get the following error:
[error output omitted]
This doesn't really make sense, since the field is clearly a datetime64. The job config shown in the console looks correct (i.e. it's set to partitioned by day and it's using the correct field).
edit: It seems the cause for this is that dataframe columns of type datetime64 are being converted to type INTEGER instead of DATE (or TIMESTAMP? I'm not sure which one would be the correct type in BigQuery).
edit2: Could it be this mapping is wrong and it should be DATE instead of DATETIME? https://github.com/googleapis/google-cloud-python/blob/dce13268e9bdf2df6a3c7a2df6db3b8a1bb51f75/bigquery/google/cloud/bigquery/_pandas_helpers.py#L55
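For anyone landing on this later, a small pandas-only sketch of the dtype distinction discussed throughout the thread (the actual mapping lives in the _pandas_helpers.py line linked above; per the thread, naive datetimes map to DATETIME and tz-aware ones to TIMESTAMP):

import pandas as pd

df = pd.DataFrame(
    {
        # Naive datetimes: dtype datetime64[ns].
        "created_at": pd.to_datetime(["2019-09-01 12:00:00", "2019-09-02 12:00:00"]),
        # Timezone-aware datetimes: dtype datetime64[ns, UTC].
        "execution_date": pd.to_datetime(["2019-09-01 12:00:00", "2019-09-02 12:00:00"], utc=True),
    }
)
print(df.dtypes)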