Closed AETDDraper closed 4 years ago
@AETDDraper Thanks for the question.
This sounds like an issue with automatic type conversion when loading data from a dataframe into a new table. When loading the data, does your app provide an explicit schema for the "problematic" columns?
Something along the following lines:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(name="column_foo", field_type="STRING"),
        bigquery.SchemaField(name="column_bar", field_type="DATE"),
        ...
    ]
)
load_job = client.load_table_from_dataframe(my_dataframe, my_table_ref, job_config=job_config)
FWIW, the provided schema does not need to list all columns, only those whose type cannot be automatically deduced. For such columns the output can contain warning messages such as the following:
UserWarning: Unable to determine type of column 'foobar'.
Please let us know if an explicit schema solves the issue.
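(For illustration, a pandas-only sketch of why date columns often need this help; the column name below is made up. Date columns read from a CSV typically arrive as generic object dtype, which gives the client nothing to map to a BigQuery type until they are parsed explicitly.)

```python
import pandas as pd

# Dates loaded from CSV usually come in as plain strings (object dtype),
# which is ambiguous from the client's point of view.
df = pd.DataFrame({"column_bar": ["2019-09-23", "2019-09-24"]})
print(df["column_bar"].dtype)  # object

# After explicit parsing, the column carries a concrete datetime dtype.
df["column_bar"] = pd.to_datetime(df["column_bar"])
print(df["column_bar"].dtype)  # datetime64[ns]
```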
This is the error I am now getting.
Traceback (most recent call last):
File "/var/cache/tomcat/temp/interpreter-302728601229892001.tmp", line 89, in
Which BigQuery client version is this? It might be that your version does not support partial schemas, as support for this was only added recently. Specifying a full schema could help in this case.
BTW, would it be possible to provide a self-contained reproducible example of the code? That would be super useful, as right now I can only make educated guesses about the most probable cause.
Thanks in advance!
Peter, are you a Google employee? If yes, then my work has a DSA with you.
my version is 1.17.0
Not an employee, I'm a contractor currently working on the Python client libraries.
Upgraded the pip install to version 1.2. Still not working: I have tried formatting with pd.to_datetime (still going to an integer), and tried it without as well, with a TIMESTAMP.
Could I share a CSV example with you with the dates?
@AETDDraper A reproducible code example would be even better, but if not possible, a sample data would also be useful. Thanks!
Can I send via personal email?
Then I can share both?
Thanks
David
@AETDDraper Is the data sensitive in nature, such that it should not be attached to this GitHub issue or shared via a GitHub gist? A handful of rows should suffice, since it's the columns and their types that matter.
If email is preferred, my address is "inbox AT MynameMysurname dotcom".
It's personal information, so that would conflict. I have actually got around it by converting to a millisecond timestamp and then converting back in BQ, but I'm surprised it's doing this.
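A sketch of that workaround, assuming a made-up column name (the pandas side stores epoch milliseconds; on the BigQuery side, the standard SQL function TIMESTAMP_MILLIS can turn the integer back into a timestamp):

```python
import pandas as pd

# Hypothetical date column in UK day-first format.
df = pd.DataFrame({"dob": ["01/09/1938 00:00:00"]})
df["dob"] = pd.to_datetime(df["dob"], dayfirst=True)

# datetime64[ns] values are nanoseconds since the Unix epoch;
# integer-divide to get milliseconds (negative for pre-1970 dates).
df["dob_ms"] = df["dob"].astype("int64") // 10**6
```

The integer column then loads without any type ambiguity, and SELECT TIMESTAMP_MILLIS(dob_ms) recovers the timestamp in BigQuery.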
Good to hear that a workaround exists!
It would still be beneficial to other users if this could be addressed in the library itself, so when the sample data is available, I'll happily take a look at it.
What's the best way to share with you?
Thanks
@AETDDraper If sharing publicly is not a desired option, then an email would work. My address is mentioned in one of the earlier comments.
Here is a sample CSV of the dates formatting incorrectly.
The pandas code used to transform the columns from STRING to DATE (but then going into an INT, i.e. a TIMESTAMP in millis) was:
df['End_Date_Absence_M'] = pd.to_datetime(df['End Date:Absence'], infer_datetime_format=True)
df['Start_Date_Absence_M'] = pd.to_datetime(df['Start Date:Absence'], infer_datetime_format=True)
@AETDDraper There seems to be something missing; GitHub might have removed the .csv attachment, as it is not a supported file suffix. Will send you an email directly.
I was able to successfully load the data from a dataframe into a fresh table by providing an explicit schema for the dataframe columns and using the above transformation of the date columns (changing them in place instead of adding them to the dataframe as new columns).
Posting a sample data file and a runnable Python script. Versions used:
Python 3.6
pandas 0.25.1
bigquery 1.20.0
The extra column row_order is optional; I only added it to the script to make it easier to compare the resulting BigQuery table data with the source data.
@AETDDraper Does that help with resolving the reported issue?
Thanks Peter!
Sample data file: https://github.com/googleapis/google-cloud-python/files/3675933/sample_data_issue_9255.txt
Sample script:

from google.cloud import bigquery
import pandas

PROJECT = "my-project"
DATASET = "my-dataset"
TABLE = "my-table"
CSV_FILENAME = "/path/to/file.csv"

client = bigquery.Client(PROJECT)
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)

# start from scratch
print("Deleting table if exists...")
client.delete_table(table_ref, not_found_ok=True)

# load CSV data to dataframe and transform columns from string to date
df = pandas.read_csv(CSV_FILENAME)
df["Start_Date_Absence"] = pandas.to_datetime(
    df["Start_Date_Absence"], infer_datetime_format=True
)
df["End_Date_Absence"] = pandas.to_datetime(
    df["End_Date_Absence"], infer_datetime_format=True
)
df["row_order"] = df.index  # optional

# configure the load job and run it
schema = [
    bigquery.SchemaField("Start_Date_Absence", "DATE", mode="NULLABLE"),
    bigquery.SchemaField("End_Date_Absence", "DATE", mode="NULLABLE"),
    bigquery.SchemaField("row_order", "INTEGER", mode="NULLABLE"),
]
job_config = bigquery.job.LoadJobConfig(schema=schema)

load_job = client.load_table_from_dataframe(
    dataframe=df, destination=table_ref, job_config=job_config
)
load_job.result()

The extra column row_order is optional; it was only added to the script to make it easier to compare the resulting BigQuery table data with the source data.
Closing the issue now, as the question has been answered, and no new issues have been reported.
So I am running a load job, and the date format from the Python DataFrame is turning into an integer. The dates are in UK format: 01/01/1970 00:00:00.
I have tried to transform it using pd.to_datetime(appended_data['DateOfBirth'], dayfirst=True, format='%d/%m/%Y %H:%M:%S', errors='coerce'), which formats it as 1938-09-01. However, BigQuery will then store it as the integer 1012262400000.
Any ideas on the best way to do this? Is it just to put it in as a STRING in BQ? I have tried that with parse and format date functions and it's not working.
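For what it's worth, the parsing step from the question can be checked in isolation (a minimal sketch with a made-up value; with an explicit format= string, the dayfirst flag is effectively redundant):

```python
import pandas as pd

s = pd.Series(["01/09/1938 00:00:00"])
parsed = pd.to_datetime(s, dayfirst=True, format="%d/%m/%Y %H:%M:%S", errors="coerce")
print(parsed.dt.date.iloc[0])  # 1938-09-01
```

The parsing itself looks correct, which points at the load step (and the missing explicit DATE schema entry) rather than at pd.to_datetime.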