BigQuery: Which API date fromat to use for UK dates when loading data from a dataframe?

AETDDraper commented 5 years ago

So I am running a load job and the date format from the pythons DataFrame is turning into an integer. The date format is in UK Time. 01/01/1970 00:00:00

I have tried to transform it using pd.to_datetime(appended_data['DateOfBirth'],dayfirst=True,format='%d/%m/%Y %H:%M:%S', errors='coerce') which format it comes to: 1938-09-01 However bigquery will then put it in this integer 1012262400000

Any ideas on the best way to to this. Is it just to put in as a String in BQ but I have tried to do that with parse and format date and it's not working?

plamut commented 5 years ago

@AETDDraper Thanks for the question.

This sounds like an issue with automatic types conversion when loading data from a dataframe to a new table. When loading the data, does your app provide an explicit schema for the "problematic" columns?

Something along the following lines:

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(name="column_foo", field_type="STRING"),
        bigquery.SchemaField(name="column_bar", field_type="DATE"),
        ...
    ]
)

load_job = client.load_table_from_dataframe(my_dataframe, my_table_ref, job_config=job_config)

FWIW, the provided schema does not need to list all columns, just those for which the type cannot be automatically deduced, in which case the output can contain warning messages such as the following:

UserWarning: Unable to determine type of column 'foobar'.

Please let us know if an explicit schema solves the issue.

AETDDraper commented 5 years ago

This is the error I am now getting.

Traceback (most recent call last): File "/var/cache/tomcat/temp/interpreter-302728601229892001.tmp", line 89, in client.load_table_from_dataframe(learners_append, table_ref,job_config=job_config).result() File "/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 1519, in load_table_from_dataframe dataframe, job_config.schema, tmppath File "/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 226, in dataframe_to_parquet arrow_table = dataframe_to_arrow(dataframe, bq_schema) File "/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 190, in dataframe_to_arrow "Number of columns in schema must match number of columns in dataframe." ValueError: Number of columns in schema must match number of columns in dataframe.

plamut commented 5 years ago

Which BigQuery client version is this? It might be that your version does not support partial schemas, as support for this was only added recently. Specifying a full schema could help in this case.

BTW, would it be possible to provide a self-contained reproducible example of the code? That would be super useful, as right now I can only make educated guesses about the most probable cause.

Thanks in advance!

AETDDraper commented 5 years ago

Peter are you a Google Employee then yes because my work has a DSA with you.

my version is 1.17.0

plamut commented 5 years ago

Not an employee, I'm a contractor currently working on the Python client libraries.

AETDDraper commented 5 years ago

Upgraded the pip install to 1.2 version. Still not working I have tried formatting with pd.to_datetime not working still going to integer and tried it without aswell with a TIMESTAMP.

Could I share a CSV example with you with the dates?

plamut commented 4 years ago

@AETDDraper A reproducible code example would be even better, but if not possible, a sample data would also be useful. Thanks!

AETDDraper commented 4 years ago

Can I send via personal email?

Then I can share both?

Thanks

David

On Mon, 23 Sep 2019 at 11:16, Peter Lamut notifications@github.com wrote:

@AETDDraper https://github.com/AETDDraper A reproducible code example would be even better, but if not possible, a sample data would also be useful. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-python/issues/9255?email_source=notifications&email_token=AME7LQ6ETFX47LIE2YDGQ3DQLCJRTA5CNFSM4IYUWLJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7KMXVI#issuecomment-534039509, or mute the thread https://github.com/notifications/unsubscribe-auth/AME7LQ73Z46LWMKCEO4GWEDQLCJRTANCNFSM4IYUWLJA .

-- David Draper Data and Technology Development Officer E: ddraper@academiesenterprisetrust.org M: 07551 671649 “Things get done only if the data we gather can inform and inspire those in a position to make a difference.”

--

*Company and disclaimer information related to this email can be found here. https://sites.google.com/aetinet.org/academies-enterprise-trust/email-disclaimer

*

plamut commented 4 years ago

@AETDDraper Is the data sensitive in nature and should not be attached to this GitHub issue? Or by creating a GitHub gist? Only a handful of rows should probably suffice, since it's the columns and their types that matter.

If email is preferred, my address is "inbox AT MynameMysurname dotcom".

AETDDraper commented 4 years ago

It's personal information so would conflict. I have actually got round it by converting to milli timestamp then converting back in BQ but surprised it's doing this.

On Mon, 23 Sep 2019 at 15:21, Peter Lamut notifications@github.com wrote:

@AETDDraper https://github.com/AETDDraper Is the data sensitive in nature and should not be attached to this GitHub issue? Or by creating a GitHub gist? Only a handful of rows should probably suffice, since it's the columns and their types that matter.

If email is preferred, my address is "inbox AT MynameMysurname dotcom".

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-python/issues/9255?email_source=notifications&email_token=AME7LQ7BN2RRI7CJRLTQL43QLDGHZA5CNFSM4IYUWLJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7LAVOI#issuecomment-534121145, or mute the thread https://github.com/notifications/unsubscribe-auth/AME7LQYJITCE3QWRWBIRVEDQLDGHZANCNFSM4IYUWLJA .

-- David Draper Data and Technology Development Officer E: ddraper@academiesenterprisetrust.org M: 07551 671649 “Things get done only if the data we gather can inform and inspire those in a position to make a difference.”

--

*Company and disclaimer information related to this email can be found here. https://sites.google.com/aetinet.org/academies-enterprise-trust/email-disclaimer

*

plamut commented 4 years ago

Good to hear that a workaround exists!

It would still be beneficial to other user, if this could be addressed in the library itself, so when the sample data is available, I'll happily take a look at it.

AETDDraper commented 4 years ago

What's the best way to share with you?

Thanks

On Wed, 25 Sep 2019 at 12:03, Peter Lamut notifications@github.com wrote:

Good to hear that a workaround exists!

It would still be beneficial to other user, if this could be addressed in the library itself, so when the sample data is available, I'll happily take a look at it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-python/issues/9255?email_source=notifications&email_token=AME7LQ6HMOYCKP3GOCZ65CTQLNAORA5CNFSM4IYUWLJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7RP2PY#issuecomment-534969663, or mute the thread https://github.com/notifications/unsubscribe-auth/AME7LQ7L4EFUKVSO645B4FLQLNAORANCNFSM4IYUWLJA .

-- David Draper Data and Technology Development Officer E: ddraper@academiesenterprisetrust.org M: 07551 671649 “Things get done only if the data we gather can inform and inspire those in a position to make a difference.”

--

*Company and disclaimer information related to this email can be found here. https://sites.google.com/aetinet.org/academies-enterprise-trust/email-disclaimer

*

plamut commented 4 years ago

@AETDDraper If sharing publicly is not a desired option, then an email would work. My address is mentioned in one of the earlier comments.

AETDDraper commented 4 years ago

Here is a sample CSV of the dates formatting incorrectly.

The Pandas code used to transform the columns from STRING to DATE but then going into an INT (TIMESTAMP MILLI) were: df['End_Date_Absence_M'] = pd.to_datetime(df['End Date:Absence'],infer_datetime_format=True) df['Start_Date_Absence_M'] = pd.to_datetime(df['Start Date:Absence'],infer_datetime_format=True)

On Thu, 26 Sep 2019 at 14:53, Peter Lamut notifications@github.com wrote:

@AETDDraper https://github.com/AETDDraper If sharing publicly is not a desired option, then an email would work. My address is mentioned in one of the earlier comments.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-python/issues/9255?email_source=notifications&email_token=AME7LQ5VX7OAERZSXLRW4ADQLS5ENA5CNFSM4IYUWLJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7VUHRQ#issuecomment-535512006, or mute the thread https://github.com/notifications/unsubscribe-auth/AME7LQ5TRGUTLRLRIQDGXR3QLS5ENANCNFSM4IYUWLJA .

-- David Draper Data and Technology Development Officer E: ddraper@academiesenterprisetrust.org M: 07551 671649 “Things get done only if the data we gather can inform and inspire those in a position to make a difference.”

--

*Company and disclaimer information related to this email can be found here. https://sites.google.com/aetinet.org/academies-enterprise-trust/email-disclaimer

*

plamut commented 4 years ago

@AETDDraper There seems to be something missing, GitHub might have removed the .csv attachment, as it is not a supported file suffix. Will send you an email directly.

plamut commented 4 years ago

I was able to successfully load the data from a dataframe into a fresh table by providing an explicit schema for dataframe columns, and using the above transformation of the date columns (changed them inline instead of adding them to dataframe as new columns).

Posting a sample data file and a runnable Python script. Versions used:

Python 3.6
pandas 0.25.1
bigquery 1.20.0

Sample script

```py from google.cloud import bigquery import pandas PROJECT = "my-project" DATASET = "my-dataset" TABLE = "my-table" CSV_FILENAME = "/path/to/file.csv" client = bigquery.Client(PROJECT) dataset_ref = client.dataset(DATASET) table_ref = dataset_ref.table(TABLE) # start from scratch print("Deleting table if exists...") client.delete_table(table_ref, not_found_ok=True) # laod CSV data to dataframe and transform columns from string to date df = pandas.read_csv(CSV_FILENAME) df["Start_Date_Absence"] = pandas.to_datetime( df["Start_Date_Absence"], infer_datetime_format=True ) df["End_Date_Absence"] = pandas.to_datetime( df["End_Date_Absence"], infer_datetime_format=True ) df["row_order"] = df.index # optional # configure the load job and run it schema = [ bigquery.SchemaField("Start_Date_Absence", "DATE", mode="NULLABLE"), bigquery.SchemaField("End_Date_Absence", "DATE", mode="NULLABLE"), bigquery.SchemaField("row_order", "INTEGER", mode="NULLABLE"), ] job_config = bigquery.job.LoadJobConfig(schema=schema) load_job = client.load_table_from_dataframe( dataframe=df, destination=table_ref, job_config=job_config, ) load_job.result() ```

The extra column row_order is optional, I only added it to the script to make it easier to compare the resulting BigQuery table data with the source data.

@AETDDraper Does that help with resolving the reported issue?

AETDDraper commented 4 years ago

Thanks Peter!

Regards David Draper AET Data and Technology Development Officer M: 07551 671649

On Tue, 1 Oct 2019, 12:35 Peter Lamut, notifications@github.com wrote:

I was able to successfully load the data from a dataframe into a fresh table by providing an explicit schema for dataframe columns, and using the above transformation of the date columns (changed them inline instead of adding them to dataframe as new columns).

Posting a sample data file https://github.com/googleapis/google-cloud-python/files/3675933/sample_data_issue_9255.txt a runnable Python script. Versions used:

Python 3.6 pandas 0.25.1 bigquery 1.20.0

Sample script

from google.cloud import bigqueryimport pandas

PROJECT = "my-project"DATASET = "my-dataset"TABLE = "my-table"CSV_FILENAME = "/path/to/file.csv"

client = bigquery.Client(PROJECT) dataset_ref = client.dataset(DATASET) table_ref = dataset_ref.table(TABLE)

start from scratchprint("Deleting table if exists...")

client.delete_table(table_ref, not_found_ok=True)

laod CSV data to dataframe and transform columns from string to date

df = pandas.read_csv(CSV_FILENAME) df["Start_Date_Absence"] = pandas.to_datetime( df["Start_Date_Absence"], infer_datetime_format=True ) df["End_Date_Absence"] = pandas.to_datetime( df["End_Date_Absence"], infer_datetime_format=True )

df["row_order"] = df.index # optional

configure the load job and run it

schema = [ bigquery.SchemaField("Start_Date_Absence", "DATE", mode="NULLABLE"), bigquery.SchemaField("End_Date_Absence", "DATE", mode="NULLABLE"), bigquery.SchemaField("row_order", "INTEGER", mode="NULLABLE"), ] job_config = bigquery.job.LoadJobConfig(schema=schema)

load_job = client.load_table_from_dataframe( dataframe=df, destination=table_ref, job_config=job_config, ) load_job.result()

The extra column row_order is optional, I only added it to the script to make it easier to compare the resulting BigQuery table data with the source data.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/googleapis/google-cloud-python/issues/9255?email_source=notifications&email_token=AME7LQ5R7ZN7YEZB7D7SD43QMMYYZA5CNFSM4IYUWLJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAA6NHY#issuecomment-536995487, or mute the thread https://github.com/notifications/unsubscribe-auth/AME7LQ72QCQUQTB3RKRU6LLQMMYYZANCNFSM4IYUWLJA .

--

*Company and disclaimer information related to this email can be found here. https://sites.google.com/aetinet.org/academies-enterprise-trust/email-disclaimer

*

plamut commented 4 years ago

Closing the issue now, as the question has been answered, and no new issues have been reported.

googleapis / google-cloud-python

BigQuery: Which API date fromat to use for UK dates when loading data from a dataframe? #9255

start from scratchprint("Deleting table if exists...")

laod CSV data to dataframe and transform columns from string to date

configure the load job and run it