sassoftware / python-swat

The SAS Scripting Wrapper for Analytics Transfer (SWAT) package is the Python client to SAS Cloud Analytic Services (CAS). It allows users to execute CAS actions and process the results all from Python.
Other
148 stars 62 forks source link

read_excel can't read datetime values #116

Open JoannaNawalanySAS opened 3 years ago

JoannaNawalanySAS commented 3 years ago

read_excel from SWAT works different that read_excel from Pandas, wich makes me unable to read Date column as a Datetime value.

SWAT: storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public")) It reads Date column as Varchar

Pandas: sd = pd.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage") It reads Date column as datetime64

I've tried explicity use converters and it is not helping:

SWAT: storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), converters={'Date': pd.to_datetime})

Pandas: sd = pd.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", converters={'Date': pd.to_datetime})

Anything I do I see the action run is: NOTE: 20: action table.loadTable / path='_f_c15dc565_7ff582b29b88.csv', readAhead=true, importOptions={fileType='csv', vars={Cost={type='double'}, Date={type='varchar'}, Event={type='varchar'}, Summary={type='varchar'}}, locale='EN-us'}, resident=true, promote=false, caslib='_LIB_C15DC565_7FF582B29B88', casOut={name='STORM_DAMAGE', caslib='Public', replace=true}, singlePass=false, noSource=true;

kesmit13 commented 3 years ago

There are some differences between the REST interface and the binary interface in CAS that prevented all of the extended data types from working, so some were normalized to strings. There are a couple of ways to work around it depending on what version of the server you are on and what protocol you are using.

If you are using the binary protocol, you can use the use_addtable=True option on the read_excel method. This uses a different technique to upload data which supports all data types.

If you have a more recent version of the CAS server which includes informats in the table.loadTable importoptions={vars=...} parameter, you can specify an importoptions= parameter to read_excel indicating the format of the date/time column. The informat values are described here https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001239776.htm.

JoannaNawalanySAS commented 3 years ago

Thank you for your answer. I've tried your tips, but still I'm struggling with the problem.

As the action conn.builtins.history() shows importOptions I assume I have enough new version of CAS to use that parameter. I've tried like this (as I'm not sure if it is an CAS option or read_excel option)

  1. storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public", importoptions={'Date':'mmddyy10.'}))
  2. storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), importoptions={'Date':'mmddyy10.'})

Both didn't work.

I've tried to use use_addtable=True option, same, wrong, results: storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), use_addtable=True)

kesmit13 commented 3 years ago

I remembered working on a related issue recently (7eeac93332c5ed50c12004f1cc1f8fcb06f7b898). I forgot that you need to specify date_format on read_excel to match the informat specified in importoptions. However, that change hasn't gone into an official release yet. You could try copying the one file into you installation if possible until another release goes out.

JoannaNawalanySAS commented 3 years ago

I've copied the file you've mentioned to my server, but I'm note sure where exactly put those options.

Is the code below correct? storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), importoptions={'Date':'datetime'}, date_format={'Date':'mmddyy10.'})

kesmit13 commented 3 years ago

The date_format is passed to the to_csv method of the underlying pandas DataFrame, so it uses Python's date formats (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). If you are going for MMDDYY, then the Python format string would be '%m%d%y'.

JoannaNawalanySAS commented 3 years ago

How do I use the importoptions option in read_excel? It seems not working at all, even for string types, not only for dates. Do I have it set correctly here:

storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), importoptions={'Date':'datetime'}, date_format={'Date':'%m/%d/%Y.'})

kesmit13 commented 3 years ago

You are missing the vars= level: importoptions={'vars': {'Date': 'datetime'}}.

JoannaNawalanySAS commented 3 years ago

I've tried this (including specyfing cost as varchar to see if it is a problem with date values or maby with other types too):

storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage", casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), importoptions={'vars':{'Date': 'datetime', "Cost":"varchar"}, 'date_format':{'Date':'%m/%d/%Y.'}}) storm_damage.dtypes

These are the results: Event varchar Date varchar Summary varchar Cost double

There is still the same problem.

kesmit13 commented 3 years ago

I believe this is what you want.

importoptions={'vars':{'Date':{'informat': 'datetime', 'type':'datetime'}, 'Cost':{'type':'varchar'}}}
JoannaNawalanySAS commented 3 years ago

This is not working eather (I've inserted you sugestion and andded "date_format") :(

storm_damage = conn.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage",
                               casout=dict(name="STORM_DAMAGE", replace=True, caslib="Public"), 
                               importoptions={'vars':{'Date':{'informat': 'datetime', 'type':'datetime', 'date_format':'%m/%d/%Y.'}, 'Cost':{'type':'varchar'}}})
storm_damage.dtypes

This is the result (even Cost is still a double): Event varchar Date varchar Summary varchar Cost double

kesmit13 commented 3 years ago

@JoannaNawalanySAS It doesn't appear to be working the way that I remember. However, there is a workaround for now. You can do this in two steps. First, read the excel file into a DataFrame, then upload the DataFrame. That's all that would be happening behind the scenes anyway.

import pandas as pd
data = pd.read_excel("D:\Workshop\VIYAPY\storm.xlsx", sheet_name="Storm_Damage")
storm_damage = conn.upload_frame(data,
     date_format='%d%b%Y:00:00:00',  # <= this indicates how dates should be written to the intermediate CSV
     importoptions={
        'vars': {
            'Date': {'informat': 'datetime',  # <= this indicates how CAS should read the column in the CSV
                         'type': 'datetime'},
            'Cost': {'type': 'varchar'}
        }
     }, casout=dict(...))

I'll have to look into the read_* methods to see how to make this work with them as well.

JoannaNawalanySAS commented 3 years ago

Thank you, I didn't thought about it as almost equally resources consuming solution. Importing to DataFrame and then uploading to CAS works fine!

Just one think: ERROR: Parameter 'date_format' is not recognized.