sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Allow writing to parquet files #231

Closed wligtenberg closed 5 years ago

wligtenberg commented 5 years ago

We want to export data from SAS into a Spark cluster. Having this feature would allow us to use saspy to generate the parquet files directly. CSV has the downside that it loses type information.

The solution we would like is something akin to to_csv on the SAS data object, but a to_parquet method that writes a valid parquet file, keeping as much of the type information as possible.

We have tried CSV, but the loss of type information is too much of an issue. We currently use an adapted version of Sqoop and a special JDBC driver to connect to the JDBC interface of SAS, but that seems to take longer than the CSV exports using saspy.

Having saspy support parquet output would give us the best of both worlds, I think. As before, I am willing to help out if I could get some pointers on how to approach this.

FriedEgg commented 5 years ago

Doing this in SAS code requires licensing a product such as SAS/ACCESS Interface to Hadoop. That would let you define a libname statement using the hadoop engine and specify the DBCREATE_TABLE_OPTS option with the value "stored as parquet".

Assuming you do not have that product and you want to use saspy in its current state, you can get your SAS data set into a pandas DataFrame and then write your parquet file from there (pandas.DataFrame.to_parquet).

A future implementation could provide PySpark as an option instead of Pandas?
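
For reference, a minimal end-to-end sketch of that pandas route (the cfgname is a placeholder for whatever connection you have configured, sashelp.cars is just a small sample table that ships with SAS, and to_parquet needs pyarrow or fastparquet installed):

import saspy

sas = saspy.SASsession(cfgname='default')
cars = sas.sasdata('cars', libref='sashelp')  # SASdata object pointing at the table
df = cars.to_df()                             # pull it into a pandas DataFrame
df.to_parquet('cars.parquet')                 # let pandas write the parquet file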

tomweber-sas commented 5 years ago

Correct, what @FriedEgg said. Right now you can do the following:

sasdata.to_df_CSV().to_parquet('filename')

If you do have SAS/ACCESS, you can use sas.submit() to submit code that writes the data directly to Hadoop as a parquet file, though I'm guessing you don't have that.
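
A rough sketch of what that sas.submit() call could look like, assuming SAS/ACCESS Interface to Hadoop is licensed (the server, schema, and table names below are placeholders, and the exact libname options depend on your cluster):

hadoop_code = """
libname hdp hadoop server="my-hive-server" schema="default"
    DBCREATE_TABLE_OPTS="stored as parquet";

data hdp.my_table;        /* lands in Hadoop as a parquet-backed table */
    set mylib.my_table;
run;
"""
result = sas.submit(hadoop_code)
print(result['LOG'])      # check the SAS log for errors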

The question I'm wondering about is what 'type information' you are losing. SAS doesn't really have many types, just floating-point doubles and fixed-length byte arrays for characters. When I create a data frame from a SAS data set, I query the formats to tell whether a double represents some kind of date/time so I can convert those correctly. Is that the missing piece?
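
As a quick check of what does survive the trip (using sashelp.prdsale, a sample table shipped with SAS that has character columns, plain numerics, and a date-formatted numeric column):

df = sas.sasdata('prdsale', libref='sashelp').to_df()
print(df.dtypes)  # date-formatted SAS numerics come back as datetime64,
                  # other numerics as float64, character columns as object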

So, does sasdata.to_df_CSV().to_parquet() do what you want? If not, what's missing?

Thanks, Tom

wligtenberg commented 5 years ago

Thanks for the suggestion! I will look into it and report if it meets my needs.

tomweber-sas commented 5 years ago

Cleaning up old issues. Hopefully the pandas to_parquet() approach was acceptable? If you still need anything, just let me know. I'll close this, though, as I'm not looking at implementing this directly.

Thanks, Tom

wligtenberg commented 5 years ago

I can verify that it works for me. The only downside is that I had to code a loop to prevent OOM issues. That led to another interesting observation: the way to loop through tables in SAS is different depending on the storage back end... :( But hey, it works now!

jld23 commented 5 years ago

Can you share an example of how you looped through and the differences between access engines?

Thanks!

tomweber-sas commented 5 years ago

Are you saying that your Python session doesn't have enough memory to hold the data frame and then create the parquet file from it? If so, then yeah, that's all pandas and Python. Assuming you were running that last posted suggestion:

sasdata.to_df_CSV().to_parquet('filename')

That will export the SAS data set to a CSV file on the server and download it to the client (unless you're on the same system, in which case it just uses the exported file). So far, no real memory usage in Python. Then it calls pandas read_csv to create the data frame, and then to_parquet on that data frame has pandas create the parquet file. Is that what you did? Was it that read_csv couldn't create the data frame in your Python session because it's too big for your memory?

Thanks, Tom

wligtenberg commented 5 years ago

Sorry for the delay, this escaped my attention for a while.

Indeed, the issue is that pandas cannot create the data frame in memory. That is why I decided to loop over chunks of the data.

As requested, here is an example of such a loop:

import saspy

sas = saspy.SASsession(cfgname='iomwin')

sas.saslib(libref='MY_LIBNAME', engine='SPDE', options="ACCESS=READONLY", path='MY_PATH')
data = sas.sasdata(table='MY_TABLE', libref='MY_LIBNAME')
# determine the total number of observations
total_rows = data.obs()
# loop over the table in blocks using sasdata dsopts
n_rows_chunk = 10**8
n_loop = total_rows // n_rows_chunk + 1
for i in range(n_loop):
    print(i)
    # SPDE table, so we need to use startobs and endobs!
    dsopts = {'startobs': (i * n_rows_chunk) + 1, 'endobs': (i + 1) * n_rows_chunk}
    data = sas.sasdata(
        table='MY_TABLE',
        libref='MY_LIBNAME',
        dsopts=dsopts)
    tmp = data.to_df_CSV()
    tmp.to_parquet('my_file_name_' + str(i) + '.parquet', compression=None)

For BASE tables, I found that you only need to change the dsopts line to:

dsopts = {'firstobs': (i * n_rows_chunk) + 1, 'obs': (i + 1) * n_rows_chunk}

Only the names of the options differ; the calculations are the same.

tomweber-sas commented 5 years ago

Yes, cool. That is a good way around it when Python can't hold the whole table. Thanks for showing that. One minor thing that has no net effect but cleans up the code a little: dsopts is an attribute of the SASdata object, so in the loop you can just change that attribute instead of recreating the object each time. Again, no big difference, just smaller code. This

    dsopts = {'startobs': (i * n_rows_chunk) + 1, 'endobs': (i + 1) * n_rows_chunk}
    data = sas.sasdata(
        table='MY_TABLE',
        libref='MY_LIBNAME',
        dsopts=dsopts)

can be replaced with this:

    data.dsopts = {'startobs': (i * n_rows_chunk) + 1, 'endobs': (i + 1) * n_rows_chunk}
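
Putting it together, the loop would then look something like this (just a sketch, reusing the names from the example above):

data = sas.sasdata(table='MY_TABLE', libref='MY_LIBNAME')
total_rows = data.obs()
n_rows_chunk = 10**8
n_loop = total_rows // n_rows_chunk + 1
for i in range(n_loop):
    # update the dsopts attribute in place instead of creating a new SASdata object
    data.dsopts = {'startobs': (i * n_rows_chunk) + 1, 'endobs': (i + 1) * n_rows_chunk}
    data.to_df_CSV().to_parquet('my_file_name_' + str(i) + '.parquet', compression=None)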

Thanks again! Tom