sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Compression with df2sd? #326

Closed DataRevelation closed 3 years ago

DataRevelation commented 4 years ago

I loaded a 14GB parquet file into Python, then converted it into a SAS data set using df2sd, which resulted in a 130GB file. Is it possible to compress the data file within the df2sd function?

Also, the conversion takes more than 20 hours. Is this normal?

Thanks.

tomweber-sas commented 4 years ago

@DataRevelation, is the compression you refer to for the created SAS data set, like the compress= data set option? I thought I had a way to provide data set options, but that's only for SASdata objects and some other methods. I don't have a parameter for passing in output data set options on df2sd(). I can certainly add one so that output data set options, including compress=, can be specified when the data set is created. If you're talking about compression during transfer of the pandas data to SAS, then no, I don't have any true compression going on there.

As for how long it takes, well, as I hope you would suspect, there are a lot of factors that go into that; there's no simple yes or no. For starters, the data in parquet was 14G, and you say the final SAS data set is 130G. What was the size of the pandas data frame that you created from the parquet file and then actually transferred to SAS? And what were the data types and lengths? One thing here is that, for data transfer, I only send each character value at its actual length (varying-length char data), even though SAS stores char data blank padded, fixed length. So, at least in that sense, I am compressing char data on the fly (not really, but I'm not sending a ton of extra blanks to pad out fixed-length columns). Network speeds obviously matter, and which access method you are using (STDIO, IOM, HTTP) can matter. So can the encoding SAS is running in: all char data in Python is utf-8, so if SAS is running in a different encoding, I have to transcode all the data on the fly as I send it, which takes time too. And what version of saspy are you using? Over time I've changed, enhanced, and fixed df2sd, so it's even possible the latest could be better than an older version.
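To make the size question concrete, here is a rough sketch of how fixed-length, blank-padded character storage can inflate a table compared with the actual value lengths. This is not saspy code: `column_widths` and `padded_vs_actual` are hypothetical helpers, the sample rows are invented, and the sketch assumes each character column is stored at the byte length of its longest value in the chosen encoding and each numeric value takes 8 bytes.

```python
def column_widths(rows, encoding="utf-8"):
    """Return {column: stored width in bytes} for a list of dict rows.

    Assumption: a character column is as wide as its longest encoded value,
    and every non-string value is stored as an 8-byte numeric.
    """
    widths = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):
                size = len(value.encode(encoding, errors="replace"))
            else:
                size = 8
            widths[name] = max(widths.get(name, 0), size)
    return widths


def padded_vs_actual(rows, encoding="utf-8"):
    """Return (bytes if every row is padded to full width, bytes of actual values)."""
    widths = column_widths(rows, encoding)
    padded = len(rows) * sum(widths.values())
    actual = sum(
        len(value.encode(encoding, errors="replace")) if isinstance(value, str) else 8
        for row in rows
        for value in row.values()
    )
    return padded, actual


# Invented sample data: one long value forces the whole column to be wide.
rows = [
    {"id": 1, "comment": "ok"},
    {"id": 2, "comment": "a much longer free-text value"},
    {"id": 3, "comment": ""},
]
padded, actual = padded_vs_actual(rows)
print(f"padded: {padded} bytes, actual values: {actual} bytes")
```

The same helpers also show the encoding effect: calling `column_widths(rows, encoding="latin-1")` on data with accented characters gives narrower columns than utf-8, because those characters take one byte instead of two.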

I haven't tried to load 130G of data, so I can't speak to how long it would take, which of course, wouldn't be a single number, but a range given the various possible combinations of things above.

I'm happy to take a look at these variables for this case and see what I think. Would a WebEx, so we can look at this together, work for you? If not, we can just post back and forth here too. The other thing I would look at first is the SASLOG from the df2sd to see what it shows. If you have the latest saspy, the lastlog() method now gives the complete section of the log for everything submitted by a given method; df2sd submits multiple blocks of code. Previously it only gave back the log for the last section of code run.
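As a small aid for reading that log, the sketch below counts the NOTE, WARNING, and ERROR lines in a block of log text and collects the warnings and errors. `summarize_log` is a hypothetical helper, not part of saspy; it only assumes that log messages start with those prefixes, optionally followed by a message number such as `ERROR 22-322:`.

```python
import re
from collections import Counter

# Matches "NOTE:", "WARNING:", "ERROR:" and numbered forms like "ERROR 22-322:".
_PREFIX = re.compile(r"(ERROR|WARNING|NOTE)(?: \d+-\d+)?:")


def summarize_log(log_text):
    """Return (counts per message type, list of WARNING/ERROR lines)."""
    counts = Counter()
    problems = []
    for line in log_text.splitlines():
        stripped = line.strip()
        match = _PREFIX.match(stripped)
        if match is None:
            continue
        kind = match.group(1)
        counts[kind] += 1
        if kind != "NOTE":
            problems.append(stripped)
    return counts, problems


# Usage, assuming `sas` is an existing saspy session and df2sd has just run:
#     counts, problems = summarize_log(sas.lastlog())
#     print(counts)
#     print("\n".join(problems))
```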

What version of saspy do you have? What access method are you using? Local or remote? The specs on the data frame (size and data types/lengths) would be the first things I would want to look at. And the log, of course.

Thanks, Tom

DataRevelation commented 4 years ago

Yes, I was talking about creating a compressed data set with the "compress=" option. That would be a great feature to have.

Thanks for offering to work on this together. We use Teams at work. Can I send you a Teams invite? What is your email address and preferred time? I am free all this afternoon and most of the weekend.

I am using 3.5.3, remote access, DataFrame has 150M rows, 100 columns, most are string types.
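For a sense of scale, here is the back-of-the-envelope arithmetic using only the numbers reported in this thread. It assumes that the 130GB file corresponds to this 150M-row, 100-column DataFrame, that "GB" means 10^9 bytes, and that the run took exactly 20 hours; since it reportedly took longer, the rates are upper bounds. The byte rate is based on the stored file size, not on what was actually sent over the network.

```python
rows = 150_000_000
columns = 100
file_bytes = 130 * 10**9   # assumption: decimal gigabytes
seconds = 20 * 3600        # assumption: exactly 20 hours (reported as "more than")

bytes_per_row = file_bytes / rows
bytes_per_value = bytes_per_row / columns
rows_per_second = rows / seconds
mb_per_second = file_bytes / seconds / 10**6

print(f"average stored row width: {bytes_per_row:.0f} bytes")
print(f"average stored value width: {bytes_per_value:.2f} bytes")
print(f"at most {rows_per_second:.0f} rows/s, about {mb_per_second:.2f} MB/s of stored data")
```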

tomweber-sas commented 4 years ago

Hey, yeah, my email is on my account here; it's tom.weber@sas.com. Actually, can you email me and let me try to set it up for you to connect to Teams? I always use WebEx, but we're being pushed to switch to Teams for meetings, and I'd like to see if I can do that with an external customer. We can do this now if you like.

tomweber-sas commented 3 years ago

@DataRevelation I've built a new version with all of these changes, V3.6.0. It's out on PyPI, so if you can do a pip install, you can get it that way:

pip uninstall -y saspy
pip install saspy

I'm not sure you can do that, but since it's a production release, maybe you can get it now through whatever channel you use for production versions. Tom
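The thread does not spell out what the release changed, so the following is only a sketch of how the requested compression might be used. It assumes that `df2sd` accepts an `outdsopts=` dictionary of output data set options, as the saspy documentation describes for recent releases. `upload_compressed` is a hypothetical wrapper; the accepted values are those of the SAS `COMPRESS=` data set option.

```python
def upload_compressed(sas, df, table, libref="", compress="yes"):
    """Send df to SAS, asking for a compressed output data set.

    `sas` is expected to be a saspy.SASsession. The `outdsopts=` parameter
    is an assumption here (documented for df2sd in recent saspy releases).
    """
    value = compress.lower()
    if value not in {"yes", "no", "char", "binary"}:
        raise ValueError(f"unsupported COMPRESS= value: {compress!r}")
    return sas.df2sd(
        df,
        table=table,
        libref=libref,
        outdsopts={"compress": value},
    )


# Usage, assuming saspy is installed and a connection is configured:
#     import saspy
#     sas = saspy.SASsession()
#     sd = upload_compressed(sas, df, "big_table", compress="binary")
```

Whether `yes`/`char` or `binary` compresses better depends on the data, so it is worth trying both on a sample before a long transfer.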

tomweber-sas commented 3 years ago

@DataRevelation have you had a chance to try these changes out? How long did it take for the different steps (calculate char lengths, transfer data), when you do them separately? Anything else I can do with this?
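One way to time the two steps separately is sketched below. It assumes the session offers a `df_char_lengths()` method and that `df2sd` accepts its result through `char_lengths=`, as the saspy documentation describes for recent releases; `timed` and `upload_in_two_steps` are hypothetical helpers.

```python
import time


def timed(label, func, *args, **kwargs):
    """Call func, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


def upload_in_two_steps(sas, df, table):
    """Compute char lengths first, then transfer, timing each step.

    `sas` is expected to be a saspy.SASsession; both method names and the
    char_lengths= parameter are assumptions taken from the saspy docs.
    """
    lengths, t_lengths = timed("calculate char lengths", sas.df_char_lengths, df)
    sd, t_transfer = timed(
        "transfer data", sas.df2sd, df, table=table, char_lengths=lengths
    )
    return sd, {"char_lengths": t_lengths, "transfer": t_transfer}
```

Passing the precomputed lengths into the transfer means the second timing covers only the data movement, so the two numbers can be compared directly.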

Thanks! Tom

DataRevelation commented 3 years ago

Hi Tom,

I am trying to upgrade the package from GitHub. It will take a few days for my company to approve it. I will keep you posted.

Thanks,

Zhiqiang



tomweber-sas commented 3 years ago

Sounds good, thanks!

tomweber-sas commented 3 years ago

Hey Zhiqiang, did you want to keep this open, or is it time to close it? Thanks! Tom

DataRevelation commented 3 years ago

Yes, Tom, I think it can be closed now. I will let you know if I find new issues. Thanks.



tomweber-sas commented 3 years ago

Roger that. thanks, tom