sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

df2sd is getting stuck for large file #550

Closed rahulmahajan511 closed 9 months ago

rahulmahajan511 commented 11 months ago

Please see the code below. I am trying to read a large SAS dataset (>15GB) chunk-wise, process each chunk, use df2sd to save the chunk, and then combine it with the main SAS dataset. The logic works fine for a small file, but when I run it on my actual (>15GB) SAS file, it works fine for the initial 2 million records and then df2sd just hangs. It keeps waiting at that step, and I have to interrupt the script.

When I interrupt the script, the logs are not created for that step either. Any idea why this would be happening, and a way around it?


import saspy
import pyreadstat
import pandas as pd

sas = saspy.SASsession(cfgname='winlocal')
reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=100000)
i = 0
for df, meta in reader:
    print(df.shape)
    df[datecol] = df[datecol].apply(pd.to_datetime, errors='coerce')
    out = sas.saslib(libref='Out', path=r'C:\User\Desktop\Test')
    dct = sas.df_char_lengths(df)
    if i == 0:
        # create the sas dataset with fname using first chunk
        result = sas.df2sd(df=df, table=fname, libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
    else:
        # save the chunk as sas dataset with name temp
        result = sas.df2sd(df=df, table='temp', libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
        # combine the sas dataset temp with main sas dataset fname
        sas.symput(name='fname_py', value=fname)
        sascode = r''' OPTIONS COMPRESS = BINARY;
                      LIBNAME LIB 'C:\User\Desktop\Test';

                      DATA LIB.&fname_py;
                        SET LIB.&fname_py LIB.temp;
                      RUN;
                  '''
        sas.submit(sascode)
    i += 1

print('End Process')

SASsession Object
Access Method         = IOM
SAS Config name       = winlocal
SAS Config file       = C:\Users\nik\Desktop\DataTest\venv\Lib\site-packages\saspy\sascfg_personal.py
WORK Path             = C:\Users\nik\AppData\Local\Temp\SAS Temporary Files_TD9208LING-XPE1-D0010-PR\Prc2\
SAS Version           = 9.04.01M7P08052020
SASPy Version         = 5.2.2
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = wlatin1
Python Encoding value = windows-1252
SAS process Pid value = 21300

tomweber-sas commented 11 months ago

OK, thanks, that's helpful. So, next is: where exactly does this 'hang'? What line, and after how many iterations? Have you tried adding some print statements for debugging? print(sas.lastlog()) after each df2sd() call to see the log and check if there were any problems? print(dct) each time to see what it is? That's the one thing that jumps out at me with this code: saspy is calculating the character lengths independently for each chunk. It may be that these vary each time and that there could be some error in SAS when you try to combine them. It could even be that the dataframes aren't the same every chunk; the data types are decided based upon the data. I don't know your data, so if a column is a number one time and a string or datetime another, that could throw a wrench in the works. I would add debugging to assess the df each iteration and see if it could be causing an error in SAS.
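The per-iteration debugging Tom suggests could be sketched like this. This is a hypothetical helper, not part of saspy: it compares each chunk's inferred dtypes against the first chunk's, which is one way the chunks can drift apart and break the combine step in SAS.

```python
import pandas as pd

def dtype_drift(first_df, chunk_df):
    """Return columns whose inferred dtype differs from the first chunk's."""
    first = first_df.dtypes
    cur = chunk_df.dtypes
    return {c: (str(first[c]), str(cur[c]))
            for c in first.index
            if c in cur.index and first[c] != cur[c]}

# example: a column read as int in one chunk and float in another
a = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
b = pd.DataFrame({'x': [1.5, 2.5], 'y': ['c', 'd']})
print(dtype_drift(a, b))  # {'x': ('int64', 'float64')}
```

In the user's loop this could run right after each chunk is read, alongside print(df.info()) and print(dct), to catch a chunk whose types differ before df2sd() is even called.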

You could try writing each chunk to a different SAS table (add an increment to the temp name), and not combine them. Then see if they all get written and whether they are compatible. After that, try to combine them and see if there's some error.
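A minimal sketch of that suggestion, assuming a hypothetical chunk_table_name helper (not a saspy function); the df2sd() call is left commented since it needs a live SAS session:

```python
def chunk_table_name(base, i):
    """Build a numbered per-chunk table name, e.g. 'temp1', 'temp2', ..."""
    name = f"{base}{i}"
    # SAS data set names are limited to 32 characters
    assert len(name) <= 32, "SAS data set names are limited to 32 characters"
    return name

# for i, (df, meta) in enumerate(reader, start=1):
#     sas.df2sd(df=df, table=chunk_table_name('temp', i), libref='Out')
```

With every chunk in its own table, it is easy to see which iteration fails and to inspect the per-chunk tables for mismatched column lengths or types before attempting the combine.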

A couple of other thoughts: seeing only source without any output, I don't know what DateField and DateFormat are, so I can't tell if there's anything off there. Again, looking at the log to see what happened in each step would be insightful; it could be that some chunk doesn't think one of these columns is a datetime or something.

Also, reassigning the libref for C:\User\Desktop\Test over and over in each iteration isn't needed. That can be done once after you create the SASsession, not in the loop; but that's not an actual problem.

Can you try looking at these things and see what you find? Provide me with the df.info() for each chunk and the dct (char_lengths) for each chunk, and let's see if it's an issue with the various pieces not going together right in SAS.

Thanks, Tom

rahulmahajan511 commented 11 months ago

While I look into the different options you mentioned, here are a few more details and findings which may give you more insight or help us find the issue.

# save the chunk as sas dataset with name temp
result = sas.df2sd(df=df, table='temp', libref='Out', char_lengths=dct,
                   outdsopts={'compress': 'binary'},
                   datetimes=DateField,
                   outfmts=DateFormat)

Is there a way to clear the logs or switch off logging? That might be a more efficient approach than opening and closing a SAS session each time:


reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=100000)
i = 0
for df, meta in reader:
    sas = saspy.SASsession(cfgname='winlocal')
    print(df.shape)
    df[datecol] = df[datecol].apply(pd.to_datetime, errors='coerce')
    out = sas.saslib(libref='Out', path=r'C:\User\Desktop\Test')
    dct = sas.df_char_lengths(df)
    if i == 0:
        # create the sas dataset with fname using first chunk
        result = sas.df2sd(df=df, table=fname, libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
    else:
        # save the chunk as sas dataset with name temp
        result = sas.df2sd(df=df, table='temp', libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
        # combine the sas dataset temp with main sas dataset fname
        sas.symput(name='fname_py', value=fname)
        sascode = r''' OPTIONS COMPRESS = BINARY;
                      LIBNAME LIB 'C:\User\Desktop\Test';

                      DATA LIB.&fname_py;
                        SET LIB.&fname_py LIB.temp;
                      RUN;
                  '''
        sas.submit(sascode)
    sas._endsas()
    i += 1
print('End Process')
tomweber-sas commented 11 months ago

OK, I've been trying this out to see what I see. Obviously, I have no idea what data you're using for input, although something jumped out as a curiosity after I looked up the code you're using to read it: the input is a SAS data set! So, I assume you're trying to vet your code/logic, and you're using a SAS data set as input to create an output SAS data set, so you can easily compare the input and output as two SAS data sets and tell whether your process worked right? I hope that's the reason, and you're planning to use this for some other kind of input data?

The code you posted has syntax and logic errors, but I see what the point is, and I fixed those so I could run your code, start to do what I had asked you to do, provide those results, and see for myself what is happening. I also monitored the Python, Java, and SAS processes while the code ran to see if there were any issues with any of that; I didn't see any problems in that regard. I used a SAS data set that is large enough (5G) to see the variations in each iteration and whether they cause problems, and to see the general performance of the process. There is one significant performance issue that can be addressed so it runs much faster. For the code to really be correct, though, there's another change that will boost performance significantly too; but your process has to change to support that.

But first, a question. How did you determine that the process is 'hung'? Were you monitoring the three processes and saw they were no longer executing? What did you actually see? The guess about logging doesn't jump out at me as the problem. But I've seen nothing but partial source code, instead of logs, errors, and output showing what's going on. I'm not seeing anything keeping the overall process from running to completion.

So, that speaks to a few possibilities, off the top of my head.

First is what I mentioned to begin with. You're creating multiple different dataframes through this process. Depending upon the input data, these data frames can differ in a number of ways, which can cause downstream problems. First, lengths for string columns will vary from one chunk to the next. That does cause issues in SAS, depending upon the options and code being executed when combining them. Until I see what your code thinks these are, I can only speculate. In my test, many of the character columns had different lengths for each chunk, so combining them caused truncation and warnings. Again, seeing the logs from your runs will provide necessary info. However, that didn't make the process fail; not in my case. The other thing is if the data types change in each data frame. That's possible too, since the data types are derived from the values in the set of rows you happen to have. That can cause errors in SAS, not just truncation warnings. Again, looking at the log and the dataframe metadata for each iteration will tell if that's a problem.

The other possibility is that it's not hung; it's running, and after a number of iterations it's taking significantly longer. That's the first performance issue I saw and mentioned above. The 'combine' step is a performance problem. Each iteration, you read in the entire final table and write it out as a new table, along with the temp table you're trying to add. That gets slower and slower every iteration. You need to append the temp table to the final table instead of creating a new table from both. Proc Append is the answer to that performance problem. I wish the data step had an append option, so I could append to it directly instead of having to create a temp table first; but it doesn't. One difference between append and your data step is that by default append will fail when column lengths don't match, while the data step just truncates and continues. You have to use the force option to get append to do the same. So, that makes the overall run much faster, but it still begs the question of the problems of trying to combine data that doesn't match up correctly. This is where the bigger performance answer comes in, but you have to know your data to do that.
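The PROC APPEND change described above might look like the following. The LIB libref and fname macro variable come from the thread's own code; the sas.submit() call is left commented since it needs a live SASsession, and only the SAS code string is built here.

```python
# Build the append step as a string, mirroring the thread's sascode pattern.
# &fname_py is the macro variable the user already sets via sas.symput().
append_code = """
PROC APPEND BASE=LIB.&fname_py DATA=LIB.temp FORCE;
RUN;
"""
# FORCE makes APPEND behave like the data step: truncate mismatched
# character lengths and continue, instead of failing outright.
# result = sas.submit(append_code)
print(append_code)
```

Because APPEND only writes the new rows onto the existing table, each iteration's cost stays roughly constant instead of growing with the size of the final table.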

You need to have the one correct set of metadata for the final table and provide that to df2sd() each iteration, so it's the same every time. And the data in each dataframe has to match, at least as far as data types; the char lengths can be less for any chunk, but the definition of the SAS data set needs to be the same, with the longest lengths for char columns so nothing is truncated. That's something you would need to come up with at the beginning of this process. Trying to do this in chunks means you have to know the data ahead of time to make the SAS metadata for the final table compatible with the various chunks.
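One way to get that single consistent set of metadata (an assumption of how it could be done, not a saspy built-in) is a first pass over the chunks that takes the maximum length per column, producing one char_lengths dict that is then passed unchanged to every df2sd() call:

```python
def merge_char_lengths(per_chunk_dicts):
    """Take per-chunk char-length dicts (as from sas.df_char_lengths)
    and return one dict with the max length seen for each column."""
    merged = {}
    for d in per_chunk_dicts:
        for col, length in d.items():
            merged[col] = max(merged.get(col, 0), length)
    return merged

# e.g. lengths computed per chunk, one dict per chunk
chunks = [{'name': 8, 'city': 12}, {'name': 15, 'city': 7}]
print(merge_char_lengths(chunks))  # {'name': 15, 'city': 12}
```

With the merged dict, the SAS table definition never changes between iterations, so no chunk's character data is truncated when it is appended.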

There's a third possibility with this too: maybe something really is hung. One thing that's still a possibility is that the data in one of these data frames is the problem (you seem to imply it stops at the same point, data-wise). The IOM interface doesn't handle transcoding failures well. The package you're using to read the SAS data set as input isn't a SAS product, so there's no telling if what you're getting from it is correct or accurate. The data set I'm using seems to work, though I haven't validated the output yet. But one data set I tried to use for input failed to be read by that package; I just got an error saying it couldn't read it. So, there's no telling if there could be bad data from that input causing a problem. Generally, IOM crashes on a transcoding failure, and I've handled that in saspy so it comes back with a failure, not a hang, so that seems less likely than the other possibilities. But I need output to see what's happening on your side.

So, I know this was long, but looking forward to seeing what some of your information is with your runs.

Tom

rahulmahajan511 commented 11 months ago

I did try writing each chunk to a different SAS table (adding an increment to the temp name), but it wrote only 21 chunks and the 22nd chunk never gets written. I did some debugging by adding print statements as shown below to find where it is actually getting stuck:


    if i == 0:
        # create the sas dataset with fname using first chunk
        result = sas.df2sd(df=df, table=fname, libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
        print('step1:df2sd')
        print(sas.lastlog())
    else:
        # save the chunk as sas dataset with name temp
        result = sas.df2sd(df=df, table='temp', libref='Out', char_lengths=dct,
                           outdsopts={'compress': 'binary'},
                           datetimes=DateField,
                           outfmts=DateFormat)
        print('step2:df2sd')
        print(sas.lastlog())
    print('file written:', i)

Until the 21st chunk everything works fine, but at the 22nd chunk it does print step2:df2sd and the log info, yet it does not print file written: 22. The last statement in the log is a %put. I am working on getting approval to share the log with you; that should provide you more details.

tomweber-sas commented 11 months ago

Hey, you can email me all of that information instead of posting it here, that's no problem at all. I'm SAS, so it's all NDA, same as if you were providing information to Tech Support. Thanks, Tom

tomweber-sas commented 10 months ago

Do you still need help with this? Have you looked at any of the issues I mentioned with how you're trying to do this? Thanks, Tom

tomweber-sas commented 9 months ago

Hey, I'm going to close this since I have nothing else to go on and haven't seen any information with which to help. If you'd like to continue looking into this, just reopen it. I'm happy to help, but I haven't gotten any information so far. Did you ever try skipping the first 20 or so chunks that run and starting with the one that has the issue, to see if it works, or to get a log or any other info, or to see if it still seems to hang? Tom