sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Dataframe with indices, not output/not overwriting correctly #370

Closed aaronsmith1234 closed 3 years ago

aaronsmith1234 commented 3 years ago

Describe the bug: This is possibly 2 issues, but I am including them in one bug report since there might be a connection. I wrote a dataframe with a string row index from Python to a Solaris SAS server, and the index was not written; i.e. the resultant SAS dataset didn't have the index. I expected the index to be carried over as well.

After calling reset_index on my dataframe to put the row index into its own column, I called dataframe2sasdata again, expecting the dataset to now have the index column, but it did not. I also tried outputting the file to a different SAS dataset, and then it worked fine and had the extra column as expected. I even tried deleting the SAS dataset on the server and then re-running dataframe2sasdata, but the index was still missing.

To Reproduce: steps to reproduce the behavior:

  1. Create a dataframe with a row index
  2. Write it to your remote SAS machine
  3. Observe that the row index column does not exist (although I would expect it to be there)
  4. Delete the file on the remote SAS machine
  5. call reset_index on the dataframe locally
  6. Resave the file as the same name to the remote SAS, and it will still not have the index
  7. Resave the file as a different name to the remote SAS, and it will then have the index

Expected behavior

  1. Expected that a dataframe with a row index would be written to the remote SAS machine with the index included as a column
  2. Expected that overwriting the dataset after changing the dataframe to include the index (i.e. calling reset_index) would result in a SAS dataset with the index column attached (a minimal code sketch of these steps follows below)
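
A minimal sketch of the repro (hypothetical dataset and libref names; assumes a working saspy session named conn):

import pandas as pd
import saspy

conn = saspy.SASsession()   # connection details omitted

# a dataframe with a string row index
df = pd.DataFrame({"gender": ["F", "M"], "count": [9, 10]}).set_index("gender")

# step 2: write it to SAS; the 'gender' index does not come over as a column
conn.dataframe2sasdata(df, "datafrompython", "persfold")

# steps 5-7: move the index back into a column and rewrite
df = df.reset_index()
conn.dataframe2sasdata(df, "datafrompython", "persfold")    # same name: column still missing
conn.dataframe2sasdata(df, "datafrompython2", "persfold")   # new name: column present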


tomweber-sas commented 3 years ago

Hey @aaronsmith1234 I've tried what I believe you're describing on my side, but I see it working as expected, so maybe I'm missing something. Do you have an example of what you are seeing? Here's what I tried:

  1. create a df
  2. create another from it w/ one of the columns set as the index
  3. write both to SAS (whole table from the non-indexed one, all but the index column from the indexed one)
  4. reset the df w/ the index, so now it's back to all columns
  5. create a new data set, and overwrite the one created from the indexed df
  6. both the new and the overwritten data sets have all columns

tom64-5> python3.5
Python 3.5.6 (default, Nov 16 2018, 15:50:39)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-23)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import saspy
>>> sas = saspy.SASsession(cfgname='iomj')
SAS Connection established. Subprocess id is 19643

No encoding value provided. Will try to determine the correct encoding.
Setting encoding to utf_8 based upon the SAS session encoding value of utf-8.

>>> sas
Access Method         = IOM
SAS Config name       = iomj
SAS Config file       = /opt/tom/github/saspy/saspy/sascfg_personal.py
WORK Path             = /sastmp/SAS_work792300001D19_tom64-5/SAS_work162900001D19_tom64-5/
SAS Version           = 9.04.01M4D11092016
SASPy Version         = 3.6.6
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 7449

>>> # use class dataset for examples
>>> sd = sas.sasdata('class','sashelp')

>>> #create df without an index, whole table
>>> df = sd.to_df()
>>> df
       Name Sex  Age  Height  Weight
0    Alfred   M   14    69.0   112.5
1     Alice   F   13    56.5    84.0
2   Barbara   F   13    65.3    98.0
3     Carol   F   14    62.8   102.5
4     Henry   M   14    63.5   102.5
5     James   M   12    57.3    83.0
6      Jane   F   12    59.8    84.5
7     Janet   F   15    62.5   112.5
8   Jeffrey   M   13    62.5    84.0
9      John   M   12    59.0    99.5
10    Joyce   F   11    51.3    50.5
11     Judy   F   14    64.3    90.0
12   Louise   F   12    56.3    77.0
13     Mary   F   15    66.5   112.0
14   Philip   M   16    72.0   150.0
15   Robert   M   12    64.8   128.0
16   Ronald   M   15    67.0   133.0
17   Thomas   M   11    57.5    85.0
18  William   M   15    66.5   112.0
>>> df.dtypes
Name       object
Sex        object
Age         int64
Height    float64
Weight    float64
dtype: object
>>> df.shape
(19, 5)

>>> #create df with an index, use Name column for the index
>>> dfi = df.set_index('Name')
>>> dfi
        Sex  Age  Height  Weight
Name
Alfred    M   14    69.0   112.5
Alice     F   13    56.5    84.0
Barbara   F   13    65.3    98.0
Carol     F   14    62.8   102.5
Henry     M   14    63.5   102.5
James     M   12    57.3    83.0
Jane      F   12    59.8    84.5
Janet     F   15    62.5   112.5
Jeffrey   M   13    62.5    84.0
John      M   12    59.0    99.5
Joyce     F   11    51.3    50.5
Judy      F   14    64.3    90.0
Louise    F   12    56.3    77.0
Mary      F   15    66.5   112.0
Philip    M   16    72.0   150.0
Robert    M   12    64.8   128.0
Ronald    M   15    67.0   133.0
Thomas    M   11    57.5    85.0
William   M   15    66.5   112.0

>>> # columns for df and dfi
>>> df.shape
(19, 5)
>>> dfi.shape
(19, 4)

>>> # create SAS data from df without an index, whole table
>>> sddf  = sas.df2sd(df,  'df')
>>> # create SAS data from dfi with the index set (that column won't be in the SAS data set)
>>> sddfi = sas.df2sd(dfi, 'dfi')

>>> sddf.columnInfo()
    Member  Num Variable  Type  Len  Pos
0  WORK.DF    3      Age   Num    8    0
1  WORK.DF    4   Height   Num    8    8
2  WORK.DF    1     Name  Char    7   24
3  WORK.DF    2      Sex  Char    1   31
4  WORK.DF    5   Weight   Num    8   16
>>> sddfi.columnInfo()
     Member  Num Variable  Type  Len  Pos
0  WORK.DFI    2      Age   Num    8    0
1  WORK.DFI    3   Height   Num    8    8
2  WORK.DFI    1      Sex  Char    1   24
3  WORK.DFI    4   Weight   Num    8   16

>>> # reset the index and get back to the whole table again
>>> dfi2 = dfi.reset_index()
>>> dfi.shape
(19, 4)
>>> dfi2.shape
(19, 5)

>>> # create NEW SAS data from df with the reset index, whole table
>>> sddfi2 = sas.df2sd(dfi2, 'dfi2')
>>> sddfi2.columnInfo()
      Member  Num Variable  Type  Len  Pos
0  WORK.DFI2    3      Age   Num    8    0
1  WORK.DFI2    4   Height   Num    8    8
2  WORK.DFI2    1     Name  Char    7   24
3  WORK.DFI2    2      Sex  Char    1   31
4  WORK.DFI2    5   Weight   Num    8   16
>>>

>>> # remember, the original SAS data set created from the indexed df is missing the Name column
>>> sddfi.columnInfo()
     Member  Num Variable  Type  Len  Pos
0  WORK.DFI    2      Age   Num    8    0
1  WORK.DFI    3   Height   Num    8    8
2  WORK.DFI    1      Sex  Char    1   24
3  WORK.DFI    4   Weight   Num    8   16
>>> sddfi
Libref  = WORK
Table   = dfi
Dsopts  = {}
Results = Pandas

>>> # Overwrite the original table created from the indexed df (missing Name column) with the reset df and get the whole table
>>> sddfir = sas.df2sd(dfi2, 'dfi')
>>> sddfir
Libref  = WORK
Table   = dfi
Dsopts  = {}
Results = Pandas

>>> sddfir.columnInfo()
     Member  Num Variable  Type  Len  Pos
0  WORK.DFI    3      Age   Num    8    0
1  WORK.DFI    4   Height   Num    8    8
2  WORK.DFI    1     Name  Char    7   24
3  WORK.DFI    2      Sex  Char    1   31
4  WORK.DFI    5   Weight   Num    8   16

>>> # Since this SASdata object references the same table, just replaced, it now shows the new whole table, same as sddfir
>>> sddfi.columnInfo()
     Member  Num Variable  Type  Len  Pos
0  WORK.DFI    3      Age   Num    8    0
1  WORK.DFI    4   Height   Num    8    8
2  WORK.DFI    1     Name  Char    7   24
3  WORK.DFI    2      Sex  Char    1   31
4  WORK.DFI    5   Weight   Num    8   16
>>>
aaronsmith1234 commented 3 years ago

Sure @tomweber-sas , here you go!

Source data (SAS dataset on the remote SAS server on Solaris). Note that this is dummy data I got from a SAS example somewhere, not real data that means anything. [screenshot]

SAS connection info. Note that I've blocked out a few things that might be considered "secrets". [screenshot]

# not sure if there's a better way to set libraries, but this works
results = conn.submit("libname persfold '/parentfolder/childfolder/';")

data = conn.sasdata2dataframe("testdata", "persfold")
type(data)
# this reads in the data as expected

agg_data = data.groupby("gender").agg("count")
conn.dataframe2sasdata(agg_data, "datafromPython", "persfold")

You can see here that the resultant dataset is missing the gender column that I indexed on. [screenshot]

but in Python, the index column is there. [screenshot]

Now, if I reset the index, the data in Python looks like this

agg_data=data.groupby("gender").agg("count").reset_index()
agg_data.head()

[screenshot]

but, if I try and overwrite the original dataset using this

conn.dataframe2sasdata(agg_data,"datafromPython","persfold",)

I don't get an error, but it also doesn't overwrite the data as expected. [screenshot]

However, if I write it to a NEW dataset, then it does output as expected

conn.dataframe2sasdata(agg_data,"datafromPython2","persfold",)

[screenshot]

Just in case it's helpful, my pandas and ipython versions:

pandas==1.2.3
ipython==7.21.0
tomweber-sas commented 3 years ago

I've tried this use case and it works too. What is this 'window' or whatever you're viewing the sasdata in? I think after you replace the table w/out the index, you need to refresh that window or make it reread the new table that replaced it, because it does have the new column in it. Or, if there was some reason it didn't replace the table (like you have it open in whatever that window is, so it can't be replaced), then after that step, issue print(conn.lastlog()) to see what happened on that step; it could be an error about not replacing the table because you have it open. You can use saslog() to see the entire log for the session, but lastlog() is nice right after a given method to see just the log from that method.
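
For example, right after the write that appears to do nothing (a sketch using the names from the report above):

conn.dataframe2sasdata(agg_data, "datafromPython", "persfold")

# look at just the log from that last step
log = conn.lastlog()
print(log)

# or check it yourself for the failure to replace the table
if "ERROR:" in log:
    print("the write step logged an error; the table was probably not replaced")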

Tom

aaronsmith1234 commented 3 years ago

I was viewing the SAS dataset in Enterprise Guide. You are correct that having opened it in Enterprise Guide (even if it's not actively open) results in a file lock, so the write fails. [screenshot]

Question: is there a reason NOT to just check every log message for "ERROR:" and throw an exception if that is found? Are there too many cases where that can be misleading or a false positive? I would have expected that if SAS ran into an error, it would have surfaced that in Python, without my having to search the log.

Also, were you able to replicate the problem from my example above? I could paste a complete code example, but it will take a while to put it all together.

tomweber-sas commented 3 years ago

Yeah, just blindly looking for 'ERROR:' in the log and assuming that whatever code I ran failed isn't a viable solution. That's why I provide a lastlog() method, where you can see just the log from any method (like what you get from submit()); it allows you to check for anything you might consider an error, or obviously a real error like in this case.

On the last question, about replicating the problem above: I'm not sure I understand. If you don't have that file locked yourself, then yes, it rewrites that dataset with the column. That worked as expected for me, since I didn't have the data set locked, so nothing prevented it from being rewritten. This is what I got trying that out:

>>> dfi = df.groupby('Sex').agg('count')
>>> dfi
     Name  Age  Height  Weight
Sex
F       9    9       9       9
M      10   10      10      10
>>> dfi.shape
(2, 4)

>>> sdi = sas.df2sd(dfi)
>>> sdi.columnInfo()
     Member  Num Variable Type  Len  Pos
0  WORK._DF    2      Age  Num    8    8
1  WORK._DF    3   Height  Num    8   16
2  WORK._DF    1     Name  Num    8    0
3  WORK._DF    4   Weight  Num    8   24
>>> sdi.head()
   Name  Age  Height  Weight
0     9    9       9       9
1    10   10      10      10

>>> dfir = df.groupby('Sex').agg('count').reset_index()
>>> dfir
  Sex  Name  Age  Height  Weight
0   F     9    9       9       9
1   M    10   10      10      10
>>> sdi
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = Pandas

>>> sdir = sas.df2sd(dfir)
>>> sdi
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = Pandas

>>> sdir
Libref  = WORK
Table   = _df
Dsopts  = {}
Results = Pandas

>>> sdi.head()
  Sex  Name  Age  Height  Weight
0   F     9    9       9       9
1   M    10   10      10      10
>>> sdir.head()
  Sex  Name  Age  Height  Weight
0   F     9    9       9       9
1   M    10   10      10      10
>>>
aaronsmith1234 commented 3 years ago

One idea: if errors are detected, maybe write out a warning message that errors were detected in the log and encourage users to review it? And add a "suppress_error_warnings" option for those who don't want to be warned? I think my confusion was that, by default, I have no idea that I encountered errors without manually searching the log; I expected that if SAS ran into an error, it would follow typical Python behavior and throw an exception.

I'll provide a full code example shortly illustrating my other issue about indices not getting saved correctly.

aaronsmith1234 commented 3 years ago

Note that in the example below, I am assuming you have a working SAS session object named "conn", a libname called "persfold", and that you have imported pandas as pd.

source_data=pd.read_csv("https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/contacts.csv")
conn.dataframe2sasdata(source_data,"source_data_no_index","persfold",)
print(source_data)
source_data.set_index("name",inplace=True)
print(source_data)
conn.dataframe2sasdata(source_data,"source_data_index","persfold",)

See the screenshots below, where the "source_data_index" dataset in SAS is missing the "name" column that we set as the index.

[screenshots]

Or, alternatively, read the data back into python and look at the columns

data_no_index=conn.sasdata2dataframe("source_data_no_index","persfold")
data_with_index=conn.sasdata2dataframe("source_data_index","persfold")
print(data_no_index.columns.to_list())
print(data_with_index.columns.to_list())

Results in

['id', 'location_id', 'organization_id', 'service_id', 'name', 'title', 'email', 'department']
['id', 'location_id', 'organization_id', 'service_id', 'title', 'email', 'department']

Where you can see the name column has now been dropped.

Does that make it clear? Happy to set up a call if that would be easier.

aaronsmith1234 commented 3 years ago

And I think you can see from your example that when you write the dataframe to SAS, it drops the Sex column/index. I would have expected that to carry over to SAS unless I explicitly told it to drop the index, similar to how pandas by default writes the indices unless you tell it not to.
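
For comparison, the pandas default I mean (a quick sketch):

import pandas as pd

df = pd.DataFrame({"count": [9, 10]}, index=pd.Index(["F", "M"], name="Sex"))

print(df.to_csv())              # the Sex index is written out by default
print(df.to_csv(index=False))   # it's only dropped when you explicitly ask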

aaronsmith1234 commented 3 years ago

Also, I realized my columns example from above isn't quite right; I need to use the .index attribute to show that the index has been dropped. My bad, still new to open source work and submitting good issues :) You can see below that both data_no_index and data_with_index have the same index, which does not match the original dataframe.

print(source_data.index) #note that this is after we set name==index
data_no_index=conn.sasdata2dataframe("source_data_no_index","persfold")
data_with_index=conn.sasdata2dataframe("source_data_index","persfold")
print(data_no_index.index)
print(data_with_index.index)

[screenshot]

tomweber-sas commented 3 years ago

Hey @aaronsmith1234, sorry, I'm heads down on other stuff here today, but trying to work on this concurrently, so I'm sorry if I may have missed something. Let me sum up all the things I think we're now talking about so that we can reference them specifically (not in any order).

1) data frame index doesn't come over to SAS as a column

2) data frame index doesn't come over to SAS as an index (not sure you said this, but it seems like another question)

3) I don't look for the word 'ERROR' in the log for every method and print() a message.

4) df2sd() doesn't overwrite an existing data set.

I think these are the things being addressed in this issue? Correct, or am I missing any?

Now, which ones are still on the table?

I think 4 is done, but I'm not sure you agree on that. I may be missing something in the thread about this one.

1 and 2 are also, I think, 'done'. That's how it works and I don't see a valid way to change the behavior. Again, I'm open to further debate. I am not a dataframe expert, so if there is a way that this can actually work, in all cases where it makes sense, then I'm happy to look into it. I'm always looking to make saspy better, but from what I've seen, I'm not seeing this make sense. What pandas thinks of as an index (a row label) and how SAS uses indexes don't match up enough. Also, given this is a Python program, and one line makes the 'index' a real column (or columns) while another makes it the row label, it seems easy to do what you want in your Python code.

3, well, I can't call that 'done' in my mind, since this will always be a debate. When looking at one simple case, it's easy to just say 'look for ERROR in the log and throw an error (or message)'. But when looking at the overall picture, this isn't that clear or obvious. There can be errors in the log which don't mean that what was being done failed. Just printing a message isn't acceptable, as that's not a programmatically accessible solution. Also, failing in the case where it's a false positive isn't acceptable either. The user program (this is a program, written in Python) can check the log for anything they want, which is a programmatically accessible solution. So I will tend to go back to that being the answer. All that being said, there are cases where I do add checks into my multi-step methods to try to catch predictable failures. So that's always something that might need to be done in another case, though that's not just looking for the word ERROR in the log anyway, so it's a bit more specific for the case in hand.

Again, I'm not trying to dismiss anything you're questioning here; just trying to keep on track and be able to focus on specifics. So, which of these (and are there any others?) are we still looking at? Do you believe you see behavior other than what I said each of those cases above does? Or is it that you'd like to see the behavior change?

Thanks! Tom

aaronsmith1234 commented 3 years ago

First, definitely appreciate all the help here. Super good stuff!

Agree that #4 is closed; I think it relates directly to #3, in that if error reporting were more consistent with typical Python error reporting, I would have caught #4 without ever having mentioned it. Going a little deeper here, I think the typical workflow in SAS (at least from what I've seen) is something like: submit some code, review the log for errors or warnings, then continue on.

In Python (and I'm new here, so take it with a grain of salt!), it's very different. Take this code as an example:

ans = 1 / 0
print("thing")

If you submit this, "thing" never gets printed, since Python immediately stops when it hits an unhandled exception. So when I'm programming in Python, even when using SASpy, that's the mindset I come in with: if my program hits an error that I'm not explicitly handling, the program will immediately stop execution and tell me that I received an error. I don't have to go read a log file or system trace for every command I run; the default is assuming it worked unless I see an error.

I think the point you are making (and please correct me if I misstate it!) is that since under the hood you are calling SAS, the workflow should follow a SAS workflow: run a command, look at the log (perhaps programmatically, as you point out), and continue on as needed. I can see your point, but I also think it's a bit dangerous that, unless I'm explicitly checking every SAS command I submit for an error, I could run my process and see no errors in the Python script when in reality every single SAS command failed, and by default SASpy doesn't alert me to that.

I agree that there really isn't a clean solution that works for every use case. One solution, as you recommend, is for me to write a checking step after each command that I run. And yes, that would work, but it's also only solving the problem for me. Another solution would be to add a few options to the package controlling how to handle errors; then users can decide what makes the most sense for them. Some are primarily running interactively and might be fine with just warnings; others might be running in prod and want exceptions thrown for every SAS error. It's a nuanced point that doesn't have a clean answer, but I think giving a default that at least raises the issue for awareness, so people can decide how they want to handle it, is a prudent route, instead of letting it go unseen except with manual introspection.

A few lines from the Zen of Python came to mind as I was thinking of this, and I think they relate to the discussion:

....
Explicit is better than implicit.
....
Errors should never pass silently.
Unless explicitly silenced.
aaronsmith1234 commented 3 years ago

My issue with #1 and #2 (which I agree are separate issues) is that you can pass a dataframe to SAS via SASpy, immediately retrieve it from SAS, and not get the same result. Note that I'm using "result" as a very loose term here; while it would be ideal if you could pass a dataframe to SAS and then retrieve that dataframe from SAS with the exact same indices, columns, etc., that's probably not possible (although maybe it is possible via some type of metadata you could add to a dataset, specifying what the dataset is indexed by in Python? I'm not up enough on my SAS chops to be able to say what might be possible here or whether it's worth the effort, just throwing out ideas!).

The specific problem at hand that I was experiencing, that of row indices, can be solved by reset_index() (which also correctly unpacks multilevel row indices, at least in the examples I've tried); yes, I could add that to every line where I'm writing a dataframe to SAS, but the next person who comes along is likely going to have the same issue (unless people aren't using indices as much as they should, since they are fantastic 😁). Given we know there is a potential issue here, some type of warning to the user alerting them of the problem would be valuable. Pandas does this now; if you do things that technically work but can trip you up, it will emit warnings (which you can explicitly silence if desired). There is something to be said for being explicit about what you are doing, which would lead to using reset_index, but if going that route, I think it should also warn the user that a non-default range index will get lost when passing the dataframe to the dataframe2sasdata method. I'm going to do a little bit of experimentation on multilevel column indices and will post some additional thoughts on those below when I have some time.
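
For example, the reset_index workaround, including a multilevel row index (a sketch; "conn" and "persfold" are the session and libref from my earlier examples, and "withindexcols" is just a made-up output dataset name):

import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "title":  ["Dr", "Ms", "Mr"],
    "count":  [3, 6, 10],
}).set_index(["gender", "title"])        # two-level row index

# flatten the MultiIndex back into ordinary columns before transferring
conn.dataframe2sasdata(df.reset_index(), "withindexcols", "persfold")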

Re: the pandas example, see below. It's letting the person know that they are potentially going to get into trouble, although not actually stopping execution.

import pandas as pd
source_data=pd.read_csv("https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/contacts.csv")
source_data["location_id"][0]=0

Python has a pretty good ecosystem built out around this, and if desired, users can ignore warnings, or can programmatically catch them, or turn them into exceptions, a la:

import pandas as pd
import warnings

warnings.simplefilter("error")
source_data = pd.read_csv(
    "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/contacts.csv"
)

source_data["location_id"][0] = 0
print("never get here")

I think the biggest thing I was looking for is, by default, warning the user that something may be going wrong, or that they are potentially getting themselves into trouble (e.g. when copying a dataframe with a non-default range index to SAS).

aaronsmith1234 commented 3 years ago

And an example with multiindex on columns:

import pandas as pd
source_data=pd.read_csv("https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/contacts.csv")
source_data.head()
source_data["extraColumn1"]=1
source_data["extraColumn2"]=2
data_with_multi_row_and_column_indices=pd.pivot_table(source_data,values="location_id",index=["name","title"],columns=["extraColumn1","extraColumn2"])
conn.dataframe2sasdata(data_with_multi_row_and_column_indices,"multiindices","persfold",)

And a view of the data from EG [screenshot] vs Python/Jupyter [screenshot], and a csv written with the default options [screenshot].

I think that kind of gets to my point; with the default pandas methods, it outputs all the data by default, and I can recreate the original dataframe from the csv, as long as I set the indices when I read it in. But with the current SAS handling, by default, I can't recreate it, since it drops row indices. If I reset the index and then save it to SAS, then I can recreate it, but it's not consistent with the other to_* methods of pandas, where by default it will keep everything.
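
Concretely, what I mean by setting the indices on read (a sketch; the header part for the column MultiIndex is the piece I haven't verified):

# write the pivot out and read it back, restoring the indices on the way in
data_with_multi_row_and_column_indices.to_csv("pivot.csv")
restored = pd.read_csv("pivot.csv", header=[0, 1], index_col=[0, 1])
# index_col restores the name/title row index; header=[0, 1] is my guess for the two column levels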

I haven't tried recreating the columns from the multiIndex example above, but I think it would be possible, just not enough time to try 😢

tomweber-sas commented 3 years ago

ok, that's a lot of stuff. So, 3 posts; let me try to address them. Sorry again, I'm up to my eyes in other stuff at the moment.

Post 1 - mostly about number 3. No, I don't think that you have to look at the log after every method of saspy you use. That would be ridiculous, which is part of the reason I don't do that internally either :). This particular method is one where the programmer should assess the outcome programmatically, or visually if you only ever run the code interactively. A note to stdout saying 'I saw the word ERROR in the log' isn't a solution unless you would really never do anything in the code other than run the whole program one step at a time interactively. As to:

ans = 1 / 0
print("thing")

Try not to pigeonhole Python based upon observations from a higher level wrapper; there are multiple ways people can run Python code, and again, that's one of my issues: what may seem like a solution for one simple case doesn't solve all the cases, so a partial implementation isn't a good solution. When I run those two lines, the first gets an error and the second executes just fine:

>>> ans = 1 / 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
>>> print("thing")
thing
>>>

I'm guessing you're running in a Jupyter notebook? And you have both of those in one cell? Even then, if you put them in separate cells and run them both, you'd see the print succeed.

For the data creation methods, there are too many possibilities in SAS for me to make a definitive assessment of success or failure. I have to defer that to the user. It can be unsuccessful even if ERROR isn't in the log. It can be successful even with ERROR in the log. When using this, knowing others may be accessing the same dataset you're trying to create/replace/update/delete, you really need to validate what happened, as it can vary any time you run it. There are lots of ways to do this programmatically, not just parsing the log.
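
For instance, using only calls already shown earlier in this thread (a sketch against the WORK.dfi example above):

sd = sas.df2sd(dfi2, 'dfi')        # attempt the replace

# verify the result directly rather than trusting the log
if 'Name' not in set(sd.columnInfo()['Variable']):
    print(sas.lastlog())           # see what actually happened on that step
    raise RuntimeError("WORK.dfi was not replaced with the full table")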

For posts 2/3 which are about indexes, I'm gonna need to investigate this a little further. Maybe even prototype something to really see if it's even feasible. Again, one simple case may work, but I want to make sure it can be consistent and deterministic.

Tom

aaronsmith1234 commented 3 years ago

No worries; this isn't something that is stopping me, just trying to help improve things! Feel free to drop the conversation as well if it's just taking up time and not leading to a productive outcome for you; I don't have any skin in the game.

I'd be interested in your recommendation on how people should be using SASpy and checking for errors. Checking for errors manually seems like not an ideal solution; if I'm including some piece of code in my program, I have a reason for including it, and I would like to know if it errors out (even if it's some type of error that I end up suppressing because it's not a real error/exception).

When running Python, the methods I use are:

[screenshot]

The only scenarios I can see where you can submit those two lines and have both execute (and again, I'm decently new to Python, so there could easily be cases I am missing/not aware of) are if you submit them independently; I believe your example is that, where you submit the first line, see it error out, and then submit the second. The Jupyter notebook example of different cells is the same in my mind; in both cases, you are submitting two separate batches of code, seeing one error out, and then proceeding to submit the second. And to me, that's fine; you explicitly see an error, decide you don't care, and continue submitting code anyway.

I agree that just writing something to stdout isn't a super solid solution, but I'd also argue that leaving all error checking up to the end user isn't a solid solution either. I like the pandas approach noted above: use warnings by default when possibly dangerous things are happening (which I think an error in the log would be an example of; it may not necessarily be an actual error, but it is indicative that something with the potential for trouble is happening), and leave it up to users whether they want to disable them, treat them as exceptions, or ignore them entirely. It's at least giving people a breadcrumb/nudge that they can then act on.

Have other users not brought up any concerns about error reporting? I'd be interested to know what other users have done in regards to this.

tomweber-sas commented 3 years ago

Hey @aaronsmith1234 , I got a handle on what I've been heads down on, so I have some free time today. I investigated pandas indexes (I've never really looked into them much before), and I think I can see a way to be able to automatically process a single column index, creating a variable out of it when transferring the dataframe to SAS. I expect I can simply set the index after moving a dataset to pandas, if the data set had a single index defined. I'll have to prototype all of this to really see, but I think I can acquire all of the info I need and process the index like a column on output. This will be a fairly extensive enhancement; lots of code. But I think it should be able to work. I don't see multi indexes working; I don't see the info I need with those.

Currently, it only takes 2 lines of user code to accomplish this, really: reset_index(), then df2sd(..., outdsopts={'index':'variable'}), then set_index() back and your df is as it was. And for the other direction: sd2df(), then set_index(). So it's very simple on the Python side, which is why I've never looked at trying to implement all of this in my code before. It will easily be 100 lines of code and a lot of testing and validation.
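
Spelled out with the class example from earlier (a sketch; I've left the outdsopts index option off for brevity):

# dataframe -> SAS: flatten the row label into a real column for the transfer
sd = sas.df2sd(dfi.reset_index(), 'dfi')

# SAS -> dataframe: re-establish the row label after reading it back
df_back = sas.sd2df('dfi').set_index('Name')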

As for adding an option to check the log for errors and print a message, I have some ideas on that too. I might be able to do that in a common place, such that it's not so pervasive. If you have some time today, I'm happy to get on a call and talk about all of this. This thread's getting awfully long and spreading out over many topics :) A conversation can get us farther quicker.

Thanks! Tom

tomweber-sas commented 3 years ago

Really enjoyed the conversation!

I've reworked my prototype of the check for ERROR: in the log, which printed a message and set an attribute (so you can programmatically check it), to use the warnings module like you suggested. This does seem a better way to do this than printing, as you can control it in various ways. I'm not clear yet on how to programmatically check it, though. Do you have an example of that? I could leave my attribute in there if that's easier.

It seems the default is to only show a given warning the first time, though you can change that too. I've tried that, and also setting it to throw an error instead of printing the warning, and that worked. So, I think I'll go this route for these new enhancements. I may go back and see if other messages would be better to work into this too. Going forward, new ones will work this way, like the index case I'll look at adding, as we talked about. Great suggestion on your part!

>>> sas.list_tables('x')
/opt/tom/github/saspy/saspy/sasioiom.py:976: UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
  warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>> sas.list_tables('x')
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>> sas.list_tables('x')
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>>
>>>
>>>
>>> import warnings
>>> warnings.filterwarnings("always",module='saspy')
>>> sas.list_tables('x')
/opt/tom/github/saspy/saspy/sasioiom.py:976: UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
  warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>> sas.list_tables('x')
/opt/tom/github/saspy/saspy/sasioiom.py:976: UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
  warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>> sas.list_tables('x')
/opt/tom/github/saspy/saspy/sasioiom.py:976: UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
  warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
[('3 BAD NAME', 'DATA'), ('A', 'DATA'), ('AF', 'DATA'), ('CAFEMENU', 'DATA'), ('CARS', 'DATA'), ('CARSFR', 'DATA'), ('CHARF', 'DATA'), ('PRDSALE', 'DATA'), ('PRDSALEFR', 'DATA'), ('PRDSALE_', 'DATA'), ('SCENSTATES', 'DATA'), ('SCOREDATA', 'DATA'), ('TOM', 'DATA'), ('TOM2', 'DATA'), ('TOM3', 'DATA'), ('V91', 'DATA'), ('_SUMMARY', 'DATA')]
>>>
>>>
>>>
>>> warnings.filterwarnings("error",module='saspy')
>>> sas.list_tables('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/tom/github/saspy/saspy/sasbase.py", line 2064, in list_tables
    ll = self._io.submit(code, results='text')
  File "/opt/tom/github/saspy/saspy/sasioiom.py", line 976, in submit
    warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
>>> sas.list_tables('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/tom/github/saspy/saspy/sasbase.py", line 2064, in list_tables
    ll = self._io.submit(code, results='text')
  File "/opt/tom/github/saspy/saspy/sasioiom.py", line 976, in submit
    warnings.warn("Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem")
UserWarning: Noticed 'ERROR:' in LOG, you ought to take a look and see if there was a problem
>>>
>>>
>>> warnings.resetwarnings()
aaronsmith1234 commented 3 years ago

I enjoyed it as well, thanks for the time. Here is an example of what I meant; users can decide, based on the warning type, how they want it handled. There may be more extensive handling you can do, but I haven't dug into it much more than that.

import warnings

warnings.warn("DeprecationWarningExample", DeprecationWarning)
warnings.simplefilter("error", RuntimeWarning)

warnings.warn(
    "Deprecation Warning is still just a warning, no error", DeprecationWarning
)
warnings.warn(
    "Runtime warning will kill the code based on filter above", RuntimeWarning
)
warnings.warn("This will never get executed due to error above", DeprecationWarning)

I also noticed that by using a context manager, you can suppress warnings but still get a count of how many were generated. I couldn't find a way to get a count without doing the suppression, though:

import warnings

with warnings.catch_warnings(record=True) as warn_suppress:
    warnings.warn("DeprecationWarningExample", DeprecationWarning)
    warnings.warn("DeprecationWarningExample", DeprecationWarning)
    warnings.warn("DeprecationWarningExample", DeprecationWarning)
    print(len(warn_suppress))
tomweber-sas commented 3 years ago

Well, I saw those statements in the doc, but can you use it in a sentence? :) Being able to set the filter and such is fine for controlling whether you want to see these, or not, or just once, ...

But, for programmatically checking for it? How do you actually do this? Here's a simple case:

sas.check_error_log = False
df = sas.sd2df('PRDSALE', 'x')

if sas.check_error_log:
   if df is not None:
      print("Error in the log, but I got my dataframe")
   else:
      print("Didn't get a dataframe and there was an error in the log")
      print(sas.lastlog())
      raise

df.head()
# ...

# with warnings, how do you programmatically check?

df = sas.sd2df('PRDSALE', 'x')

if warnings.???:
   if df is not None:
      print("Error in the log, but I got my dataframe")
   else:
      print("Didn't get a dataframe and there was an error in the log")
      print(sas.lastlog())
      raise

df.head()
# ...
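
One possible answer, building on the catch_warnings example above (a sketch only; it relies on the warning text containing 'ERROR:', not on anything saspy itself provides):

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")          # don't let repeat warnings be suppressed
    df = sas.sd2df('PRDSALE', 'x')

if any("ERROR:" in str(w.message) for w in caught):
    if df is not None:
        print("Error in the log, but I got my dataframe")
    else:
        print("Didn't get a dataframe and there was an error in the log")
        print(sas.lastlog())
        raise RuntimeError("sd2df failed")

df.head()
# ...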
tomweber-sas commented 3 years ago

Hey @aaronsmith1234, I've been buried in other stuff here the past week or so, but I'm finally getting this out to GitHub. I just pushed these changes for adding the check and message via warnings, so they are at main now. You can pull and test it out to see what you think. I also have some doc on this, which I will push out when I build a new release, but at the moment you can see it here: https://github.com/sassoftware/saspy/blob/main/saspy/doc/source/advanced-topics.rst#automatic-checking-for-error-in-the-log-and-the-warnings-module It'll be in the real doc later when the new version is built.

If you have a chance, let me know what you think!

Thanks, Tom

aaronsmith1234 commented 3 years ago

Hey @tomweber-sas , I'm slammed with work, but if I get a chance I'll give it a shot and let you know!

tomweber-sas commented 3 years ago

No trouble. I know the feeling! :)

tomweber-sas commented 3 years ago

Hey, I added a warning for the case where the dataframe has other than a RangeIndex (the default):

>>> sd = sas.df2sd(df);sd.head()
    Make           Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia        All  36945    33337         3.5          6         265        17           23    4451        106     189
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0          4         200        24           31    2778        101     172
2  Acura         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4          4         200        22           29    3230        105     183
3  Acura          TL 4dr  Sedan   Asia      Front  33195    30299         3.2          6         270        20           28    3575        108     186
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5          6         225        18           24    3880        115     197
>>>
>>> sd = sas.df2sd(df.set_index('Make'));sd.head()
/opt/tom/github/saspy/saspy/sasiohttp.py:1238: UserWarning: Note that Indexes are not transferred over as columns. Only actual coulmns are transferred
  warnings.warn("Note that Indexes are not transferred over as columns. Only actual coulmns are transferred")
            Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0             MDX    SUV   Asia        All  36945    33337         3.5          6         265        17           23    4451        106     189
1  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0          4         200        24           31    2778        101     172
2         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4          4         200        22           29    3230        105     183
3          TL 4dr  Sedan   Asia      Front  33195    30299         3.2          6         270        20           28    3575        108     186
4      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5          6         225        18           24    3880        115     197
>>>
>>> sd = sas.df2sd(df.set_index('Make').reset_index());sd.head()
    Make           Model   Type Origin DriveTrain   MSRP  Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  Wheelbase  Length
0  Acura             MDX    SUV   Asia        All  36945    33337         3.5          6         265        17           23    4451        106     189
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  23820    21761         2.0          4         200        24           31    2778        101     172
2  Acura         TSX 4dr  Sedan   Asia      Front  26990    24647         2.4          4         200        22           29    3230        105     183
3  Acura          TL 4dr  Sedan   Asia      Front  33195    30299         3.2          6         270        20           28    3575        108     186
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  43755    39014         3.5          6         225        18           24    3880        115     197
>>>

Just pushed that out to main as well. I'll add a note about this into the df2sd() doc too, so it's apparent there to start with.

tomweber-sas commented 3 years ago

I've just built a new version, V3.6.7, with all of these enhancements, so it's the current production version now. If you're good with all of this, we can close out this ticket. Obviously, if you need anything else, just let me know!

Thanks! Tom

aaronsmith1234 commented 3 years ago

Hey @tomweber-sas , thanks! Yeah I'd say close it. Appreciate the good work!