sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy

Possible bad warning message #442

Closed biojerm closed 2 years ago

biojerm commented 2 years ago

Hi Tom,

**Describe the bug**
When I write out a DataFrame whose index type is Int64Index, I get the following warning message: UserWarning: Note that Indexes are not transferred over as columns. Only actual coulmns are transferred

I think this warning message was added in issue #370

(also note there is a minor spelling error with the second "columns")

**To Reproduce**

import os

import pandas as pd
import saspy

PACKAGE_DIR = os.path.dirname(os.path.abspath(__file__))
CFG_PATH = os.path.join(PACKAGE_DIR, "saspy_cfg.py")

sas = saspy.SASsession(cfgfile=CFG_PATH, cfgname="sas_u8")
print(sas)

data = pd.DataFrame(
    {
        "fruit": ["orange", "apple", "orange"],
        "county": ["Smith", "Green", "Orange"],
    }
)
print("index BEFORE selection")
print(data.index)  # Range index
print("index AFTER selection")
orange_data = data.loc[
    data["fruit"] == "orange", ["county"]
]  # select orange fruit

print(orange_data.index)  # Now an Int64Index

sas.saslib("home", path="/home/jlabarge")

sas.df2sd(orange_data, table="fruit", libref="home")
sas.endsas()

**Expected behavior**

I would prefer not to get the warning when writing out what I believe to be a 'normal' dataset; it makes it seem like something went wrong. Fun twist: Int64Index is being deprecated in favor of NumericIndex. https://pandas.pydata.org/docs/reference/api/pandas.Int64Index.html#pandas.Int64Index

I don't know what the 'right' solution here is. In #370 I kind of get what was happening, where the contents of the indexes had meaning. But here it seems like the indexes don't matter. I unfortunately don't know much about how the index types are chosen, so I can't say why the selection I am doing results in Int64Index. Nor do I know whether adding Int64Index and NumericIndex to the 'ignore' set would be sufficient, or would result in other weird things. https://github.com/sassoftware/saspy/blob/79a4a9400e7f255c89aafdca36b21484796bda9c/saspy/sasiostdio.py#L1699

**Screenshots**
My output 

```
Access Method         = STDIO
SAS Config name       = sas_u8
SAS Config file       = /scharp/devel/jlabarge/scripts/saspy_cfg.py
WORK Path             = /tmp/SAS_workA2AA00009D0E_statsrv/
SAS Version           = 9.04.01M2P07232014
SASPy Version         = 3.7.5
Teach me SAS          = False
Batch                 = False
Results               = Pandas
SAS Session Encoding  = utf-8
Python Encoding value = utf_8
SAS process Pid value = 40206

index BEFORE selection
RangeIndex(start=0, stop=3, step=1)
index AFTER selection
Int64Index([0, 2], dtype='int64')

21
22   libname home    '/home/jlabarge'  ;
NOTE: Libref HOME was successfully assigned as follows:
      Engine:        V9
      Physical Name: /home/jlabarge
23
/trials/lab/python/anaconda3-4.10.1/lib/python3.8/site-packages/saspy/sasiostdio.py:1700: UserWarning: Note that Indexes are not transferred over as columns. Only actual coulmns are transferred
  warnings.warn("Note that Indexes are not transferred over as columns. Only actual coulmns are transferred")
```

tomweber-sas commented 2 years ago

Hey Jeremy, yeah, looking back at that issue, it was a long one with a number of different issues, and things done to try to address them. I'll try to speak to what I recall and maybe need to dig back into more of it if that's not enough.

The part about df2sd when the DF has a row label (index) is that the row label isn't transferred over as a column to the SAS Data Set, since it's not an actual column in the DF (a DF index and a SAS index aren't the same thing). It's not created as a column, and it doesn't get a SAS index created for it either. That was the 'expected' case from that issue for a DF w/ row labels. But there are too many cases where that can't work even if I tried to do it, so trying to do it and catching all the cases that don't work was a way more complicated situation than keeping it as is. It's also trivial on the Python side to switch the index back to a real column (one line of code) and then do df2sd so you get that column (and you can add an option to get a SAS index created for that column in the same call if you want that), and then one line to put the index back on the DF. So that's how it works, and since that was seen as something the user wouldn't know had happened (thinking they would get columns and indexes in SAS when they didn't do the one line on either side of df2sd), adding that warning (not an error) was the compromise for that case. Oh, and I'll fix the typo, thanks; my bad!
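As a rough illustration of the "one line on either side of df2sd" idea, here is a minimal pandas sketch. The frame contents and the `site` index name are made up for the example, and the df2sd call is left as a comment because it needs a live SAS session; the option for creating a SAS index is not shown since its name isn't given in this thread.

```python
import pandas as pd

# Hypothetical frame whose row labels carry real data.
df = pd.DataFrame(
    {"county": ["Smith", "Orange"]},
    index=pd.Index(["s1", "s3"], name="site"),
)

flat = df.reset_index()            # "site" becomes an ordinary column
# sas.df2sd(flat, table="fruit", libref="home")  # would now include "site"
restored = flat.set_index("site")  # put the row labels back on the DF

print(list(flat.columns))
```

After the round trip, `restored` has the same labels and data as the original `df`.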

The check I make for range index was to eliminate that warning for the 'normal' cases where there wasn't some index specifically created which the user thought should have shown up on the SAS Data Set. Any DF w/out an explicit row label still has a default one (like Obs number in SAS) which is just the row number (and happens to be Range Index), so I was trying to keep that warning from showing up except for obvious cases where there was some other data in that index that wasn't going to make it over to the SAS Data Set. Hope that all makes sense.
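A check along these lines could look roughly like the sketch below. This is not the saspy source; `warn_if_index_is_not_default` is a hypothetical helper, and the frames are invented, just to show how a default RangeIndex can be told apart from explicit row labels.

```python
import warnings

import pandas as pd


def warn_if_index_is_not_default(df):
    """Hypothetical helper: warn unless the frame has the default RangeIndex."""
    if not isinstance(df.index, pd.RangeIndex):
        warnings.warn("Note that Indexes are not transferred over as columns.")


default_df = pd.DataFrame({"county": ["Smith", "Green", "Orange"]})
labelled_df = default_df.set_index(pd.Index(["s1", "s2", "s3"], name="site"))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_index_is_not_default(default_df)   # default row numbers: silent
    warn_if_index_is_not_default(labelled_df)  # explicit labels: warns

print(len(caught))
```

The frame in the report falls on the warning side of such a check because, as its output shows, the selection left it with an Int64Index rather than a RangeIndex.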

The warning that is coming out is issued through Python's warnings module, which I also started using as part of that issue. I added it because there's no programmatic return code from SAS to tell whether anything worked or not; SAS isn't a simple programming language where every 'call' returns an RC you can check to see if there was success or not. So I added code to check for the word ERROR in the log after every submission and used the warnings module to add a note (or warning) about having seen an Error in the log, so that it can be checked for programmatically (by user Python code), and users don't have to parse the log themselves trying to guess if something worked or failed.
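For anyone wanting to check for these programmatically, the standard library's `warnings.catch_warnings(record=True)` is one way to do it. In this sketch `fake_df2sd` is a hypothetical stand-in for a saspy call that emits the warning, so the example runs without a SAS session.

```python
import warnings


def fake_df2sd():
    """Hypothetical stand-in for a saspy call that emits a UserWarning."""
    warnings.warn(
        "Note that Indexes are not transferred over as columns. "
        "Only actual columns are transferred"
    )
    return "done"


# Record warnings instead of printing them, so they can be inspected in code.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is suppressed here
    result = fake_df2sd()

index_warnings = [
    w for w in caught if "Indexes are not transferred" in str(w.message)
]
if index_warnings:
    print("The DataFrame index was not transferred:", index_warnings[0].message)
```

The same pattern should work around a real call, with the recorded list telling you whether anything was flagged.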

The Warnings addition can be user controlled as to what level of Warnings (note, warning, error, ...) is processed. It can be turned off in a number of ways (turn it off for saspy or for certain modules, but not for other uses). It can also be used to test for these, so you can programmatically check for these cases and do what you like. Here's my doc on this addition, though you can check out the full doc on the module itself since it's just a Python Standard Library module.

Let me know if that addresses this or not, or whatever. It's kind of multiple different things that all overlap, so it may or may not be completely obvious.

Thanks, Tom

tomweber-sas commented 2 years ago

Just as a quick FYI, if you don't want any of these 'new' Warnings being issued and just want saspy to behave as it did before that was implemented, you can simply add the following to your config file and never think about it again. You can also subset it for specific messages or categories, ..., but this will just make it like it wasn't added in:

import warnings
warnings.filterwarnings('ignore',module='saspy')

Not sure if that's the simple answer for you, but it may be if it's just that you don't want that message showing up. Tom
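For the "subset it for specific messages" option, a narrower filter might look like the sketch below. The `message` argument is a regular expression matched against the start of the warning text, so a prefix is enough (and it keeps working once the typo in the message is fixed). The second warning text is made up, only to show that other messages still come through.

```python
import warnings

INDEX_MESSAGE_PREFIX = "Note that Indexes are not transferred"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filters added later are checked first, so this "ignore" wins for matches.
    warnings.filterwarnings(
        "ignore", message=INDEX_MESSAGE_PREFIX, category=UserWarning
    )

    warnings.warn("Note that Indexes are not transferred over as columns.")
    warnings.warn("Some other message")  # hypothetical text, still recorded

remaining = [str(w.message) for w in caught]
print(remaining)
```

In a config file only the `warnings.filterwarnings(...)` call would be needed; the `catch_warnings` block is there just to demonstrate the effect.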

biojerm commented 2 years ago

Thanks for the tip about the warnings module and how to 'mute' warnings.

Ignoring the 'most normal' case of RangeIndex does seem to be the best approach here. Adding other index types to that ignore list is bound to give more conflicts than help. 'Knowing' what the user wants to do with the indexes is basically impossible, so this approach lets them choose to care or not.

Plus, like you said, all I need to do to 'solve' the warning is add a .reset_index(drop=True), and now my data frame indexes are back to 'normal'.
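A small sketch of that fix, using a frame built directly with the index `[0, 2]` to mirror what the selection in the report left behind (the construction is an assumption made so the example doesn't need a SAS session):

```python
import pandas as pd

# Same shape as the frame in the report after the "orange" selection.
orange_data = pd.DataFrame({"county": ["Smith", "Orange"]}, index=[0, 2])

# Discard the old labels and renumber rows from 0.
clean = orange_data.reset_index(drop=True)

print(type(clean.index).__name__)
```

With `drop=True` the old labels are thrown away rather than kept as a column, so `clean` is back to the default row numbering.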

Trying to balance the inherent differences of SAS and python/pandas is always going to be tricky.

tomweber-sas commented 2 years ago

First, I'm glad that addressed what you need, in a one time simple way; put that in on purpose :) ! Second, I appreciate you 'getting' the dilemma. Backward Compatibility is the other thing in play too, as well as what you mention. So having an easy way to 'go back' when something like this changes (Not breaking, but new, more, different ...) helps with that.

And, yes, 'Most normal case' is a compromise; try to not even have the message for most cases if you can == No need to even set this.

And yes, SD != DF is always a compromise :) Thanks for getting that!

Tom