sassoftware / saspy

A Python interface module to the SAS System. It works with Linux, Windows, and Mainframe SAS as well as with SAS in Viya.
https://sassoftware.github.io/saspy
Other
372 stars 150 forks source link

Initial composable PROC TABULATE implementation #57

Closed jasonphillips closed 6 years ago

jasonphillips commented 7 years ago

(I'm dropping this in an issue for response rather working immediately towards a PR, because it might be more appropriate as an external add-on library rather than direct incorporation -- thoughts / reactions very welcome.)

I've finished an initial implementation of the PROC TABULATE procedure, opting for a highly composable syntax inspired somewhat by the query-abstraction tools of SQLAlchemy or pydal (to which I've also contributed a bit in the past). The reason for taking this approach is that our office (institutional research at a state university) very often produces deep-tabulated descriptive statistics, and is therefore a heavy user of the base TABULATE procedure on large datasets -- and while the macro language allows for some basic reuse of fragments and code, I've been looking to establish a higher level of abstraction and systematic composition of table structures.

The best overview I can give is this example notebook which shows off the current working syntax: https://github.com/jasonphillips/saspy/blob/proc-tabulate/SAS%2BTabulate.ipynb

That branch also contains the working code. In short, I add a tabulate key to datasets which contains functions for generating class / var / statistic variables, and a table function which can then be passed a composition of those elements, generating the full table syntax from the elements and also generating the class / var statements by assembling all the classes and vars used.

Anyhow, I built it by working backwards from "what syntax would I most enjoy as a user?", and its logic is already very solid, although I still need to implement further TABULATE options beyond the essential class / var / table statements. I'm curious to hear reactions!

tomweber-sas commented 7 years ago

@jasonphillips Cool, thanks. Yes, I'll look this over w/ Jared and see what input we have. Great to have you continuing to enhance saspy! Thanks, Tom

jld23 commented 7 years ago

@jasonphillips, Thanks for opening this issue and contributing to saspy! I'm looking at they syntax and thinking about your ideas. I'll try and get a complete response back to you next week.

Jared

jld23 commented 6 years ago

@jasonphillips looking at the example notebook, the style for calling the tabulate method is very different from the existing data table methods. I would have expected the syntax more like this:

cars.tabulate(classes=['origin','type'], 
              vars=['horsepower','cylinders'], 
              stats=['mean','n'], 
              table={'left':'by_origin * by_type', 
                     'top': '(horsepower | cylinders) * (mean | n)'
                    }
             )

I think this would force much of the .with_ syntax to be put in strings. A tabulate method would definitely be helpful, I'm not sure how to square the syntax differences.

Thoughts????

jasonphillips commented 6 years ago

So... that's kind of where I was wondering to what extent this may be more reasonable as an external extension of saspy rather than in core, because indeed the syntax I created is very different; but I believe the syntax needs to be different like this in order to gain any benefits from saspy over simply writing straight SAS code blocks, so that's where the dilemma will be.

Here's what I see as the main difference in approach to the more straightforward example above; probably sounding pedantic here but it helps to lay out my reasons for desiring a more systematic tool on top of TABULATE's purely convention-driven quirks:

In your example, the actual axis statements (top, left) are simply string arguments. Without a layer of abstraction on top accepting logical conjunctions (via operator overloading, basically) into query strings in a buildable way, any script or reporting tool that uses this library will end up with very messy code, essentially concatenating many different strings manually for different conditions, instead of expressing logical conjunctions.

Eg, with string arguments, a reporting program capable of building differently shaped tables would quickly run into a lot of this kind of conditional work:

top_query = 'origin '
used_classes = ['origin']
used_vars = ['enginesize']

# let's say show_horsepower_groups, show_cylinders & show_counts are options

if show_horsepower_groups==True:
  new_grouping = 'engine_type'
  used_classes.append("engine_type") # have to manually track all vars, classes used

  if (show_cylinders):
     new_grouping = new_grouping + ' | cylinders' # string concats
     used_classes.append('cylinders')

  top_query =  "%s | %s  * (horsepower * (%s))" \
     % (top_query, new_grouping,  show_counts and "mean | n" or "mean") # buggy string interpols. 
  used_vars.append("horsepower") 

cars.tabulate(classes=used_classes, vars=used_vars, top=top_query, ... #etc

...which gets unmanageable quickly when attempting to build more sophisticated shapes on the fly, or sharing code between functions in a more general reporting tool that might create many different table variations for differing audiences, add or remove levels at will, etc. The above, on my current syntax, would be logical expressions instead of string manipulations, to more cleanly combine many different options as needed:

top_query = by_origin

if show_horsepower_groups==True:
   new_grouping = show_cylinders and (engine_type | cylinders) or (engine_type)
   stats = show_counts and (mean | n) or (mean)  
   top_query = top_query | (new_grouping * stats)
   # all logical expressions above, which build an AST-like tree that will serialize into TAB syntax later

# also no need here to maintain lists of vars, classes used; built on the fly when you call table()

cars.tabulate.table(top=top_query, left=left_query)  #etc

The way I see it, TABULATE's syntax is akin using a highly idiosyncratic flavor of SQL, where building statements as strings on the fly is necessarily error prone in nature (even or especially if using the Macro language, which is basically nothing more than string interpolation all the way down); libraries like SQLAlchemy or PyDAL exist largely for this reason, to prevent anyone from falling into the terrible coding practice of constructing SQL statements by appending strings to strings at length, and in order to offer a logical syntax for expressing queries as statements, which then serializes into correct syntax when ready to execute.

So, I don't know -- I appreciate the look at it. I wonder if there's either: (1) a compromise syntax that would be composable / logical, yet closer to the function invocation you want; or (2) any merit to simply establishing a standardized way for external libraries like this to "extend" saspy as a tool.

jld23 commented 6 years ago

@jasonphillips, Thanks for your patience. Please open a pull request for this code to add the method to saspy. We appreciate your contribution and the thought you've put into the interface.

Jared

tomweber-sas commented 6 years ago

Closing out this issue, as this has been merged in! Thanks again @jasonphillips! Tom