next-exp / IC

Another example of "I have a hammer, everything is a nail?" #592

Closed jjgomezcadenas closed 5 years ago

jjgomezcadenas commented 5 years ago

@jacg, please advise:

While discussing code (in KrCalib, not yet IC grade, work in progress) with @mmkekic the following issue came up. KrCalib is a set of functions that work with a DataFrame df whose columns are:

Index(['event', 'time', 's1_peak', 's2_peak', 'nS1', 'nS2', 'S1w', 'S1h',
       'S1e', 'S1t', 'S2w', 'S2h', 'S2e', 'S2q', 'S2t', 'Nsipm', 'DT', 'Z',
       'Zrms', 'X', 'Y', 'R', 'Phi', 'Xrms', 'Yrms'],
      dtype='object')

The goal of those functions is to produce correction maps, but my question here is more related to infatuations, hammers and nails.

Suppose that you want to write a function that takes a subset of the DF (say X, Y, S2e) and operates on it: for example, one that also takes a matrix of corrections C (indexed by X, Y) and returns a vector of corrected energies, the product of the S2e vector with the C matrix.

The obvious option is to use functions that take an instance of the DF, or a slice of the DF (or even an expanded DF). Pandas provides machinery to do all that. For example:

def correct_energy(df : DataFrame, cf : DataFrame) -> np.ndarray:
    return df.S2e * cf[df.X][df.Y]

and then call:

E = correct_energy(df[['X', 'Y', 'S2e']], cf)

So, actually, I am only passing the relevant subset of the DF to correct_energy().

Clear enough; but looking at the function's signature I don't know exactly what "df" (or "cf") is, since they are very generic things (DataFrames).

So the infatuated (old) programmer defines a type, thinking it adds clarity:

@dataclass
class XYS2df:
    x   : np.ndarray
    y   : np.ndarray
    s2e : np.ndarray

Then he creates an XYS2df object:

xys2df = XYS2df(df.X.values, df.Y.values, df.S2e.values)

Now he thinks the function has gained in clarity, because the input type is explicit:

def correct_energy(df : XYS2df, cf : DataFrame) -> np.ndarray:
    return df.s2e * cf[df.x][df.y]

and calls:

E = correct_energy(xys2df, cf)

The advantage, arguably, is that by looking at the type in the function signature I know what I am passing. The disadvantage is that creating the xys2df object will be less efficient than slicing the dataframe. Also, I have lost the "power" (e.g. all the goodies) of the DF object.
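
To make the trade-off concrete, here is a hypothetical illustration of what is lost (toy values invented for the example):

import pandas as pd

df = pd.DataFrame(dict(X=[1.0, 2.0], Y=[3.0, 4.0], S2e=[100.0, 200.0]))

# With the DataFrame slice, all the goodies are still available:
df[['X', 'Y', 'S2e']].query('S2e > 150')   # filtering
df[['X', 'Y', 'S2e']].describe()           # summary statistics

# With the dataclass, none of them are: xys2df is just three bare arrays.
# xys2df.query(...)      # AttributeError
# xys2df.describe()      # AttributeError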

@mmkekic raised the issue while discussing this afternoon. After passionately arguing that I was right, I strongly suspect that she has uncovered another hammer-and-nail example.

So here is the infatuated functional programmer humbly asking for your feedback.

jacg commented 5 years ago

Type annotations, while an interesting idea, really do not fit comfortably with the fundamental typing philosophy of Python, which is Duck Typing. They have been added to the language as an afterthought, and have gained little traction within the broad Python programming community, especially in the scientific Python community, where people tend to care even less about the code itself---as opposed to its output---than programmers in other domains do. Consequently, the expressivity of the type annotation tools available in Python is underdeveloped, especially when it comes to the tools that scientific programmers like to use.

Contrast this with, say, Haskell: there the type system is a central pillar of the language which has been carefully thought out before the language was built on top of it. This is in stark contrast to it having been bolted on as an afterthought in a neighbourhood where the majority just doesn't care, as is the case in Python. My point is that you should expect huge gaping holes in the utility and expressivity of Python type annotations (especially when compared to Haskell's type system), and such shortcomings will be especially obvious in scientific programming contexts.

Pandas dataframes are very powerful and capable objects, which have been designed with scientific computing in mind. In a scientific computing context, it is likely to be counterproductive to replace dataframes with some other less capable objects, for the sake of improving type hinting.

It's not so much that it looks like a nail, but rather that it doesn't look at all like a nail, so you have replaced the <whatever intricate object we choose for the metaphor this time> with a nail just so that you can whack it with your hammer. The result is a very cleanly struck nail, but it comes at the cost of having lost the interesting features of the original object. This is likely to be very unsatisfactory, however beautifully the nail has been driven home.

If you want to pursue the idea of providing better type hints (which is an admirable goal) then, rather than throwing away your dataframe, you should look for a way to express to the type hinting system some more nuanced information about the dataframe you have. By analogy: the tuple type itself tells us nothing about the tuple's length or the types of its elements. One approach would be to define your own dataclass with a fixed number of members with annotated types:

@dataclass
class Person:
    name : str
    age : int

and use that instead of the tuple. An alternative would be to explain the structure of the tuple to the type system:

Tuple[str, int]
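
For instance, a minimal sketch of how such an annotation reads in a signature:

from typing import Tuple

def greet(person: Tuple[str, int]) -> str:
    name, age = person
    return f'{name} is {age} years old'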

However, this example makes it easy to overlook just how much capability is lost by writing your own dataclass, rather than using the original tuple/dataframe. In the case of the tuple you don't lose much; in the case of the dataframe the losses are immense.

Unfortunately, it's not at all clear to me whether/how it is possible to tell the type system anything more about the dataframe besides the fact that it is a dataframe. Initial searches haven't borne any fruit, and, as you know, I don't have the luxury of spending much time on this at present.

jacg commented 5 years ago

(In my copious spare time, so this will be a slow process) I'm looking into how we might provide rich type hints relating to dataframes.

I suppose it must be possible to create our own typing-like DataFrame (analogous to typing.Tuple allowing us to say Tuple[int, str] rather than simply saying tuple), but I'd rather not reinvent the wheel, so I'm trying to find some existing wheels. Here is something that might help in the case of Numpy arrays: https://andreacensi.github.io/contracts/

No idea whether it's any good or useful: just making a note of the fact that it exists, for future reference.
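
For concreteness, a minimal sketch of what such a typing-like DataFrame might look like (the DATAFRAME helper is hypothetical and carries no meaning any type checker understands; it merely names the expected columns for the human reader):

import pandas as pd

class DATAFRAME:
    def __class_getitem__(cls, columns):
        # At runtime the annotation is just the plain DataFrame type;
        # the column names exist purely for documentation
        return pd.DataFrame

def correct_energy(df : DATAFRAME['X', 'Y', 'S2e']) -> 'pd.Series':
    ...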

Of course, this whole discussion just makes me want to rewrite IC in Idris :-)

jacg commented 5 years ago

This post discusses how to improve Pandas DataFrame type-hinting by subclassing. In the process it also addresses replacing (nasty, horrible, to be avoided) strings in the DF interface with (lovely, discoverable) symbols: http://devanla.com/case-for-inheriting-from-pandas-dataframe.html .

Seems interesting at first blush. Would need to think about it, and its consequences, a bit more.

jjgomezcadenas commented 5 years ago

Very, very interesting, and a relief to realize that the worries of the infatuated functional programmer were not so far off the mark. I can give this a try in the KrCalib code (see above), and experiment a bit with the ideas.

jacg commented 5 years ago

If you go down this path, you should write some metaprogramming utilities that remove the need for repetitive boilerplate, by generating the classes you want from a simple specification. (I now see that the article suggests something along those lines, but it does it in a way I recommend you DO NOT follow: namely generating source code. This is Python's equivalent of poor-man's Lisp macros, and there are usually better ways to write meta-programs in Python.)

Here's a proof of concept showing how you could write a decorator which generates the class for you:

import pandas as pd

# A symbol used to identify columns that should be present in the DataFrame
class Column: pass

# A decorator that can automatically generate (from a simple
# specification based on the class syntax) DataFrame subclasses with
# convenient getattr syntax for accessing columns
def SpecificDataFrame(cls):
    def make_column(name):
        def get(self       ): return self[name]
        def set(self, value):        self[name] = value
        return property(fget=get, fset=set)
    # For every column install a property that translates getattr
    # syntax to getitem syntax
    attrs = { name: make_column(name)
                    if value is Column
                    else value
              for name, value in cls.__dict__.items() }

    new_cls = type(cls.__name__,
                   (pd.DataFrame,) + cls.__bases__,
                   attrs)

    # pandas expects `_constructor` to be a property returning the class
    # to use when an operation produces a new frame
    def _constructor(self):
        return new_cls

    new_cls._constructor = property(_constructor)

    return new_cls

# An example of usage: Create a new DataFrame subclass called
# `ThisAndThat` with two columns: `this` and `that`:
@SpecificDataFrame
class ThisAndThat:
    this = Column
    that = Column

# Construct it just like a normal dataframe
x = ThisAndThat(data=dict(this=range(5), that=list('abcde')))

# Check behaviour
assert x.this[3] ==  3
assert x.that[2] == 'c'

Note that this decorator provides the syntax x.column_name while the article's proposal gives the syntax x[x.column_name]. The latter has two disadvantages:

  1. An extra pair of `[]`.
  2. Repetition of the dataframe. This becomes especially annoying if you use explicit names: compare
    • data_frame_with_explicit_name.column_name
    • data_frame_with_explicit_name[data_frame_with_explicit_name.column_name]
    The latter is a right royal PITA.

jjgomezcadenas commented 5 years ago

An important point. Access to column data by attribute is granted in Pandas. You can do:

df['name_of_column']

But in Pandas you can also do:

df.name_of_column

What you CANNOT access by attribute is the row name. In Pandas you must do:

df.loc['row_name'] 

or

df.iloc[row_number]

What you cannot do either is slice by attribute; there you must use strings:

df[['color', 'food', 'score']]
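
A minimal runnable demonstration of these access patterns (toy data invented for the example):

import pandas as pd

df = pd.DataFrame(dict(color=['red', 'blue'], food=['ham', 'jam'], score=[1, 2]),
                  index=['alice', 'bob'])

assert df.color['alice'] == 'red'     # column by attribute
assert df['color']['alice'] == 'red'  # column by string
assert df.loc['bob'].score == 2       # row by name, via loc
assert df.iloc[1].food == 'jam'       # row by position, via iloc
df[['color', 'score']]                # slicing columns requires strings
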
jacg commented 5 years ago

Hmm, I'm getting confused between Pandas and Pytables again.

But, in that case, the original article is doing something pointless about the stringy API, as Pandas already addresses it.

jjgomezcadenas commented 5 years ago

But, in that case, the original article is doing something pointless about the stringy API, as Pandas already addresses it.

Yes. Pandas addresses columns by symbol (e.g. attributes). One still has the problem of how to specify the names of the columns when subclassing (so that one knows the attributes of the object at hand).

jjgomezcadenas commented 5 years ago

So compare

@dataclass
class KrEvent(Point):
    """Adds raw energy/time"""
    S2e  : Array
    S1e  : Array
    S2q  : Array
    T    : Array  # time
    DT   : Array  # time difference in seconds
    E    : Array
    Q    : Array

with

class KrDF(DataFrame):
    x    = Column
    y    = Column
    z    = Column
    s2e  = Column
    s1e  = Column
    s2q  = Column
    t    = Column
    dt   = Column
    e    = Column
    q    = Column
    @property
    def _constructor(self):
        return KrDF

Improvements:

  1. Pythonic notation.
  2. Object is a DF, with all DF goodies.
  3. Object type is KrDF, OK for type hints.
  4. Relevant fields are declared with the Column notation (nice trick).
  5. All nonsense inheritance eliminated.

Progress, I suppose.

jjgomezcadenas commented 5 years ago

Actually, the construct above doesn't work well. It will produce a DF, and one can access the attributes by... string! If we try krdf.x we obtain (of course) an object of type "Column". Of course, if we do x = "x" as in the post, then we can access krdf[krdf.x], which is ugly.

So it is not really true that you transport the attributes of the DF (e.g. df.x) when you subclass. It appears that doing it by hand, as @jacg proposes, is necessary. If I try with @jacg's decorator I get the desired krdf.x behaviour. Alas, the object returned by @jacg's decorator has unexpected behaviour... for instance it will not print or repr (try it and get a nasty error). Do we need to add __str__ and __repr__ methods?

jacg commented 5 years ago

class KrEvent(Point):
    """Adds raw energy/time"""

Please get out of this habit: docstrings are NOT comments!

if we try krdf.x we obtain (of course) an object of type "Column"

Well, yes, the Column was a marker understood and replaced by the decorator. If you don't use the decorator, don't expect anything interesting to happen to the Column attributes.

Alas, the object returned by @jacg has unexpected behaviour... for instance it will not print or repr (try and get a nasty error).

I can't reproduce that here. Specifically, the x created thus x = ThisAndThat(data=dict(this=range(5), that=list('abcde'))) in my earlier sample, has str and repr which work identically to those of plain dataframes. Please provide a concrete and complete, reproducible code example of something that doesn't work ... otherwise I shall have to conclude that PEBKAC.

So it is not really true that you transport the attributes of the DF (eg, df.x) when you subclass.

This claim appears to be at odds with reality:

import pandas as pd

class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF

data = dict(this=range(5), that=list('abcde'))

d = pd.DataFrame(data=data)
m =         MyDF(data=data)

assert d.that[2] == 'c'
assert m.that[2] == 'c'

jjgomezcadenas commented 5 years ago

Sorry, yesterday I was working too late. Here is an update:

https://github.com/nextic/KrCalibNB/blob/gallery/tutorials/pandas/dataFrames.ipynb

The point is: the decorator works, but nothing prevents the user from ignoring the Columns defined, which in practice become "comments". Take a look at the NB to see what I mean.

jacg commented 5 years ago

What do you mean by 'ignoring the Columns'? The Columns are there as markers for the decorator to do whatever it wants with the information. If you want the object to reject construction with, and/or setting/reading of, attributes which are not explicitly marked with Column, then you have to get the decorator to install that ability into the class it produces. At the moment (being a simple proof of concept) it makes no attempt to restrict columns to the specified set, so it's not surprising that you can create any columns you like.

jacg commented 5 years ago

Here's a proof of concept showing that the decorator can in principle make the resulting class reject undeclared columns at construction time. Again, it's just a proof of concept, so don't expect it to do things which you may consider desirable but which have not been implemented yet! (As a specific example, dataframes admit a squillion different construction styles, and this proof of concept only deals with the single style that is used in the example at the end of the code.)

import pandas as pd

# A symbol used to identify columns that should be present in the DataFrame
class Column: pass

class Complain(Exception): pass

def SpecificDataFrame(cls):
    namespace = {}
    allowed_columns = set()
    for name, value in cls.__dict__.items():
        if value is Column: allowed_columns.add(name)
        else              : namespace[name] = value

    new_cls = type(cls.__name__, (pd.DataFrame,) + cls.__bases__, namespace)

    def _constructor(self):
        return new_cls

    def __init__(self, *args, **kwds):
        illegal_columns = set(kwds['data']).difference(allowed_columns)
        if illegal_columns:
            raise Complain(f"{cls.__name__} does not admit columns: {', '.join(illegal_columns)}")
        super(new_cls, self).__init__(*args, **kwds)

    new_cls._constructor = property(_constructor)
    new_cls.__init__     = __init__

    return new_cls

# An example of usage: Create a new DataFrame subclass called
# `ThisAndThat` with two columns: `this` and `that`:
@SpecificDataFrame
class ThisAndThat:
    this = Column
    that = Column

# Construct it just like a normal dataframe
x = ThisAndThat(data=dict(this=range(5), that=list('abcde')))

# Check behaviour
assert x.this[3] ==  3
assert x.that[2] == 'c'

# Try to construct with data containing undeclared columns: raises Complain
not_allowed = ThisAndThat(data=dict(this=range(5), other=list('abcde')))
# Output: traceback ending in
#     Complain: ThisAndThat does not admit columns: other

There's bound to be a more robust way of implementing this, but this is just a proof of concept.

jacg commented 5 years ago

Edit: fixed missing s in self.columns!

The following should be a much more robust way of detecting illegal columns:

    def __init__(self, *args, **kwds):
        super(new_cls, self).__init__(*args, **kwds)
        present_columns = set(self.columns)
        illegal_columns = present_columns - allowed_columns
        if illegal_columns:
            raise Complain(f"{cls.__name__} does not admit columns: {', '.join(illegal_columns)}")

In other words, let the real DataFrame constructor do its job first, and once that is done, make our subclass complain if that resulted in any undeclared columns being present in the about-to-be-given-to-the-user instance.
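
A quick sanity check of the revised version (a sketch, assuming this __init__ has replaced the earlier one inside the SpecificDataFrame decorator, with ThisAndThat and Complain as defined above):

import pytest

# The keyword style is still rejected ...
with pytest.raises(Complain):
    ThisAndThat(data=dict(this=range(5), other=list('abcde')))

# ... and so is this positional style, which the earlier version
# (which peeked at kwds['data']) could not handle at all.
with pytest.raises(Complain):
    ThisAndThat(dict(this=range(5), other=list('abcde')))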

jacg commented 5 years ago

Now, appealing though subclassing DataFrame might be, it is philosophically the wrong thing to do: subclasses inherit the whole interface of the superclass, and should be usable wherever the superclass is used. Subclasses created with this decorator break lots of things that they inherit from DataFrame (df.T is an obvious example).

In other words, subclasses normally extend the interface of superclasses, while these effectively (though not formally) narrow the interface.

We need to think about whether (as the Zen of Python suggests) practicality sufficiently outweighs purity in this case.

I see 3 alternatives, if we are to pursue this whole idea:

  1. Practicality beats purity: just subclass, and accept that some things are broken; but as we don't want or need them, that doesn't matter and the savings we make by getting the job done quickly outweigh the costs of having nonsense in the interface. (ROOT fans should be a-OK with this!)

  2. Make the decorator explicitly override the things which are broken / not expected to work, and raise sensible exceptions when they are used. This is 'better' but requires more work: we need to identify and explicitly disable everything that breaks; a sketch follows after this list. (But it still leaves crap in the dir.)

  3. Do the morally correct thing: implement these by aggregation. This is even 'better', but also requires more work: you need to identify and explicitly enable everything that should work.

In both 2 & 3 the extra work will need to be done exactly once: in the decorator.
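
A minimal sketch of option 2 (a standalone toy, not wired into SpecificDataFrame; only df.T is disabled here, whereas a real version would enumerate everything that breaks):

import pandas as pd

class NoTranspose(Exception): pass

def disable_transpose(cls):
    # Shadow the inherited attribute with one that complains loudly
    def T(self):
        raise NoTranspose(f'{cls.__name__} does not support transposition')
    cls.T = property(T)
    return cls

@disable_transpose
class Narrowed(pd.DataFrame):
    pass

x = Narrowed(data=dict(this=range(5), that=list('abcde')))
try:
    x.T
except NoTranspose:
    print('Transposition sensibly rejected')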

jjgomezcadenas commented 5 years ago

Do the morally correct thing: implement these by aggregation. This is even 'better', but also requires more work: you need to identify and explicitly enable everything that should work.

Neither the IFP nor the NCT has the foggiest idea what 'implement by aggregation' means.

jacg commented 5 years ago

I haven't the foggiest idea what the IFP and the NCT are.

class ClassThatDoesTheWork:

    def wanted(self):
        return 'Working hard.'

    def unwanted(self):
        return "Our specific thing shouldn't be able to do this!"

class Inherit(ClassThatDoesTheWork):

    def wanted(self):
        return super().wanted() + " Work even harder."

class Aggregate:

    def __init__(self):
        self._thing = ClassThatDoesTheWork()

    def wanted(self):
        return self._thing.wanted() + " Work even harder."

i = Inherit()
a = Aggregate()

assert i.wanted() == a.wanted()
assert not hasattr(a, 'unwanted')
print('So far so good ...')
assert not hasattr(i, 'unwanted') # Oh dear, it's there!

jjgomezcadenas commented 5 years ago

OK, clear. IFP: Infatuated Functional Programmer (@jjgomezcadenas); NCT: Newly Converted Theorist (@mmkekic).

jacg commented 5 years ago

We can automate much of the implementation-as-aggregate with __(get/set)attr[ibute]__ and __dir__, but there will still be tricky issues to resolve. The most obvious of these probably relate to the features which use _constructor[_(sliced/expanddim)].

So we should ask ourselves whether it wouldn't be better to just use plain DataFrames and find a way of providing richer DataFrame type hints, along the lines of DATAFRAME[apple, banana]. Then I look at your example above, and ... well, the thought of seeing type hints that look like DATAFRAME[Array, Array, Array, Array, Array, Array, Array] all over the code makes me feel nauseous. I guess that the solution to that problem is MeaningfulName = DATAFRAME[Array, ..., Array]. But then you're bound to find two different meaningful concepts which happen to have the same structure, and suddenly the type hinting system will consider them to be the same, even though the different names in your annotations make you believe that they are different (both names are bound to the same type object).
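
On the first point, here is a minimal sketch of how __getattr__ and __dir__ might automate the aggregation (NarrowDF and its whitelist are hypothetical; a real version would be generated by the decorator from the Column specification):

import pandas as pd

class NarrowDF:
    # Hypothetical whitelist: the declared columns plus the DataFrame
    # features we explicitly choose to expose
    _allowed = {'this', 'that', 'head', 'query', 'columns'}

    def __init__(self, *args, **kwds):
        # Aggregate a real DataFrame instead of inheriting from it
        self._df = pd.DataFrame(*args, **kwds)

    def __getattr__(self, name):
        # Called only when normal lookup fails: forward whitelisted names
        if name in self._allowed:
            return getattr(self._df, name)
        raise AttributeError(f'NarrowDF does not expose {name!r}')

    def __dir__(self):
        return sorted(self._allowed)

x = NarrowDF(data=dict(this=range(5), that=list('abcde')))
assert x.that[2] == 'c'      # whitelisted: forwarded to the wrapped DataFrame
assert not hasattr(x, 'T')   # not whitelisted: narrowed away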

Here's a demonstration of the idea in Haskell: We have two different types Position and Velocity which have isomorphic underlying representations.

type Position = (Float, Float)
type Velocity = (Float, Float)
type Time     =  Float

move :: Position -> Velocity -> Time -> Position
move (x,y) (vx, vy) dt = (x+vx*dt, y+vy*dt)

p = (1,2)
v = (3,4)
t = 5

new_p = move p p t -- oops: compiles fine, though p is a Position, not a Velocity!

This program compiles and runs, even though we passed a Position as the second argument to move where a Velocity was required. It compiles because Position and Velocity are merely synonyms for (Float, Float). This is equivalent to the following in Python

from typing import Tuple

Position = Tuple[float, float]
Velocity = Tuple[float, float]
Time     = float

def move(p: Position, v: Velocity, t: Time) -> Position:
    ...  # implementation goes here

The human reader thinks that there are two different types, but the type system sees the same type.

Here's an alternative Haskell implementation of the program

newtype Position = P (Float, Float)
newtype Velocity = V (Float, Float)
newtype Time     = T  Float

move :: Position -> Velocity -> Time -> Position
move (P (x,y)) (V (vx, vy)) (T dt) = P (x+vx*dt, y+vy*dt)

p = P (1,2)
v = V (3,4)
t = T  5

new_p = move p p t -- type error in second argument!

This time the compiler spots the problem:

foo.hs:12:16: error:
    • Couldn't match expected type ‘Velocity’
                  with actual type ‘Position’
    • In the second argument of ‘move’, namely ‘p’
      In the expression: move p p t
      In an equation for ‘new_p’: new_p = move p p t

I suspect that Python's type hints machinery doesn't offer this possibility (of distinguishing between isomorphic types), but I'd be happy to be proven wrong.
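
For what it's worth, typing.NewType appears to offer something along these lines: a minimal sketch (the mypy diagnostic in the final comment is approximate):

from typing import NewType, Tuple

# NewType creates types that are distinct for the static checker, even
# though the underlying representations are identical; at runtime each
# NewType is just an identity function
Position = NewType('Position', Tuple[float, float])
Velocity = NewType('Velocity', Tuple[float, float])
Time     = NewType('Time', float)

def move(p: Position, v: Velocity, t: Time) -> Position:
    x, y = p
    vx, vy = v
    return Position((x + vx * t, y + vy * t))

p = Position((1.0, 2.0))
v = Velocity((3.0, 4.0))
t = Time(5.0)

new_p = move(p, p, t)  # mypy: incompatible type "Position"; expected "Velocity"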