jennybc / sanesheets

A rant about spreadsheets.
48 stars 1 forks source link

Limitations of the spreadsheet mental model #2

Open gshotwell opened 8 years ago

gshotwell commented 8 years ago

I wanted to write up some thoughts on the spreadsheet mental model in case their helpful. Some background is that I had an excel-based job for a few years before switching to R in Law School, and I found the transition really difficult for the following reasons which I don't see discussed that often:

1) Spreadsheets push you to use the same space for computation and data. This has a lot of problems for reproducibility but it also gives you a mental bias that your calculations should be the same "shape" as your data. What I mean by that is if you multiply a bunch of numbers by 2, you naturally think that you should multiply each number by 2, and that the space in which you store that computation should be the same size as the data. This is different from a programming language which lets you multiply a vector v by using v * 2. Even if the vector is a million rows, you still just need the one formula, but in Excel you would create a million little formulas to do the calculation. This is another way of just saying that spreadsheets make abstraction difficult, but I think a core part of how excel users think: make the calculations the same shape as the data.

2) Spreadsheets don't have time. When you write a script in R there's a sense that your data is being manipulated both in space (you create a new variable) and in time (you change an existing variable). So while most programmers think about their analysis project more temporally, excel user's have trouble with that. Adding in the temporality is kind of what Alterix, and I've often thought that it would be neat to have a similar interface built around pipes in R.

Anyway in my experience those are the two fundamental things that I needed to change about the way I thought about data to move from Excel to R. The tricky thing is that these are the things that both make R hard to learn for an Excel user, and also what makes it a useful thing to do. A lot of the solutions for moving people from spreadsheets to R (like deducer) don't really help with this transition because using R with an Excel mental model is not that much better than using Excel.

jennybc commented 8 years ago

@GShotwell Those are great observations. I too worked intensely in Excel (in management consulting) before deciding to take more math and apply to stat grad school. It was a formative experience actually. I learned that I love data analysis but suspected I would rather do it in a more proper way.

Re 2) and the concept of time. This is both a strength and a weakness of spreadsheets. A flattering way to describe them is that they are reactive. You update data? In general, figures and computed cells also update automatically (ok, maybe not pivot tables, as I recall). To get the same result, we R users need a way to automatically trigger re-computation.

Re 1) YES! And if you have enough savvy to store 2 somewhere and use a cell reference (let's say it's an exchange rate that might change), it's a great dilemma where to store it. Because it doesn't have any natural physical home relative to the main data table.

One behavioural change that would increase spreadsheet sanity is to think of individual worksheets as being either for data or for computation. But never both.

jchrom commented 7 years ago

we R users need a way to automatically trigger re-computation

Those who like reactivity can go for R Notebooks. Not very practical for large data or complex computations (but that's not what Excel people typically need).

it doesn't have any natural physical home

When I have to work with spreadsheets, I always have one sheet called "User input" with a list of key-value pairs. The formulas that use them are in a locked sheet to prevent users (including myself) from mindless editing.

But this is still too brittle. Because there is no assignment operation, the values are indexed by their location (A1) and not by what key is next to them ("Exchange rate" in B1, for instance). So the actual value must never move out of A1, and if you by accident change the content in B1, you will never know what that A1 is for (unless you do some digging in the actual formulas).

It can be kind of solved if you have keys in one column and values in another and use VLOOKUP to find the values by their keys. But that increases the length of already lengthy formulas further in the process.

A saner spreadsheet, I think, would (automatically) give you an overview of all the parameters that need to be set by the user, and display them in a dedicated sheet, where you would only be able to set their values but nothing else.

gshotwell commented 7 years ago

I added some more thoughts on this to my new and buggy Blogdown blog: https://gshotwell.github.io/Gordon_Shotwell/post/r_for_excel_users/