jennybc opened this issue 8 years ago
Thanks for the rant and for creating this issue.
No one should listen to anything I say about spreadsheets! I have one that is truly horrible -- lots of cross-sheet formulas and hyperlinks, F2 comments, etc. Perhaps I overvalue human-readability because I have printed so many pages for my lab notebook over the years. That is another reason I prefer tab as a delimiter, and why I like the LOCF and the drag-to-fill "riding the rails" thing.
Here is a nice example of disciplined spreadsheet use when more sophisticated approaches become unworkable: www.zibraproject.org/blog/metadata-wrangling/ Undoubtedly you have seen Karl's advice, which maps to your list of desired properties pretty well. http://kbroman.org/dataorg/
@jshoyer That Zibra link is great!
Data nerds tend to go on a hallucinatory Vision Quest when cooking up data management strategies for execution by civilians.
It's very telling that the "AWS instance running a RethinkDB server + the app Chateau to provide a web UI + Python interaction scripts for upload" was DOA once it hit the reality of sketchy internet and actual human beings treating Zika patients. So what do they fall back to? Excel, of course! Because it actually works. I'd love to see a sanesheet-compliant tool for data entry and marshalling that enforces the schema described there.
Awesome project! Two thoughts:
1) I think a big part of the appeal of spreadsheets is that a lot of people already know how to use them, and some of them have very sophisticated mental models which rely on the spreadsheet form of computation. The $64,000 question IMO is how to leverage parts of that knowledge (so that they switch from spreadsheets to the new tool) without having the new tool do the same bad things that spreadsheets do. Too many of the programmatic alternatives to Excel (Tableau for instance) basically keep the mental model behind spreadsheet computation. This means they just replicate all the same spreadsheet problems.
2) Spreadsheets are amazing data entry tools. I think this is the first layer of humus for an R-ish spreadsheet solution: just something that lets people do data entry in a seamless way from their R environment. This could either produce a .csv file, or just the R code which would generate a data.frame. I definitely find myself switching to a spreadsheet when I need to create a lookup table or something like that; it's just so much easier than typing `lookup <- data.frame(a = ..., b = ...)` (see the sketch below).
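For what it's worth, base R already gets part of the way there. A minimal sketch of the data-entry idea, using only `edit()`, `dput()`, and `write.csv()` (the `lookup.csv` file name is just an example):

```r
# Sketch: spreadsheet-style data entry from inside R, no external tool required.
# edit() opens R's built-in data editor (interactive sessions only).
lookup <- edit(data.frame(a = character(), b = numeric()))  # type values by hand

dput(lookup)                                        # emits R code that recreates `lookup`
write.csv(lookup, "lookup.csv", row.names = FALSE)  # or persist it as plain CSV
```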
(Humus instead of hummus gives a totally different mental image 😉)
I clearly think we should be pursuing software targets which are as important to the data science ecosystem as topsoil is to the regular ecosystem. :)
I wonder if barring formulas might be too hasty. If the actual backend data format is just text, formulas could be fine if you require that they apply to a whole column and only a whole column. The application would apply the formula and simply populate the CSV with the appropriate values. The end result is still static.
My rationale comes from thinking about people pulling the resulting text file into things other than R. It may well be useful for them to have already created variables (columns) that are simple functions of other columns when, say, sucking the CSV into SPSS.
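To make that concrete, here is a minimal base R sketch of "materialize the whole-column formula into static text" (the file and column names are invented for illustration):

```r
# Sketch: the "formula" lives in code, but the distributed CSV holds only literal values.
raw <- read.csv("measurements.csv")            # hypothetical input with weight_kg, height_m
raw$bmi <- raw$weight_kg / raw$height_m^2      # whole-column derived variable
write.csv(raw, "measurements_derived.csv", row.names = FALSE)  # static result, SPSS-friendly
```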
@joranE OK but barring formulas = not having to implement formulas 😃. Difficult to say if this is great wisdom or laziness or both.
The other issue with restricting formulas (which is a good restriction from a reproducibility perspective) is that most Excel users aren't used to working with that model (whole-column formulas are called array formulas in Excel, I think).
When I read Jenny's tweet, I assumed that she was referring specifically to using Excel spreadsheets as a format for distributing data sets, rather than their myriad other uses. This is the use of Excel files that most puzzles me. The way I see it, data sets should be in CSV files, with the only metadata in the first row, and everything else somewhere else, like a codebook. But, instead, my R scripts are littered with lines like
```r
wb <- wb[-c(9417:9421), ]  # trim last 5 non-data rows
```
to deal with the random notes folks have left at the bottom or top. (And, with each release, I have to inspect the file to find the cruft and hack it off again. Or I may just load the file into OpenOffice or Excel, clean it, and export to CSV before I use it. Either way, wasted effort.)
There are things that CSV doesn't do that would be useful for this purpose, like declaring data types or validation restrictions. It might also be nice to be able to supply other metadata. As I understand it, this is what formats like Stata's .dta format accomplish. But .dta doesn't degrade gracefully: if you're not running Stata, or some other software that understands the format, the file is useless to you. (Also, it doesn't look friendly for manipulating with grep, sed, etc., as CSV is.) So, to my way of thinking, the ideal format is CSV plus something, so that the CSV file is always there, just as the gods intended it. A lot of things would work for the "plus", and a lot of things would work for bundling it all up, so long as they are things that are open and preferably human-readable/editable (for some definition of "human"). And this would also mean multiple files (even if bundled in an archive).
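As a hedged illustration of the "CSV plus something" shape (the file names, the sidecar layout, and the handful of supported classes below are all invented, not a proposed standard):

```r
# Sketch: data.csv holds only data; types.csv is a tiny sidecar declaring column classes.
types <- read.csv("types.csv", stringsAsFactors = FALSE)  # columns: name, class
dat   <- read.csv("data.csv",  stringsAsFactors = FALSE)

# Coerce each declared column; anything not listed just stays character.
for (i in seq_len(nrow(types))) {
  col <- types$name[i]
  dat[[col]] <- switch(types$class[i],
                       integer = as.integer(dat[[col]]),
                       numeric = as.numeric(dat[[col]]),
                       Date    = as.Date(dat[[col]]),
                       dat[[col]])
}
str(dat)
```

The CSV stays greppable and readable on its own; the sidecar only adds information for tools that choose to look at it.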
It seems to me that something with these modest features would be pretty easy to support (since the tools exist to handle the pieces already) and it would be useful to everyone right out of the gate (since the key thing, the data itself, is in a CSV file that every sane stats program already knows how to read). Once you move beyond that, though, I wonder whether it becomes reinventing the wheel. What is the use case not already handled by ODF, Shiny, or some other open format? Am I missing something about what a full-blown sanesheet would do, or how it would do it differently?
One more thing about bundling multiple files. I think multiple files, optionally in an uncompressed tarball, is the best approach, because that makes for efficient deduplication when storing in a repo, transmitting via rsync, backing up, etc. The tarball can always be compressed as the last step in distribution, but compression shouldn't be required by the specification.
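A tiny sketch of that bundling step using base R's `utils::tar()` (the file names are placeholders):

```r
# Bundle the pieces as an *uncompressed* tar archive; compress only at distribution time.
utils::tar("dataset-bundle.tar",
           files = c("data.csv", "types.csv", "codebook.txt"),
           compression = "none")
```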
Edited to add: I didn't know about CSVY before reading the other issue here where it's mentioned. The two-file version of this is exactly the sort of thing I had in mind. And maybe it already accomplishes everything I was thinking about?
Issue of delimiter: this has been debated before, on Twitter I think? Steen Hoyer advocates tab because he says it is advantageous for copy/paste reasons.
Drag to fill: I said it is useful for populating cells when an empty cell might otherwise mean "copy the value from above or from the left". Steen Hoyer says it is more useful for auto-numbering. He leaves empty cells (!!!) and uses `tidyr::fill()`.
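For reference, a minimal sketch of that leave-blank-then-fill workflow (toy data, assuming the tidyr package is installed):

```r
library(tidyr)

# Toy data entry: the subject id is typed only on the first row of each block.
d <- data.frame(subject = c("A", NA, NA, "B", NA),
                value   = c(1, 2, 3, 4, 5))

fill(d, subject)   # carries the last non-missing subject downward (i.e. LOCF)
```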
Hadley reinforced the existing rant re: registering hand edits and patches via, e.g., git.
Hadley brought up an interesting issue and distinction: auto-formatting (good) vs. auto-conversion or type imputation (monstrous).
Hadley suggested not allowing special characters in col names. Also "automatic LOCF" but I don't know what that means. Aaah it means "last observation carried forward", which already comes up above. I've always called that "shit trickles down" 💩.
cc @hadley @earino @jshoyer