sjackman commented 8 years ago

I'd like to have multiple sheets in one file to keep data and metadata together in one file. This suggests that we need a container format to keep multiple TSV files together in one file. We could use an off-the-shelf container format, such as tar or zip. I would like however for the container format to be a plain text format to enable editing the entire document in a text editor and checking it into version control. I'm going to some requirements below, and then a random assortment of formats that came to mind.

A plain text container file format that contains multiple plain text files
Can contain TSV, CSV and Markdown PSV (pipe-separated values)
Can contain any plain text file, particularly Markdown (and RMarkdown)
Each embedded plain text file has a name (like a sheet name or file name)
Editable by a human with a plain text editor
Aesthetically pleasing

sjackman commented 8 years ago

Tables separated by blank lines

See http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.

Example

# name: sheet1.tsv
A   B
1   X
2   Y

# name: sheet2.csv
C,D,E
3,4,5

# name: sheet3.md
| F | G |
|---|---|
| 0 | 1 |

Advantages

Simple!
miller (aka mlr) supports this file format, though without # comments and they all have to be the same type (all TSV for example).

Disadvantages

It can't contain any file that contains a blank line (since a blank line is the file separator).
Not all programs handle # comments in TSV/CSV files.

sjackman commented 8 years ago

Tables in Markdown

See https://help.github.com/articles/organizing-information-with-tables/

Example

# Tables in *Markdown*

```tsv sheet1
A   B
1   X
2   Y

C,D,E
3,4,5

Table: A pandoc-crossref table caption {#tbl:sheet3}

F	G
0	1



## Advantages
1. The container format (Markdown) is already defined!
2. Markdown editors already exist! Texts http://www.texts.io can edit Markdown tables.
3. Need to teach programs like `mlr`, `datamash`, `readr`, `RStudio` et c to read Markdown files and extract tables

## Disadvantages
1. The container format is Markdown, but it can't really contain a Markdown file.

jennybc commented 8 years ago

Do you think it has to be one file? What if, instead, it were one directory? The way that Git repos or RStudio projects are just regular directories on your computer, with some other, mostly hidden stuff lying around to help Git and RStudio do their jobs. An R package is a special case that has even more specific structure about the files and directories. Maybe that's a model for a sanesheet?

A sanesheet-anticipating tool serves the different files to you via conventional tabs. And tabs have an interact / edit mode appropriate to the file type. As delimited file for the TSVs, as markdown for the notes. So I guess that suggests there has to be a .sanesheets file, like foo.Rproj or the entire .git directory, to facilitate coordinated actions across the whole set of files.

I think the point is that someone who wants to experience a sanesheet like a spreadsheet could. Well, if this mythical inpsector/editor existed! But the code-based analyst could experience it like a subdirectory of delimited files and some notes. And all could enjoy the benefit of version control.

sjackman commented 8 years ago

MIME (Multipurpose Internet Mail Extensions)

See https://en.wikipedia.org/wiki/MIME

Example

MIME-Version: 1.0

Content-type: text/tab-separated-values; name=sheet1.tsv; boundary=END

A   B
1   X
2   Y
--END

Content-type: text/csv; name=sheet2.csv; boundary=END

C,D,E
3,4,5
--END

Content-type: text/markdown; name=sheet3.md; boundary=END

| F | G |
|---|---|
| 0 | 1 |
--END

Advantages

The container format (MIME) is well defined, supported and implemented. brew install ripmime && ripmime -i example.sanesheet will extract three files from this one file.
Can contain any arbitrary file, including binary files (like PNG files!) using base64 encoding.
Need to teach programs like mlr, datamash, readr, RStudio et c to read MIME files.

Disadvantages

Not as aesthetically attractive (as say Markdown), but not too shabby either.

jennybc commented 8 years ago

Multi-part MIME is pretty intriguing.

I've always wanted to interact with ~~spread~~sanesheets via my mail client 😉.

sjackman commented 8 years ago

Directory of files

Example

❯❯❯ ls
sheet1.tsv sheet2.csv sheet3.md
❯❯❯ head *
==> sheet1.tsv <==
A   B
1   X
2   Y

==> sheet2.csv <==
C,D,E
3,4,5

==> sheet3.md <==
| F | G |
|---|---|
| 0 | 1 |

Advantages

Already exists. It's pretty much the current system.
Can contain any arbitrary file type.
Can be tarred or zipped up to create a single file.

Disadvantages

Files (like metadata and data) can easily get separated when someone e-mails the data sheet without including the metadata sheet.
Emailing a directory to someone requires first zipping it up.

sjackman commented 8 years ago

I like "Tables separated by blank lines" for its simplicity, but I believe it has the least support (of the options given) by programs in the wild.

I like "Tables in Markdown" for its aesthetic qualities and existing adoption. Markdown editors already exist, and I believe some already have support for editing tables. This option is probably the most well implemented by existing tools.

I like MIME because the format is simple, well defined and widely implemented. There's lots of existing code/libraries to read/write these files. It can contain binary files, like graphical plots.

sjackman commented 8 years ago

Do you think it has to be one file? What if, instead, it were one directory?

A directory of files certainly has the lowest barrier to entry.

I'd like to bundle up multiple files into a single container file so that they're not easily separated, so that for examples when someone e-mails you the data sheet it doesn't get easily separated from the metadata sheet.

People are accustomed to a single spreadsheet file (XLS for example) that contains all the related sheets of one project. I'd like to give people an option that has this same behaviour, but with a plain text file format.

sjackman commented 8 years ago

Twitter poll! https://twitter.com/sjackman/status/776860608629055488

sjackman commented 8 years ago

JSON (JavaScript Object Notation)

See http://www.json.org

Example

{
"sheet1": {
"A": [1, 2],
"B": ["X", "Y"]
},
"sheet2": {
"C": [3],
"D": [4],
"E": [5]
},
"sheet3": {
"F": [0],
"G": [1]
}
}

Advantages

Well defined format that is intended for storing data.
Can be easily read by R and Python.
Has command-line tool support with jq and mlr.

Disadvantages

Not terribly pretty.
The table looks transposed, with one column per line.

mr-c commented 8 years ago

When you are ready for provenance, attribution, and extensible metadata then http://www.researchobject.org/ will be waiting for you :-)

sjackman commented 8 years ago

A Research Object Bundle is a zip file of arbitrary files (like TSV files) that also contains a JSON file describing the provenance/attribution of those files. https://researchobject.github.io/specifications/bundle/ I'm hoping for a plain-text format that can be committed to git. The unzipped zip file can of course be committed to git, in which case we have a "Directory of files" with an additional .ro/manifest.json file to describe the files.

sjackman commented 8 years ago

@jennybc @hadley Does RStudio have a table editor to edit data frames and TSV files?

jennybc commented 8 years ago

No ... but maybe it could one day!

sjackman commented 8 years ago

CSVY (CSV with YAML frontmatter)

See http://csvy.org

Example

---
name: sheet1.tsv
---
A   B
1   X
2   Y

---
name: sheet2.csv
---
C,D,E
3,4,5

---
name: sheet3.md
---
| F | G |
|---|---|
| 0 | 1 |

Advantages

The format is already defined for a single sheet.
Looks nice and is similar to the familiar Markdown with YAML frontmatter.
Has two R implementations! https://cran.r-project.org/web/packages/csvy/csvy.pdf and https://cran.r-project.org/web/packages/rio/rio.pdf

Disadvantages

YAML frontmatter would likely break existing TSV/CSV parsers. Can be stripped out with

sed '/^---$/,/^---$/d'

sjackman commented 8 years ago

@jennybc Any thoughts/preferences on these file formats? My favourites are

JSON for a machine-readable, well-defined and implemented format.
CSVY for the most attractive human-writable/readable format, though I would use TSVY myself.

I really like the look of TSVY and its similarity to Markdown with YAML front matter. I quite like the idea of storing the code in RMarkdown with YAML frontmatter and storing the data in TSV with YAML frontmatter. I could even see concatenating the code and data files for the occasions where you may want to store both in a single file, and then I think you may have a real competitor for an Excel spreadsheet.

jennybc commented 8 years ago

I have JSON filed along with 😱 XML in my head. As in, if I don't have a nested or recursive structure, why would I go there? It's also not that human readable. For normal humans.

So I like these tab and comma delimited formats with some meta-data in a header, yes. For the bits that are data.

I think it is important to think about the motley assortment of things people park in a spreadsheet. I honestly think the neat packaging of disparate objects is part of what users like. I see a fair number of spreadsheets where the data worksheets really could be csv files. But then there's always the "README as worksheet" lurking at the front or the back.

sjackman commented 8 years ago

Yes, I agree with needing a format to wrap up prose, code, data and report in a single file. Currently that lives in 3+ files: the prose and code in one RMarkdown file, the data in multiple TSV files, and the the rendered report in an HTML file. I'd like a format to stuff all that in a single text file. I prefer having multiple files when building a data analysis pipeline, but when sending a report to a collaborator, I prefer a single file.

jennybc commented 8 years ago

This is why I think your multipart MIME idea is not crazy. This is also why xlsx is a zip archive. Unless people can get comfortable with a directory, you have to shove it all into something 😕.

sjackman commented 8 years ago

I like the MIME idea for storing files of different types all in a single text file: the RMarkdown, the data and the HTML. For the related but simpler problem of how to stuff multiple sheets (TSV tables) in a single file, I prefer the TSV tables separated by YAML frontmatter blocks.

jennybc commented 8 years ago

Do you think it's necessary to stuff multiple sheets in a single file? Why?

sjackman commented 8 years ago

So that double-clicking a single file opens up that file in some pretty app (that doesn't yet exist) showing all the sheets in tabs, giving a similar user experience to an Excel spreadsheet.
So that data and metadata don't get separated when someone e-mails a TSV file of one but omits the other.

sjackman commented 8 years ago

See this Twitter conversation thread: https://twitter.com/BaCh_mira/status/777511003017867264

@BaCh_mira

Directories are problem when people/scripts forget to copy/ftp parts. Specially when /* contains more than they want.

@mike_schatz

tarball?

@BaCh_mira

Then program has to navigate tarball. Back to square one. Or have tar/untar copies: not cool w/ large files (100GiB)

jennybc commented 8 years ago

Oh yes I definitely think this whole bundle of files needs to be packaged in some way yes. I thought you meant that, within that main receptacle, you wanted to get all the TSV files into one file. I misunderstood.

This is why the MIME idea is interesting, because it already anticipates very disparate things, with a pre-existing vocabulary for declaring what things are, signalling where they begin/end, etc.

sjackman commented 8 years ago

In the case when you're bundling all different types of files (TSV, RMarkdown and HTML), I like the MIME format. The TSV in that MIME file can be just simple TSV without YAML blocks.

In the case when you're bundling just TSV files (no RMardkon, no HTML) I prefer one file of TSV tables separated by YAML blocks (no MIME).

If we only cared about bundling tables (sheets) into a single file, I'd prefer the TSV/YAML solution. If we want to tackle the whole enchilada, MIME is looking good.

sjackman commented 7 years ago

Success reading a multipart MIME sanesheet! So excite! Are we on to something?

eg.sanesheet

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=END

Title: An example sanesheet
--END
Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"

A   B
1   X
2   Y
--END
Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"

C   D   E
3   4   5
--END--

read_sanesheet.r

library(purrr)
library(readr)
library(webutils)

multipart <- parse_multipart(read_file("sanesheet.tsv.multipart"), boundary = "END")
sheets <- map(multipart, function(x) read_tsv(x$value))
sheets

$sheet1
  A B
1 1 X
2 2 Y

$sheet2
  C D E
1 3 4 5

read_sanesheet.sh

This sanesheet can also be extracted to one-file-per-sheet at the command line using ripmime.

❯❯❯ brew install ripmime
❯❯❯ ripmime -i eg.sanesheet -d eg
❯❯❯ head eg/*
==> eg/sheet1.tsv <==
A   B
1   X
2   Y

==> eg/sheet2.tsv <==
C   D   E
3   4   5

==> eg/textfile0 <==
Title: An example sanesheet

sjackman commented 7 years ago

@jennybc What do you think of the name and file extension .tidysheet?

hadley commented 7 years ago

I really think that multiple files is the only sane way to go. Would be better treat like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

sjackman commented 7 years ago

It's difficult to e-mail a directory files. I don't have any inside scoop, but I think that's why Apple migrated .pages, .numbers et c from their directories of files (bundles) to flat files, and that's even with special OS support to make the directory look to the user like a single file. A directory of files will eventually need to be zipped up to send to someone. A more likely outcome is that someone e-mail the data sheet and just leave out the metadata file, and the two will be separated.

See this Twitter conversation thread: https://twitter.com/BaCh_mira/status/777511003017867264 https://github.com/jennybc/sanesheets/issues/3#issuecomment-248465568

ODS and XLSX use a zip file of files as the file format. Michael @mr-c mentioned http://www.researchobject.org/, which is a zip of files. zip as a container format is alright, but it's binary and can't easily be edited with a text editor or committed to version control. MIME is a nice container format because it is plain text, and it faithfully represents a directory of files.

To woo/convert spreadsheet users, one key feature currently missing is the ability to store multiple sheets in a single file. I'd like the format of that file to be plain text.

hadley commented 7 years ago

Yes, that's a downside, but I think the downsides of the other options (i.e. one massive opaque file that requires special software to read) are worse.

sjackman commented 7 years ago

Zip files are quite opaque, which is why I'm not a big fan of zip as a container format for plain text files. MIME is quite readable, as far as standard plain-text container formats go. What exactly do you mean by opaque? The contents of the MIME file example in https://github.com/jennybc/sanesheets/issues/3#issuecomment-248517695 are pleasingly transparent in my opinion.

hadley commented 7 years ago

They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

The structure of MIME also does not lead itself to good performance - if you have 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

People don't seem to have issues sharing RStudio projects?

sjackman commented 7 years ago

They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

ripmime for the shell and webutils::parse_multipart for R to name two. rio::import can read a zip of TSV files. I'd be up for submitting a PR to add support for MIME as a container.

sjackman commented 7 years ago

The structure of MIME also does not lead itself to good performance - if you have 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

That's also true of a tar.gz of files. I'm not sure whether zip is indexed. You can get random access to individual files by extracting the MIME container, same as with tar.gz and zip of a directory.

sjackman commented 7 years ago

People don't seem to have issues sharing RStudio projects?

People with the technical ability to create RStudio projects don't have an issue sharing RStudio projects. I'm hoping to target spreadsheet users with two key features:

You can double-click the file and open it in a graphical spreadsheet editor that would show one tab per sheet.
You can e-mail this single file, without having to create an archive of a directory, so that its sheets don't get separated.

hadley commented 7 years ago

If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ...

Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

sjackman commented 7 years ago

Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

Gmail's file size attachment limit is 25 MB. That's good enough I would hazard for many typical Excel spreadsheets without being compressed (though XLSX is compressed). Larger files can be transferred via a link on Dropbox, same for any data file larger than the e-mail attachment limit.

sjackman commented 7 years ago

If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project,

I don't feel that's onerous. Only two operations are needed: create a new blank sheet and delete a sheet. Good old copy-and-paste can be used to copy a sheet between two projects.

sjackman commented 7 years ago

Texts (http://www.texts.io) can edit Markdown tables.

tables md

sjackman commented 7 years ago

rio::import will be able to read multiple tables from an HTML file once https://github.com/leeper/rio/issues/126 is resolved.

danfowler commented 7 years ago

@hadley: I really think that multiple files is the only sane way to go. Would be better treat like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).

@sjackman Sorry to jump in so late, but Tabular Data Packages might be an option here. You can store multiple TSVs in a single directory and define a datapackage.json that provides high-level metadata (e.g. license info); a Table Schema that defines column descriptions, types, and constraints (csvy also uses this); and information about the CSV dialect. For serialization into one file, we have an open issue here: https://github.com/frictionlessdata/specs/issues/132 . Version 1.0 coming super soon.

R implementation: https://github.com/ropenscilabs/datapkg
Python library: https://github.com/frictionlessdata/datapackage-py

Stephen-Gates commented 7 years ago

You may be interested in this new project Data Curator. We're building on top of Data Packages, Comma Chameleon and other goodness. Development starting this week.

sjackman commented 7 years ago

That's very exciting! Thanks for the heads up, Stephen. I'll be a willing beta tester, if you're looking for early outside users.

Stephen-Gates commented 7 years ago

Hope to go Beta at release 0.3.0 should be of some value at that stage for single data files. Everyone is very welcome to test and report issues.

sjackman commented 7 years ago

Excellent. I don't know of a way to watch releases or completed milestones in GitHub. Could you suggest an issue, or create an issue to which I could subscribe, that would be updated/closed when 0.3.0 is released?

Stephen-Gates commented 7 years ago

Watch releases with https://github.com/ODIQueensland/data-curator/releases.atom

sjackman commented 7 years ago

I and I imagine other users don't use an RSS feed reader as part of my workflow. I'd prefer to use the GitHub web interface to watch notifications. Would you consider creating a low-volume locked issue that you comment on once per release, so that users like myself can subscribe to that issue? See for example https://github.com/Linuxbrew/brew/issues/1#issuecomment-313844775

jennybc / sanesheets

Multiple sheets in one file #3

Tables separated by blank lines

Example

Advantages

Disadvantages

Tables in Markdown

Example

MIME (Multipurpose Internet Mail Extensions)

Example

Advantages

Disadvantages

Directory of files

Example

Advantages

Disadvantages

JSON (JavaScript Object Notation)

Example

Advantages

Disadvantages

CSVY (CSV with YAML frontmatter)

Example

Advantages

Disadvantages

eg.sanesheet

read_sanesheet.r

read_sanesheet.sh