Open sjackman opened 8 years ago
See http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
# name: sheet1.tsv
A B
1 X
2 Y
# name: sheet2.csv
C,D,E
3,4,5
# name: sheet3.md
| F | G |
|---|---|
| 0 | 1 |
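The layout above could be split back into its tables with a few lines of code. This is a minimal sketch in Python, assuming each table starts with a `# name: <filename>` comment line exactly as in the example; the function name `split_named_tables` is hypothetical and not part of any existing tool.

```python
def split_named_tables(text):
    """Split text into {name: body} using '# name: <file>' comment lines."""
    tables = {}
    current = None
    for line in text.splitlines():
        if line.startswith("# name:"):
            # A new table starts here; remember its name.
            current = line[len("# name:"):].strip()
            tables[current] = []
        elif current is not None and line.strip():
            # Any non-blank line belongs to the most recent table.
            tables[current].append(line)
    return {name: "\n".join(lines) for name, lines in tables.items()}


example = "\n".join([
    "# name: sheet1.tsv",
    "A\tB",
    "1\tX",
    "2\tY",
    "# name: sheet2.csv",
    "C,D,E",
    "3,4,5",
    "# name: sheet3.md",
    "| F | G |",
    "|---|---|",
    "| 0 | 1 |",
])

tables = split_named_tables(example)
for name, body in tables.items():
    print(name, body.splitlines())
```

Each body can then be handed to whatever reader suits its extension (a TSV reader, a CSV reader, a Markdown parser).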
Miller (`mlr`) supports this file format, though without `#` comments and they all have to be the same type (all TSV for example).
… `#` comments in TSV/CSV files.
See https://help.github.com/articles/organizing-information-with-tables/
# Tables in *Markdown*
```tsv sheet1
A B
1 X
2 Y
```
```csv sheet2
C,D,E
3,4,5
```
Table: A pandoc-crossref table caption {#tbl:sheet3}
F | G |
---|---|
0 | 1 |
## Advantages
1. The container format (Markdown) is already defined!
2. Markdown editors already exist! Texts (http://www.texts.io) can edit Markdown tables.
3. Need to teach programs like `mlr`, `datamash`, `readr`, `RStudio` etc. to read Markdown files and extract tables.
## Disadvantages
1. The container format is Markdown, but it can't really contain a Markdown file.
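Teaching a program to extract tables from Markdown need not be much work for the fenced tables. This is a minimal sketch in Python, assuming each fence carries an info string of the form `tsv sheet1` (format, then sheet name) as in the example; the `csv` variant and the function name `extract_fenced_tables` are assumptions, and pipe tables are not handled.

```python
import csv
import re

TICKS = "`" * 3  # a code-fence marker, built here to avoid nesting fences

# A fenced block whose info string is "<format> <sheet name>", e.g. "tsv sheet1".
FENCE = re.compile(
    rf"^{TICKS}(tsv|csv) (\S+)\n(.*?)^{TICKS}\s*$",
    re.MULTILINE | re.DOTALL,
)


def extract_fenced_tables(markdown):
    """Return {sheet name: list of rows} for every tsv/csv fenced block."""
    sheets = {}
    for fmt, name, body in FENCE.findall(markdown):
        delimiter = "\t" if fmt == "tsv" else ","
        sheets[name] = list(csv.reader(body.splitlines(), delimiter=delimiter))
    return sheets


document = "\n".join([
    "# Tables in *Markdown*",
    "",
    TICKS + "tsv sheet1",
    "A\tB",
    "1\tX",
    "2\tY",
    TICKS,
    "",
    TICKS + "csv sheet2",
    "C,D,E",
    "3,4,5",
    TICKS,
    "",
])

sheets = extract_fenced_tables(document)
print(sheets)
```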
Do you think it has to be one file? What if, instead, it were one directory? The way that Git repos or RStudio projects are just regular directories on your computer, with some other, mostly hidden stuff lying around to help Git and RStudio do their jobs. An R package is a special case that has even more specific structure about the files and directories. Maybe that's a model for a sanesheet?
A sanesheet-anticipating tool serves the different files to you via conventional tabs. And tabs have an interact / edit mode appropriate to the file type. As delimited file for the TSVs, as markdown for the notes. So I guess that suggests there has to be a .sanesheets file, like foo.Rproj or the entire .git directory, to facilitate coordinated actions across the whole set of files.
I think the point is that someone who wants to experience a sanesheet like a spreadsheet could. Well, if this mythical inspector/editor existed! But the code-based analyst could experience it like a subdirectory of delimited files and some notes. And all could enjoy the benefit of version control.
See https://en.wikipedia.org/wiki/MIME
MIME-Version: 1.0
Content-type: text/tab-separated-values; name=sheet1.tsv; boundary=END
A B
1 X
2 Y
--END
Content-type: text/csv; name=sheet2.csv; boundary=END
C,D,E
3,4,5
--END
Content-type: text/markdown; name=sheet3.md; boundary=END
| F | G |
|---|---|
| 0 | 1 |
--END
`brew install ripmime && ripmime -i example.sanesheet` will extract three files from this one file.
Need to teach programs like `mlr`, `datamash`, `readr`, `RStudio` etc. to read MIME files.
Multi-part MIME is pretty intriguing.
I've always wanted to interact with spreadsanesheets via my mail client 😉.
❯❯❯ ls
sheet1.tsv sheet2.csv sheet3.md
❯❯❯ head *
==> sheet1.tsv <==
A B
1 X
2 Y
==> sheet2.csv <==
C,D,E
3,4,5
==> sheet3.md <==
| F | G |
|---|---|
| 0 | 1 |
I like "Tables separated by blank lines" for its simplicity, but I believe it has the least support (of the options given) by programs in the wild.
I like "Tables in Markdown" for its aesthetic qualities and existing adoption. Markdown editors already exist, and I believe some already have support for editing tables. This option is probably the most well implemented by existing tools.
I like MIME because the format is simple, well defined and widely implemented. There's lots of existing code/libraries to read/write these files. It can contain binary files, like graphical plots.
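As one illustration of the existing library support, Python's standard `email` package can write such a container. This is a minimal sketch, assuming the sheets are short ASCII text; the function name `build_sanesheet` and the choice of `attachment` as the disposition are assumptions for illustration, not part of any proposed format.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_sanesheet(sheets):
    """Bundle {filename: (text subtype, text)} into one multipart/mixed document."""
    bundle = MIMEMultipart("mixed", boundary="END")
    for filename, (subtype, text) in sheets.items():
        part = MIMEText(text, subtype)  # e.g. text/tab-separated-values
        part.add_header("Content-Disposition", "attachment", filename=filename)
        bundle.attach(part)
    return bundle.as_string()


container = build_sanesheet({
    "sheet1.tsv": ("tab-separated-values", "A\tB\n1\tX\n2\tY\n"),
    "sheet2.csv": ("csv", "C,D,E\n3,4,5\n"),
})
print(container)

# Round trip: parse the text back and list what it contains.
parsed = message_from_string(container)
names = [part.get_filename() for part in parsed.get_payload()]
types = [part.get_content_type() for part in parsed.get_payload()]
```

The printed container is plain text, so it can be opened in a text editor or committed to version control.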
> Do you think it has to be one file? What if, instead, it were one directory?

A directory of files certainly has the lowest barrier to entry.
I'd like to bundle up multiple files into a single container file so that they're not easily separated; for example, when someone e-mails you the data sheet, it doesn't get separated from the metadata sheet.
People are accustomed to a single spreadsheet file (XLS for example) that contains all the related sheets of one project. I'd like to give people an option that has this same behaviour, but with a plain text file format.
Twitter poll! https://twitter.com/sjackman/status/776860608629055488
{
"sheet1": {
"A": [1, 2],
"B": ["X", "Y"]
},
"sheet2": {
"C": [3],
"D": [4],
"E": [5]
},
"sheet3": {
"F": [0],
"G": [1]
}
}
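The JSON layout above stores each sheet as a set of named columns. This is a minimal sketch in Python of turning that back into rows, assuming every column in a sheet has the same length; the helper name `to_rows` is hypothetical.

```python
import json

document = json.loads("""
{
  "sheet1": {"A": [1, 2], "B": ["X", "Y"]},
  "sheet2": {"C": [3], "D": [4], "E": [5]},
  "sheet3": {"F": [0], "G": [1]}
}
""")


def to_rows(sheet):
    """Convert {column name: values} into a header row followed by data rows."""
    header = list(sheet)
    return [header] + [list(row) for row in zip(*sheet.values())]


rows = {name: to_rows(sheet) for name, sheet in document.items()}
for name, table in rows.items():
    print(name, table)
```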
… `jq` and `mlr`.
When you are ready for provenance, attribution, and extensible metadata then http://www.researchobject.org/ will be waiting for you :-)
A Research Object Bundle is a zip file of arbitrary files (like TSV files) that also contains a JSON file describing the provenance/attribution of those files.
https://researchobject.github.io/specifications/bundle/
I'm hoping for a plain-text format that can be committed to git. The unzipped zip file can of course be committed to git, in which case we have a "Directory of files" with an additional `.ro/manifest.json` file to describe the files.
@jennybc @hadley Does RStudio have a table editor to edit data frames and TSV files?
No ... but maybe it could one day!
See http://csvy.org
---
name: sheet1.tsv
---
A B
1 X
2 Y
---
name: sheet2.csv
---
C,D,E
3,4,5
---
name: sheet3.md
---
| F | G |
|---|---|
| 0 | 1 |
sed '/^---$/,/^---$/d'
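The `sed` command above deletes the YAML blocks; going the other way, a reader can use them to split the file into named sheets. This is a minimal sketch in Python, assuming each block is delimited by lines that are exactly `---` and holds only flat `key: value` pairs (it is not a full YAML parser); the function name `split_frontmatter_tables` is hypothetical.

```python
def split_frontmatter_tables(text):
    """Return a list of (metadata dict, body lines), one per YAML-headed table."""
    sheets = []
    in_header = False
    for line in text.splitlines():
        if line == "---":
            # A '---' line either opens or closes a metadata block.
            in_header = not in_header
            if in_header:
                sheets.append(({}, []))
        elif in_header:
            key, _, value = line.partition(":")
            sheets[-1][0][key.strip()] = value.strip()
        elif sheets and line:
            sheets[-1][1].append(line)
    return sheets


example = "\n".join([
    "---", "name: sheet1.tsv", "---",
    "A\tB", "1\tX", "2\tY",
    "---", "name: sheet2.csv", "---",
    "C,D,E", "3,4,5",
    "---", "name: sheet3.md", "---",
    "| F | G |", "|---|---|", "| 0 | 1 |",
])

sheets = split_frontmatter_tables(example)
for metadata, body in sheets:
    print(metadata["name"], body)
```

Note that the Markdown separator row `|---|---|` is not mistaken for a block delimiter, because only lines equal to `---` are treated as delimiters.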
@jennybc Any thoughts/preferences on these file formats? My favourites are
I really like the look of TSVY and its similarity to Markdown with YAML front matter. I quite like the idea of storing the code in RMarkdown with YAML frontmatter and storing the data in TSV with YAML frontmatter. I could even see concatenating the code and data files for the occasions where you may want to store both in a single file, and then I think you may have a real competitor for an Excel spreadsheet.
I have JSON filed along with 😱 XML in my head. As in, if I don't have a nested or recursive structure, why would I go there? It's also not that human readable. For normal humans.
So I like these tab and comma delimited formats with some meta-data in a header, yes. For the bits that are data.
I think it is important to think about the motley assortment of things people park in a spreadsheet. I honestly think the neat packaging of disparate objects is part of what users like. I see a fair number of spreadsheets where the data worksheets really could be csv files. But then there's always the "README as worksheet" lurking at the front or the back.
Yes, I agree with needing a format to wrap up prose, code, data and report in a single file. Currently that lives in 3+ files: the prose and code in one RMarkdown file, the data in multiple TSV files, and the rendered report in an HTML file. I'd like a format to stuff all that in a single text file. I prefer having multiple files when building a data analysis pipeline, but when sending a report to a collaborator, I prefer a single file.
This is why I think your multipart MIME idea is not crazy. This is also why xlsx is a zip archive. Unless people can get comfortable with a directory, you have to shove it all into something 😕.
I like the MIME idea for storing files of different types all in a single text file: the RMarkdown, the data and the HTML. For the related but simpler problem of how to stuff multiple sheets (TSV tables) in a single file, I prefer the TSV tables separated by YAML frontmatter blocks.
Do you think it's necessary to stuff multiple sheets in a single file? Why?
See this Twitter conversation thread: https://twitter.com/BaCh_mira/status/777511003017867264
@BaCh_mira
Directories are problem when people/scripts forget to copy/ftp parts. Specially when /* contains more than they want.
@mike_schatz
tarball?
@BaCh_mira
Then program has to navigate tarball. Back to square one. Or have tar/untar copies: not cool w/ large files (100GiB)
Oh yes, I definitely think this whole bundle of files needs to be packaged in some way. I thought you meant that, within that main receptacle, you wanted to get all the TSV files into one file. I misunderstood.
This is why the MIME idea is interesting, because it already anticipates very disparate things, with a pre-existing vocabulary for declaring what things are, signalling where they begin/end, etc.
In the case when you're bundling all different types of files (TSV, RMarkdown and HTML), I like the MIME format. The TSV in that MIME file can be just simple TSV without YAML blocks.
In the case when you're bundling just TSV files (no RMarkdown, no HTML) I prefer one file of TSV tables separated by YAML blocks (no MIME).
If we only cared about bundling tables (sheets) into a single file, I'd prefer the TSV/YAML solution. If we want to tackle the whole enchilada, MIME is looking good.
Success reading a multipart MIME sanesheet! So excite! Are we on to something?
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=END
Title: An example sanesheet
--END
Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"
A B
1 X
2 Y
--END
Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"
C D E
3 4 5
--END--
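library(purrr)
library(readr)
library(webutils)
# Split the multipart MIME file into its parts, using the boundary declared in its header.
multipart <- parse_multipart(read_file("sanesheet.tsv.multipart"), boundary = "END")
# Parse the body of each part as a TSV table.
sheets <- map(multipart, function(x) read_tsv(x$value))
sheets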
$sheet1
A B
1 1 X
2 2 Y
$sheet2
C D E
1 3 4 5
This sanesheet can also be extracted to one-file-per-sheet at the command line using `ripmime`.
❯❯❯ brew install ripmime
❯❯❯ ripmime -i eg.sanesheet -d eg
❯❯❯ head eg/*
==> eg/sheet1.tsv <==
A B
1 X
2 Y
==> eg/sheet2.tsv <==
C D E
3 4 5
==> eg/textfile0 <==
Title: An example sanesheet
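The same container can be read with Python's standard `email` package, with no extra tools installed. This is a minimal sketch that mirrors the R example above; the function name `read_sheets` is hypothetical, and the input is rebuilt inline here (with tab-separated columns assumed) so the example is self-contained.

```python
import csv
from email import message_from_string

SANESHEET = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=END",
    "Title: An example sanesheet",
    "",
    "--END",
    'Content-Disposition: file; name="sheet1"; filename="sheet1.tsv"',
    "",
    "A\tB",
    "1\tX",
    "2\tY",
    "--END",
    'Content-Disposition: file; name="sheet2"; filename="sheet2.tsv"',
    "",
    "C\tD\tE",
    "3\t4\t5",
    "--END--",
    "",
])


def read_sheets(text):
    """Parse a multipart document into (title, {filename: rows})."""
    message = message_from_string(text)
    sheets = {}
    for part in message.get_payload():
        # Each part's body is one TSV table.
        rows = csv.reader(part.get_payload().splitlines(), delimiter="\t")
        sheets[part.get_filename()] = list(rows)
    return message["Title"], sheets


title, sheets = read_sheets(SANESHEET)
print(title)
for name, rows in sheets.items():
    print(name, rows)
```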
@jennybc What do you think of the name and file extension `.tidysheet`?
I really think that multiple files is the only sane way to go. It would be better to treat it like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).
It's difficult to e-mail a directory of files. I don't have any inside scoop, but I think that's why Apple migrated .pages, .numbers etc. from their directories of files (bundles) to flat files, and that's even with special OS support to make the directory look to the user like a single file. A directory of files will eventually need to be zipped up to send to someone. A more likely outcome is that someone e-mails the data sheet and just leaves out the metadata file, and the two will be separated.
See this Twitter conversation thread: https://twitter.com/BaCh_mira/status/777511003017867264 https://github.com/jennybc/sanesheets/issues/3#issuecomment-248465568
ODS and XLSX use a zip file of files as the file format. Michael @mr-c mentioned http://www.researchobject.org/, which is a zip of files. zip as a container format is alright, but it's binary and can't easily be edited with a text editor or committed to version control. MIME is a nice container format because it is plain text, and it faithfully represents a directory of files.
To woo/convert spreadsheet users, one key feature currently missing is the ability to store multiple sheets in a single file. I'd like the format of that file to be plain text.
Yes, that's a downside, but I think the downsides of the other options (i.e. one massive opaque file that requires special software to read) are worse.
Zip files are quite opaque, which is why I'm not a big fan of zip as a container format for plain text files. MIME is quite readable, as far as standard plain-text container formats go. What exactly do you mean by opaque? The contents of the MIME file example in https://github.com/jennybc/sanesheets/issues/3#issuecomment-248517695 are pleasingly transparent in my opinion.
They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?
The structure of MIME also does not lend itself to good performance - if you have a 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.
People don't seem to have issues sharing RStudio projects?
> They are pleasingly transparent to you as a programmer, but what existing tools can easily extract data out of a file of that nature?

`ripmime` for the shell and `webutils::parse_multipart` for R, to name two. `rio::import` can read a zip of TSV files. I'd be up for submitting a PR to add support for MIME as a container.
> The structure of MIME also does not lend itself to good performance - if you have a 200 meg csv file followed by a markdown readme, you'll have to scan all 200 megs of lines to find the md file.

That's also true of a `tar.gz` of files. I'm not sure whether `zip` is indexed. You can get random access to individual files by extracting the MIME container, same as with `tar.gz` and `zip` of a directory.
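On the question of indexing: a zip archive does keep an index, a central directory stored at the end of the file that records each member and its offset, so a reader can open one member without decompressing the others. A `tar.gz` has no such index. This is a minimal sketch in Python using the standard `zipfile` module and an in-memory archive, with made-up member contents for illustration.

```python
import io
import zipfile

# Build a small archive in memory; the member contents are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sheet1.tsv", "A\tB\n1\tX\n2\tY\n")
    archive.writestr("README.md", "| F | G |\n|---|---|\n| 0 | 1 |\n")

# Opening the archive reads the central directory, not the member data.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    # Read just the README, without touching sheet1.tsv.
    readme = archive.read("README.md").decode("utf-8")

print(names)
print(readme)
```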
> People don't seem to have issues sharing RStudio projects?

People with the technical ability to create RStudio projects don't have an issue sharing RStudio projects. I'm hoping to target spreadsheet users with two key features:
If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ...
Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.
> Also for anything other than trivial data, you'll need to compress the contents in order to email, so that means you'll need a layer on top of mime.

Gmail's file size attachment limit is 25 MB. That's good enough I would hazard for many typical Excel spreadsheets without being compressed (though XLSX is compressed). Larger files can be transferred via a link on Dropbox, same for any data file larger than the e-mail attachment limit.
> If you use a single file format, you also need to expose UI for everything you can do with your file browser: delete sheets, copy sheets from another project, ...

I don't feel that's onerous. Only two operations are needed: create a new blank sheet and delete a sheet. Good old copy-and-paste can be used to copy a sheet between two projects.
Texts (http://www.texts.io) can edit Markdown tables.
`rio::import` will be able to read multiple tables from an HTML file once https://github.com/leeper/rio/issues/126 is resolved.
@hadley: I really think that multiple files is the only sane way to go. It would be better to treat it like an RStudio project, with one metadata file that people click on to launch the whole business. Then you could also mingle in other files (like R scripts etc).
@sjackman Sorry to jump in so late, but Tabular Data Packages might be an option here. You can store multiple TSVs in a single directory and define a `datapackage.json` that provides high-level metadata (e.g. license info); a Table Schema that defines column descriptions, types, and constraints (csvy also uses this); and information about the CSV dialect. For serialization into one file, we have an open issue here: https://github.com/frictionlessdata/specs/issues/132 . Version 1.0 coming super soon.
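To make the three pieces (high-level metadata, Table Schema, CSV dialect) concrete, here is a rough sketch in Python that builds a descriptor for the example sheets and prints it as JSON. The field names reflect my reading of the Data Package specification and should be checked against the current spec; the package name, licence, and column types are made up for illustration.

```python
import json

# Hypothetical descriptor for a directory holding sheet1.tsv and sheet2.csv.
descriptor = {
    "name": "example-sanesheet",           # assumed package name
    "licenses": [{"name": "CC0-1.0"}],     # high-level metadata
    "resources": [
        {
            "name": "sheet1",
            "path": "sheet1.tsv",
            "dialect": {"delimiter": "\t"},  # CSV dialect information
            "schema": {                      # Table Schema: column names and types
                "fields": [
                    {"name": "A", "type": "integer"},
                    {"name": "B", "type": "string"},
                ]
            },
        },
        {
            "name": "sheet2",
            "path": "sheet2.csv",
            "dialect": {"delimiter": ","},
            "schema": {
                "fields": [
                    {"name": "C", "type": "integer"},
                    {"name": "D", "type": "integer"},
                    {"name": "E", "type": "integer"},
                ]
            },
        },
    ],
}

text = json.dumps(descriptor, indent=2)
print(text)  # this text would be saved as datapackage.json next to the data files
```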
You may be interested in this new project Data Curator. We're building on top of Data Packages, Comma Chameleon and other goodness. Development starting this week.
That's very exciting! Thanks for the heads up, Stephen. I'll be a willing beta tester, if you're looking for early outside users.
We hope to go beta at release 0.3.0; it should be of some value at that stage for single data files. Everyone is very welcome to test and report issues.
Excellent. I don't know of a way to watch releases or completed milestones in GitHub. Could you suggest an issue, or create an issue to which I could subscribe, that would be updated/closed when 0.3.0 is released?
Watch releases with https://github.com/ODIQueensland/data-curator/releases.atom
I (and, I imagine, other users) don't use an RSS feed reader as part of my workflow. I'd prefer to use the GitHub web interface to watch notifications. Would you consider creating a low-volume locked issue that you comment on once per release, so that users like myself can subscribe to that issue? See for example https://github.com/Linuxbrew/brew/issues/1#issuecomment-313844775
I'd like to have multiple sheets in one file to keep data and metadata together in one file. This suggests that we need a container format to keep multiple TSV files together in one file. We could use an off-the-shelf container format, such as tar or zip. I would like, however, for the container format to be a plain text format to enable editing the entire document in a text editor and checking it into version control. I'm going to list some requirements below, and then a random assortment of formats that came to mind.