daattali opened this issue 8 years ago
I will try and work up a prototype for this and see if it works. It should be possible with the current implementation with the changes that:
This would actually allow sharding a very large file into 2GB chunks, in addition to the use case with multiple files per release.
+1
OK, thinking that the configuration could look like this:

```json
{
  "read": "base::readRDS",
  "filename": {
    "iris": "iris.rds",
    "mtcars": "mtcars.rds"
  },
  "index": "index.json"
}
```
where the files will be given as an associative array so that different files can be retrieved by name. Because in the multi-file case it is likely that some files across releases will have the same hash, I'm thinking that an index (mapping filename -> hash in each release) could help reduce bandwidth consumption. Storage is already content-addressable, so transfer time is the only advantage here.
As for the functions:

```r
datastorr("user/repo")
```

becomes:

```r
datastorr("user/repo", which = "mtcars")
```

With no argument for `which` (open to a better name, so long as it does not mention "file") we'd download all files and return them as a named list.
This should address the multiple-file use case as well as the sharding-very-large-files use case.
Thoughts?
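To make the proposal above concrete, here is a minimal sketch of the dispatch on `which`; `fetch_one()` and the hard-coded file list are placeholders standing in for the real config-driven download machinery, not actual datastorr internals:

```r
## Sketch only: fetch_one() and the file list stand in for the real
## download-and-read machinery driven by the config above.
fetch_one <- function(repo, name) {
  sprintf("<data for %s from %s>", name, repo)  # placeholder payload
}

datastorr <- function(repo, which = NULL) {
  files <- c("iris", "mtcars")           # would come from the config/index
  if (is.null(which)) which <- files     # no 'which': download everything
  stats::setNames(lapply(which, function(w) fetch_one(repo, w)), which)
}

datastorr("user/repo", which = "mtcars")  # named list with one element
datastorr("user/repo")                    # named list with every dataset
```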
Maybe this is assumed, but just to clarify -- you should be able to call

```r
datastorr("user/repo", which = c("mtcars", "iris"))
```

Right?
I hadn't actually thought about that, but you are right - that seems worth supporting.

However, that leaves things in the terrible `sapply`/`strsplit` situation: because R lacks true scalars, is the argument to `which` a single element (in which case we'd return the bit of data there), or is it a vector that happens to be length 1 (in which case we should return a list of length 1)?
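For readers unfamiliar with the quirk being referenced, `sapply()`'s return type depends on what it is given, which is exactly the instability at stake here:

```r
sapply(1:3, function(i) i * 2)     # numeric vector: 2 4 6
sapply(1:3, seq_len)               # results differ in length, so: a list
sapply(integer(0), function(i) i)  # empty input: an empty list
```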
I really like the idea of being able to depend on the return type, so am thinking this can be solved with two different, but mutually exclusive, arguments. `name` and `names` are probably decent argument names here:

- `name`, if given, must be a scalar, and always returns the dataset itself
- `names`, if given, can be a vector of length zero or more, and always returns a named list
- either `name` or `names` (but not both) can be given

This also suggests an additional function, say `datastorr_contents`, is needed which will return the names of all datasets stored in a release.
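A sketch of how that mutual exclusion could be validated; the function name and the returned shape here are hypothetical, only the argument-handling logic is the point:

```r
## Hypothetical argument handling for the name/names proposal.
resolve_query <- function(name = NULL, names = NULL) {
  if (!is.null(name) && !is.null(names)) {
    stop("Provide 'name' or 'names', not both")
  }
  if (!is.null(name)) {
    if (length(name) != 1L) stop("'name' must be a scalar")
    list(query = name, as_list = FALSE)    # caller returns the dataset itself
  } else {
    list(query = names, as_list = TRUE)    # caller always returns a named list
  }
}

resolve_query(name = "mtcars")$as_list              # FALSE
resolve_query(names = c("mtcars", "iris"))$as_list  # TRUE
```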
Yeah, that notion of no scalar value is quite annoying at times. I don't think there's an ideal answer to this question, it's probably a design decision that you're going to have to call ultimately.
I think I'd actually prefer the sapply approach of automatically reducing the data to one dimension if possible by default (and provide an argument to allow the user to always get a list back even if the input had length 1). It just sounds less weird to me than having two competing mutually exclusive arguments that are so similar. But I'm fairly new in the R world so perhaps you know better than me and it's been agreed upon already by experts that this approach should not be used? Intuitively, to me that seems to be more user friendly to the average person. And if I've learned anything about R, it's that it seems to value "what the user is likely to want" over "technical correctness" :)
`datastorr_contents` sounds like a good idea (feature creeping, here we come!)
I'm really not sure what the path of least terribleness is there. The `drop` argument to `[` came from that line of thinking, and it's one of my least favourite R quirks.
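For context, the `drop` quirk in question:

```r
m <- matrix(1:6, nrow = 2)
m[, 1:2]              # still a matrix
m[, 1]                # drop = TRUE by default: silently becomes a bare vector
m[, 1, drop = FALSE]  # a 2x1 matrix; the type is preserved
```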
The next issue to consider is that the names of files will (probably) change across releases. So I think I will make the `index.json` compulsory but automatically generated. That will limit the chance of a mismatch.
So we'd have:

```json
{
  "repo": "richfitz/data2",
  "read": "base::readRDS",
  "index": "index.json"
}
```
and the index containing

```json
{
  "iris": {"filename": "iris.rds", "hash": "<hash>"}
}
```
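Generating that index at release time could be as simple as this sketch; `make_index()` is a hypothetical helper, and `tools::md5sum()` stands in for whatever hash datastorr actually uses:

```r
## Sketch: build one index entry per file. tools::md5sum() is a stand-in
## for the real content hash used by the storage layer.
make_index <- function(files) {
  keys <- sub("\\.rds$", "", basename(files))
  stats::setNames(
    lapply(files, function(f) {
      list(filename = basename(f), hash = unname(tools::md5sum(f)))
    }),
    keys)
}

f <- tempfile(pattern = "iris", fileext = ".rds")
saveRDS(iris, f)
make_index(f)  # one entry: the filename plus its hash
```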
My 2 cents on the scalar issue: I suspect most of the time, if the argument is of length 1, users will want the object rather than a list of length 1 containing it, and they will expect a list whenever `length(which) > 1`. I'd find it a bit weird to have an argument `name` and another `names`, but you could have another argument `simplify` (defaulting to `TRUE`) which triggers this behaviour?
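The suggested `simplify` behaviour could look like this sketch (the name and semantics are just the suggestion above, not an implemented API):

```r
## Hypothetical: unwrap a length-1 result unless the caller opts out.
maybe_simplify <- function(result, simplify = TRUE) {
  if (simplify && length(result) == 1L) result[[1]] else result
}

maybe_simplify(list(mtcars = "data"))         # the object itself
maybe_simplify(list(mtcars = "data"), FALSE)  # still a list of length 1
```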
I don't know if you need another opinion here, but I'd also vote for having only a single `which` argument. I am also not opposed to it always returning a list, even if it is of length 1 when there is only one file being downloaded - as @richfitz said, it's nice to be able to depend on the return type. I like @thibautjombart's suggestion of a `simplify` argument for when there is only one dataset (I'd vote for the default being `FALSE`, but my opinion on this is not strong).
It occurs to me that one might have a data set with files of multiple types, and that you'll want to have a separate `"read":` field for each file. Perhaps the main read function should be used by default, unless one is specified for the file?
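One way that fallback could work: look up a per-file `read` entry and fall back to the top-level default. This is a sketch; `pick_reader()` and the entry shape are assumptions, only the `"pkg::fun"` string format mirrors the config's existing `"read": "base::readRDS"` convention:

```r
## Sketch: per-file reader lookup with a default fallback. Resolves a
## "pkg::fun" string to the actual function.
pick_reader <- function(entry, default = "base::readRDS") {
  fn <- if (!is.null(entry$read)) entry$read else default
  parts <- strsplit(fn, "::", fixed = TRUE)[[1]]
  getExportedValue(parts[[1]], parts[[2]])
}

identical(pick_reader(list()), readRDS)                           # TRUE: fallback
identical(pick_reader(list(read = "utils::read.csv")), read.csv)  # TRUE
```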
Big :+1: from me too. I'm working on a data package where I would love to use datastorr to fetch the data, but each file is big and I don't want a user to have to download them all just to use one.