traitecoevo / datastorr

Simple data versioning and distribution
https://docs.ropensci.org/datastorr

Feature request: support having multiple data files associated with a release, and be able to fetch a specific one #6

Open daattali opened 8 years ago

ateucher commented 8 years ago

Big :+1: from me too. I'm working on a data package where I would love to use datastorr to fetch the data, but each file is big and I don't want a user to have to download them all just to use one.

richfitz commented 8 years ago

I will try and work up a prototype for this and see if it works. It should be possible with a few changes to the current implementation.

This would actually allow sharding a very large file into 2GB chunks, in addition to the use case with multiple files per release.

thibautjombart commented 8 years ago

+1

richfitz commented 8 years ago

OK, thinking that the configuration could look like this:

{
    "read": "base::readRDS",
    "filename": {
        "iris": "iris.rds",
        "mtcars": "mtcars.rds"
    },
    "index": "index.json"
}

where the files will be given as an associative array so that different files can be retrieved by name. Because in the multi-file case it is likely that some files across releases will share the same hash, I'm thinking that an index (mapping filename -> hash in each release) could help reduce bandwidth consumption. Storage is already content-addressable, so transfer time is the only advantage here.
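To make the bandwidth saving concrete, here is a minimal sketch of how a client could consult such an index before hitting the network; local_store_has, download_release_file and read_from_store are hypothetical helpers, not datastorr functions:

fetch_one <- function(index, name) {
  entry <- index[[name]]   # e.g. list(filename = "iris.rds", hash = "abc123")
  if (!local_store_has(entry$hash)) {
    ## storage is content-addressable, so a hash we already hold
    ## means the file need not be downloaded again
    download_release_file(entry$filename, entry$hash)
  }
  read_from_store(entry$hash)
}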

As for the functions:

datastorr("user/repo")

becomes:

datastorr("user/repo", which="mtcars")

With no argument for which (I'm open to a better name, so long as it does not mention file) we'd download all files and return them as a named list.
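For illustration, usage under the proposed interface might look like this (hypothetical, since the design is still being discussed):

d <- datastorr("user/repo", which = "mtcars")   # one data set
all <- datastorr("user/repo")                   # everything, as a named list
all$iris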

This should address the multiple-file use case as well as the sharding of very large files.

Thoughts?

daattali commented 8 years ago

Maybe this is assumed, but just to clarify -- you should be able to call

datastorr("user/repo", which = c("mtcars", "iris"))

Right?

richfitz commented 8 years ago

I hadn't actually thought about that but you are right - that seems worth supporting.

However, that leaves things in the terrible sapply/strsplit situation: because R lacks true scalars, is the argument to which a single element (in which case we'd return that piece of data directly) or a vector that happens to be length 1 (in which case we should return a list of length 1)?

I really like the idea of being able to depend on the return type, so I'm thinking this can be solved with two different, but mutually exclusive, arguments. name and names are probably decent argument names here.

This also suggests that an additional function, say datastorr_contents, is needed, which will return the names of all datasets stored in a release.
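A minimal sketch of the mutually exclusive name/names idea, reusing the hypothetical fetch_one helper from above and the proposed datastorr_contents; the function is named datastorr2 only to avoid clashing with the real datastorr(), and none of this is the actual implementation:

datastorr2 <- function(repo, name = NULL, names = NULL) {
  if (!is.null(name) && !is.null(names)) {
    stop("Supply at most one of 'name' and 'names'")
  }
  if (!is.null(name)) {
    stopifnot(length(name) == 1L)
    fetch_one(repo, name)                  # the object itself
  } else {
    if (is.null(names)) {
      names <- datastorr_contents(repo)    # default: everything
    }
    setNames(lapply(names, function(nm) fetch_one(repo, nm)), names)
  }
}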

daattali commented 8 years ago

Yeah, the lack of a true scalar type is quite annoying at times. I don't think there's an ideal answer to this question; it's probably a design decision that you're ultimately going to have to call.

I think I'd actually prefer the sapply approach of automatically reducing the data to one dimension where possible by default (and providing an argument to let the user always get a list back, even if the input had length 1). It just sounds less weird to me than having two competing, mutually exclusive arguments that are so similar. But I'm fairly new to the R world, so perhaps you know better than me and experts have already agreed that this approach should not be used? Intuitively, it seems more user-friendly to the average person. And if I've learned anything about R, it's that it seems to value "what the user is likely to want" over "technical correctness" :)

datastorr_contents sounds like a good idea (feature creep, here we come!)

richfitz commented 8 years ago

I'm really not sure what the path of least terribleness is here. The drop argument to [ came from that line of thinking, and it's one of my least favourite R quirks.

The next issue to consider is that the names of files will (probably) change across releases. So I think I will make the index.json compulsory but automatically generated. That will limit the chance of mismatch.

So we'd have:

{
    "repo": "richfitz/data2",
    "read": "base::readRDS",
    "index": "index.json"
}

and the index containing:

{
    "iris": {"filename": "iris.rds", "hash": "<hash>"}
}
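One way that index could be generated automatically at release time, sketched with jsonlite and md5 as a stand-in hash (datastorr need not do it exactly this way):

write_index <- function(files, path = "index.json") {
  index <- lapply(files, function(f) {
    list(filename = basename(f),
         hash = unname(tools::md5sum(f)))   # md5 as a stand-in hash
  })
  names(index) <- sub("\\.rds$", "", basename(files))
  writeLines(jsonlite::toJSON(index, auto_unbox = TRUE, pretty = TRUE), path)
}

write_index(c("iris.rds", "mtcars.rds"))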

thibautjombart commented 8 years ago

My 2 cents on the scalar issue: I suspect that most of the time, if the argument is of length 1, users will want the object itself rather than a list of length 1 containing it, and they will expect a list whenever length(which) > 1. I'd find it a bit weird to have one argument name and another names, but you could instead have an argument simplify (defaulting to TRUE) that triggers this behaviour?
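That could look something like the following sketch of the simplify idea, again leaning on the hypothetical fetch_one and the proposed datastorr_contents:

datastorr3 <- function(repo, which = NULL, simplify = TRUE) {
  if (is.null(which)) {
    which <- datastorr_contents(repo)
  }
  out <- setNames(lapply(which, function(nm) fetch_one(repo, nm)), which)
  if (simplify && length(out) == 1L) out[[1L]] else out
}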

ateucher commented 8 years ago

I don't know if you need another opinion here, but I'd also vote for having only a single which argument. I'm also not opposed to it always returning a list, even if it is of length 1 when only one file is downloaded - as @richfitz said, it's nice to be able to depend on the return type. I like @thibautjombart's suggestion of a simplify argument for when there is only one dataset (I'd vote for the default being FALSE, but my opinion on this is not strong).

noamross commented 8 years ago

It occurs to me that one might have a data set with files of multiple types, in which case you'll want a separate "read": field for each file. Perhaps the main read function should be used by default, unless one is specified for the file?
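That fallback might be resolved roughly like this; the per-entry "read" field and both helper names are assumptions about how the config could be extended, not existing datastorr behaviour:

resolve_read <- function(spec) {
  ## turn a spec like "base::readRDS" into the function it names
  parts <- strsplit(spec, "::", fixed = TRUE)[[1]]
  getExportedValue(parts[[1]], parts[[2]])
}

read_entry <- function(config, entry, path) {
  spec <- if (is.null(entry$read)) config$read else entry$read
  resolve_read(spec)(path)
}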