Think through how we will specify dataset types. Here is the first go at different places we might source data from:
- A collection of files (or a single file) on a local disk
  - The file could be a ROOT file, or perhaps an Awkward Array file written to disk?
- Files located on the grid (to be run on the grid or copied down first)
- An in-memory Python object such as an Awkward Array or NumPy array (?)
New sources may appear eventually. Perhaps there are two classes of dataset source:
- File-based
- Object-based
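One way to picture the two classes is a small hierarchy like the one below. This is only a sketch: the names `DataSource`, `FileSource`, `ObjectSource`, and `describe` are hypothetical and not part of any existing design.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


class DataSource(ABC):
    """Hypothetical common base for anything that can supply data."""

    @abstractmethod
    def describe(self) -> str:
        """Return a short human-readable description of the source."""


@dataclass
class FileSource(DataSource):
    """Data that lives in one or more files (local or on the grid)."""
    paths: List[str]

    def describe(self) -> str:
        return f"file-based: {len(self.paths)} file(s)"


@dataclass
class ObjectSource(DataSource):
    """Data that is already in memory, e.g. an array object."""
    obj: Any

    def describe(self) -> str:
        return f"object-based: {type(self.obj).__name__}"


# Both kinds of source can be handled through the same interface.
sources = [FileSource(["a.root", "b.root"]), ObjectSource([1, 2, 3])]
descriptions = [s.describe() for s in sources]
```

Keeping a shared base class means code further down the chain does not need to care which kind of source it was given.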
File-based sources require special handling: the interface for all file types should be uniform. For example:
data = DataSet('name')
Here the name says whether the data is on the grid, whether it should be run there, whether it is local, and so on. In the past, I've used a URI to specify this: there is already a standard, parameters can be added in a well-understood way, and libraries already exist to parse it!
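To make the URI idea concrete, here is a minimal sketch using Python's standard `urllib.parse`. The `grid` scheme, the `run_on` and `nfiles` parameters, the dataset name, and the `DataSetSpec` / `parse_dataset_uri` names are all made up for illustration; none of them is an agreed convention.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import parse_qs, urlparse


@dataclass
class DataSetSpec:
    """Parsed form of a dataset name (hypothetical structure)."""
    scheme: str                  # where the data lives, e.g. 'file' or 'grid'
    location: str                # file path or dataset identifier
    options: Dict[str, str] = field(default_factory=dict)  # URI parameters


def parse_dataset_uri(uri: str) -> DataSetSpec:
    """Split a dataset URI into scheme, location, and options."""
    parts = urlparse(uri)
    # 'file:///data/events.root' -> netloc '' and path '/data/events.root'
    # 'grid://some.dataset.name' -> netloc 'some.dataset.name' and path ''
    location = parts.netloc + parts.path
    # parse_qs returns a list per key; keep the first value of each.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    # A bare path has no scheme, so treat it as a local file.
    return DataSetSpec(parts.scheme or "file", location, options)


local = parse_dataset_uri("file:///data/events.root")
remote = parse_dataset_uri("grid://some.dataset.name?run_on=local&nfiles=5")
```

With something like this, `DataSet('name')` could dispatch on `scheme` to decide where the data lives, and read `options` to decide where and how to run.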