substantic / rain

Framework for large distributed pipelines
https://substantic.github.io/rain/docs/
MIT License

Python API for data type in data objects #18

Closed spirali closed 6 years ago

spirali commented 6 years ago

Python API for data type in data objects

In the current version, a data instance has a data type with two possible values: blob and directory. After some discussion, it seems sensible to store the data type in data objects as well, i.e. the task graph would already contain the data type information.

This is a proposal for how to handle this in the Python API. It is relevant for Python tasks, tasks.execute, and the Program class.

Currently, the user indicates that an output is a directory by setting content_type to 'dir', e.g.:

Output("mydata", content_type="dir")

Here we propose introducing an OutputDir class to indicate a directory output, and reserving Output for the blob data type. Both can be used in remote Python tasks, tasks.execute, and Program.

Output("mydata")  # blob
OutputDir("mydata")  # directory
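
A minimal sketch of how the two output classes might relate. It assumes a hypothetical data_type attribute holding "blob" or "directory"; the attribute name and the subclassing are illustrative choices, not Rain's actual implementation.

```python
class Output:
    """Declares a task output; blob by default (hypothetical sketch)."""

    # Assumed attribute name; the proposal only says that data objects
    # carry a data type, not how it is stored.
    data_type = "blob"

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return "<{} {!r} data_type={}>".format(
            type(self).__name__, self.label, self.data_type)


class OutputDir(Output):
    """Same declaration, but the produced data object is a directory."""

    data_type = "directory"


# The data type is known as soon as the output is declared,
# i.e. before any task has run.
for out in [Output("log"), OutputDir("results")]:
    print(out)
```

With this shape, code that builds the task graph can read out.data_type without inspecting a content_type string.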

To make it symmetric, we can also introduce Input/InputDir for blob/directory inputs. Strictly speaking, this is not necessary, as the data type can be obtained from the provided object (and in the case of Program, the decision about the data type may be postponed). However, the idea is to provide an additional level of "type" checking:

Input("mydata", dataobj=d)  # Fails if 'd' is a data object of directory type
InputDir("mydata", dataobj=d)  # Fails if 'd' is a data object of blob type

In the case of an implicit input, the right data type is derived from the provided data object:

tasks.execute(["du", "-h", d])  # This will work for 'd' being directory or blob
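
The explicit check and the implicit case can be sketched together as follows. DataObject, expected_type, and implicit_input are hypothetical names introduced only for this illustration; how Rain performs the check internally is not specified in this issue.

```python
class DataObject:
    """Stand-in for a task-graph data object (hypothetical)."""

    def __init__(self, data_type):
        self.data_type = data_type  # "blob" or "directory"


class Input:
    """Explicit blob input; rejects data objects of any other type."""

    expected_type = "blob"

    def __init__(self, label, dataobj=None):
        # dataobj may be omitted, e.g. when a Program binds it later,
        # in which case the check is postponed.
        if dataobj is not None and dataobj.data_type != self.expected_type:
            raise TypeError(
                "input {!r} expects a {} data object, got {}".format(
                    label, self.expected_type, dataobj.data_type))
        self.label = label
        self.dataobj = dataobj


class InputDir(Input):
    """Explicit directory input."""

    expected_type = "directory"


def implicit_input(label, dataobj):
    """Pick the input class from the data object's own type."""
    cls = InputDir if dataobj.data_type == "directory" else Input
    return cls(label, dataobj=dataobj)


d = DataObject("directory")
InputDir("mydata", dataobj=d)   # accepted
implicit_input("arg0", d)       # resolves to InputDir
try:
    Input("mydata", dataobj=d)  # blob expected, directory given
except TypeError as exc:
    print(exc)
```

The explicit classes fail early on a mismatch, while the implicit path never fails because it simply follows whatever type the data object declares.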

Alternatives

Input("mydata", data_type=DataType.Blob)
Output("mydata", data_type=DataType.Directory)
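
For comparison, a minimal sketch of the keyword-based alternative. The enum values, the Blob default, and the validation step are assumptions made for this illustration; the issue only shows the call syntax.

```python
from enum import Enum


class DataType(Enum):
    # Member names follow the calls above; the string values are assumed.
    Blob = "blob"
    Directory = "directory"


class Output:
    """Output declaration with the data type passed as a keyword."""

    def __init__(self, label, data_type=DataType.Blob):
        # Rejecting plain strings catches typos such as data_type="dir".
        if not isinstance(data_type, DataType):
            raise TypeError("data_type must be a DataType member")
        self.label = label
        self.data_type = data_type


blob_out = Output("mydata")
dir_out = Output("mydata", data_type=DataType.Directory)
print(blob_out.data_type, dir_out.data_type)
```

This variant keeps a single class per direction at the cost of a longer call, whereas Output/OutputDir makes the data type visible in the class name.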

spirali commented 6 years ago

Implemented in #19