Open kurman opened 2 years ago
From offline discussion, my understanding is that DataStore is intended to support Hive paths, though I think we probably want to support other things as well, such as:
Some example data paths that might come up:
For DataStoreValueFilter, is that sufficient to handle arbitrary subfilters?
I'm wondering if it might be better to go the route of DeviceMounts and use Python inheritance with a DataStore base class. That does have issues for JSON serialization, though:
https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L306
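A minimal sketch of the inheritance route, assuming hypothetical class names (HiveDataStore, S3DataStore are illustrations, not the actual torchx API). It also shows the JSON serialization issue mentioned above: `asdict()` drops the concrete subclass, so the type has to be re-encoded by hand.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DataStore:
    """Hypothetical base class for data store descriptions."""
    path: str

@dataclass
class HiveDataStore(DataStore):
    """Hive-style store: a table with named partition columns."""
    table: str = ""
    partitions: dict = field(default_factory=dict)

@dataclass
class S3DataStore(DataStore):
    """Object-store path with a region, e.g. an S3 bucket/prefix."""
    region: str = ""

def to_json(store: DataStore) -> str:
    # asdict() produces only the field values; the JSON carries no record
    # of which subclass it came from. Embedding a "kind" tag is one common
    # workaround for round-tripping polymorphic dataclasses.
    payload = {"kind": type(store).__name__, **asdict(store)}
    return json.dumps(payload)

hive = HiveDataStore(path="warehouse/events", table="events",
                     partitions={"ds": "2023-01-01"})
```

Without the explicit `kind` tag, a deserializer cannot tell a HiveDataStore payload from an S3DataStore one, which is the serialization wrinkle this approach has to solve.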
Description
Add support for data store information in
torchx.specs.api.Role
with the following hierarchy:
Motivation/Background
It is more efficient to allocate compute resources close to the persistent data they use. For example, Slurm can have a multi-cluster configuration, and AWS has geographic regions that incur data transfer costs between regions.
Detailed Proposal
For the scheduler to select a preferred cluster/region, the client code must provide this information upfront so that jobs are allocated to the right resources. This can be implemented either in the scheduler wrapper, or by passing the information through to the actual scheduler if it supports the operation.
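The scheduler-wrapper variant could look roughly like the sketch below: pick the cluster whose region matches the data's region, falling back to a default. The cluster names and region mapping are assumptions for illustration only.

```python
# Hypothetical cluster -> region mapping; in practice this would come
# from scheduler or deployment configuration.
CLUSTER_REGIONS = {
    "cluster-east": "us-east-1",
    "cluster-west": "us-west-2",
}

def select_cluster(data_region: str, default: str = "cluster-east") -> str:
    """Return the first cluster co-located with the data's region,
    or the default cluster if none matches."""
    for cluster, region in CLUSTER_REGIONS.items():
        if region == data_region:
            return cluster
    return default
```

For example, `select_cluster("us-west-2")` would route the job to `cluster-west`, avoiding cross-region data transfer.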
The API assumes use of the widely adopted Hive data model, primarily:
where all the data within a partition is collocated.
Further, to select the right partitions, the proposed API provides a mechanism for selecting specific partitions either by a specific value or by a range.
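The two selection modes could be sketched as follows. DataStoreValueFilter is the name used in the discussion above; the range filter name, the `matches` method, and the dict-based partition representation are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DataStoreValueFilter:
    """Select partitions whose column equals an exact value."""
    column: str
    value: str

    def matches(self, partition: dict) -> bool:
        return partition.get(self.column) == self.value

@dataclass
class DataStoreRangeFilter:
    """Select partitions whose column falls in an inclusive range."""
    column: str
    low: str
    high: str

    def matches(self, partition: dict) -> bool:
        v = partition.get(self.column)
        return v is not None and self.low <= v <= self.high

# Example: Hive-style date partitions, filtered to January.
partitions = [{"ds": "2023-01-01"}, {"ds": "2023-01-05"}, {"ds": "2023-02-01"}]
rng = DataStoreRangeFilter("ds", "2023-01-01", "2023-01-31")
selected = [p for p in partitions if rng.matches(p)]
```

The range form relies on partition values sorting lexicographically (true for ISO dates); arbitrary subfilter composition, as raised in the comment above, would need an additional and/or combinator on top of these.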
Alternatives
Instead of changing the API, it is possible to build custom components that are data- and region-aware, based on specific needs.