
Data Store information API for data locality scheduling purposes #530

Open kurman opened 2 years ago

kurman commented 2 years ago

Description

Add support for data store information in torchx.specs.api.Role with the following hierarchy:

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class DataStore:
    # Hive namespace (database) plus the tables a job reads from it.
    namespace: str
    tables: List[DataStoreTable]

@dataclass
class DataStoreTable:
    # Partition columns of the table and the filters that select which
    # partitions the job needs.
    partition_names: List[str]
    filters: List[DataStoreValueFilter]

@dataclass
class DataStoreValueFilter:
    # Select partitions by an exact value or by a value range.
    values: Optional[Union[str, ValueRange]]

@dataclass
class ValueRange:
    min: Optional[str]
    max: Optional[str]
    min_inclusive: bool
    max_inclusive: bool

Motivation/Background

It is more efficient to allocate compute resources close to the persistent data they use. For example, Slurm can run in a multi-cluster configuration, and AWS has geographic regions that incur data transfer costs when data moves between regions.

Detailed Proposal

For a scheduler to select a preferred cluster/region, client code must provide this information upfront so that jobs can be allocated the right resources. The implementation can either live in the scheduler wrapper or pass the information through to the underlying scheduler if it supports the operation.
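A rough sketch of the scheduler-wrapper option (everything here is hypothetical and not part of the TorchX API): the wrapper maps the namespaces declared on a Role to the clusters that host them and picks a placement before submitting the job.

from typing import Dict, List, Optional, Set

# Hypothetical mapping from data namespace -> clusters/regions hosting that data.
NAMESPACE_LOCATIONS: Dict[str, List[str]] = {
    "ads": ["cluster-east"],
    "search": ["cluster-west", "cluster-east"],
}

def pick_cluster(namespaces: List[str], default: str) -> str:
    """Pick a cluster that hosts every requested namespace, else fall back."""
    candidates: Optional[Set[str]] = None
    for ns in namespaces:
        hosts = set(NAMESPACE_LOCATIONS.get(ns, []))
        candidates = hosts if candidates is None else candidates & hosts
    return sorted(candidates)[0] if candidates else default

# e.g. the wrapper would call pick_cluster(["ads"], default="cluster-default")
# with the namespaces declared on the Role before submitting.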

The API assumes the widely adopted Hive data model, primarily its namespaces, tables, and partitions, where all the data within a partition is collocated.

Further, to select the right partitions, the proposed API provides a mechanism for selecting specific partitions either by an exact value or by a value range.
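For illustration, a hypothetical use of the proposed dataclasses selecting a range of date partitions (the partition column and date values are made up):

ds = DataStore(
    namespace="ads",
    tables=[
        DataStoreTable(
            # hypothetical date partition column
            partition_names=["ds"],
            filters=[
                DataStoreValueFilter(
                    values=ValueRange(
                        min="2023-01-01",
                        max="2023-01-31",
                        min_inclusive=True,
                        max_inclusive=True,
                    )
                )
            ],
        )
    ],
)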

Alternatives

Instead of adding the changes to API, it is possible to build custom components that are data and region aware based on specific needs.

d4l3k commented 2 years ago

From offline discussion, my understanding is that DataStore is intended to support Hive paths, though I think we probably want to support other things as well, such as:

Some example data paths that might come up:

For DataStoreValueFilter, is that sufficient to handle arbitrary subfilters?

I'm wondering if it might be better to go the route of DeviceMounts and use Python inheritance with a DataStore base class. That does have issues for JSON serialization, though.

https://github.com/pytorch/torchx/blob/main/torchx/specs/api.py#L306
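A minimal sketch of that alternative (the subclass names and fields are hypothetical, loosely mirroring the DeviceMount pattern linked above):

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataStore:
    """Base class; schedulers dispatch on the concrete subclass."""

@dataclass
class HiveDataStore(DataStore):
    # Hive-style namespace/table with the partitions the job reads.
    namespace: str
    table: str
    partitions: List[str] = field(default_factory=list)

@dataclass
class S3DataStore(DataStore):
    # Plain object-store path, for non-Hive data.
    bucket: str
    prefix: str

JSON serialization of such a polymorphic field would need a type tag to pick the right subclass when deserializing, which is the downside noted above.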