pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

Tracker with custom S3 (Minio) #680

Closed ghpu closed 1 year ago

ghpu commented 1 year ago

🐛 Bug

Following the tracker example app, in fsspec_backend.conf we can specify:

protocol=s3
root_path=s3://my-bucket
key=***
secret=**

In order to use Minio, one needs to pass client_kwargs with an "endpointUrl" member, as a struct:

client_kwargs = {"endpointUrl":"http://myminio:9000"}

But the client_kwargs argument needs to be decoded as an object, not as a str:

File "/lib/python3.10/site-packages/s3fs/core.py", line 361, in set_session
    client_kwargs = self.client_kwargs.copy()
AttributeError: 'str' object has no attribute 'copy'
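As a stopgap until the tracker config supports nested values, one could decode dict-like string values before handing them to s3fs, e.g. with `ast.literal_eval`. This is a hypothetical sketch of such a decoding step, not TorchX's actual config handling:

```python
import ast


def decode_value(raw: str):
    """Return a Python object (dict, list, number, ...) if the raw
    config string parses as a literal; otherwise return it unchanged."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


# A dict-shaped value is decoded into an actual dict, so downstream
# code like s3fs's client_kwargs.copy() works ...
kwargs = decode_value('{"endpointUrl": "http://myminio:9000"}')

# ... while plain values such as protocol=s3 stay strings.
protocol = decode_value("s3")
```

`ast.literal_eval` only accepts Python literals, so a value like `s3://my-bucket` safely falls through as a plain string.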

Module (check all that apply):

Environment info

PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.10 (x86_64)
GCC version: (Ubuntu 12.2.0-3ubuntu1) 12.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.36

Python version: 3.10.7 (main, Nov 24 2022, 19:45:47) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-28-generic-x86_64-with-glibc2.36
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] botorch==0.6.0
[pip3] gpytorch==1.9.1
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.13.1
[pip3] torch-model-archiver==0.7.0
[pip3] torchmetrics==0.10.3
[pip3] torchserve==0.7.0
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[pip3] torchx==0.4.0
[conda] Could not collect

Versions of CLIs:
AWS CLI: N/A
gCloud CLI: None
AZ CLI: None
Slurm: N/A
Docker: 20.10.16, build 20.10.16-0ubuntu1
kubectl: None

torchx dev package versions:
aiobotocore:2.1.0
black:22.3.0
boto3:1.20.24
botorch:0.6.0
captum:0.6.0
flake8:3.9.0
google-api-core:1.34.0
google-cloud-batch:0.7.0
google-cloud-logging:3.4.0
google-cloud-runtimeconfig:0.33.2
gpytorch:1.9.1
hydra-core:1.3.1
ipython:8.8.0
kfp:1.8.9
kfp-pipeline-spec:0.1.17
kfp-server-api:1.8.5
moto:3.0.2
Pygments:2.14.0
pyre-extensions:0.0.21
pytest:7.2.0
pytorch-lightning:1.5.10
requests:2.27.1
requests-oauthlib:1.3.1
requests-toolbelt:0.10.1
strip-hints:0.1.10
torch:1.13.1
torch-model-archiver:0.7.0
torchmetrics:0.10.3
torchserve:0.7.0
torchtext:0.14.1
torchvision:0.14.1
torchx:0.4.0
traitlets:5.8.1
ts:0.5.1
usort:1.0.2

torchx config:
# generated by running
# cd ~/fbsource/fbcode/torchx/fb/example
# torchx configure --all --schedulers local_cwd,mast,flow

[local_cwd]
log_dir = None
prepend_cwd = False

[mast]
hpcClusterUuid = MastProdCluster
runningTimeoutSec = None
hpcIdentity = pytorch_r2p
hpcJobOncall = pytorch_r2p
useStrictName = False
mounts = None
localityConstraints = None
enableGracefulPreemption = False

[flow]
secure_group = pytorch_r2p
entitlement = default
proxy_workflow_image = None

[cli:run]
component = fb.dist.hpc

# TODO need to add hydra to bento_kernel_torchx and make that the default img
[component:fb.dist.ddp]
img = bento_kernel_pytorch_lightning
m = compute_world_size/main.py

[component:fb.dist.ddp2]
img = bento_kernel_pytorch_lightning
m = compute_world_size/main.py

[component:fb.dist.hpc]
img = bento_kernel_pytorch_lightning
m = compute_world_size/main.py

[torchx:tracker]
fsspec=fsspec

[tracker:fsspec]
config=file://fsspec_backend.conf
kurman commented 1 year ago

Hi @ghpu, thank you for filing the issue.

I see this is a feature that we don't have yet. (In fact, the nested config reminds me of the OmegaConf approach of defining nested properties via denormalized keys: https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#from-command-line-arguments)

I would like to revamp the config approach as part of larger changes that are planned. Do you mind sharing your timeline so I can plan those changes better?

ghpu commented 1 year ago

For now, I have cloned the FsspecTracker backend, so I can live without this feature and wait patiently :-)

kiukchung commented 1 year ago

This looks like a pretty simple change to the _read_config() method in the fsspec tracker. Here's a pull request that ought to do it without breaking BC: https://github.com/pytorch/torchx/pull/681. Basically, it enables nested configs via "."-delimited flat keys. For instance, in this case the endpointUrl could be specified as:

protocol=s3
root_path=s3://my-bucket
key=***
secret=**
client_kwargs.endpointUrl=http://myminio:9000

which would be read as the kwargs:

{
  "protocol": "s3",
  "root_path": "s3://my-bucket",
  "key": "***",
  "secret": "***",
  "client_kwargs": {
       "endpointUrl": "http://myminio:9000"
   }
}
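The flat-key expansion described above can be sketched roughly like this (an illustrative reimplementation for clarity, not the actual code in the PR):

```python
def nest_flat_keys(flat: dict) -> dict:
    """Expand '.'-delimited flat keys into nested dicts, e.g.
    {'client_kwargs.endpointUrl': 'x'} -> {'client_kwargs': {'endpointUrl': 'x'}}."""
    nested: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        # Walk/create intermediate dicts for all but the last segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Keys without a "." pass through unchanged, so existing flat configs
# keep working (no BC break).
cfg = nest_flat_keys({
    "protocol": "s3",
    "root_path": "s3://my-bucket",
    "client_kwargs.endpointUrl": "http://myminio:9000",
})
```

Here `cfg["client_kwargs"]` comes out as the dict `{"endpointUrl": "http://myminio:9000"}`, which is what s3fs expects.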