sematic-ai / sematic

An open-source ML pipeline development platform

Capability of mounting underlying node paths in the pod #1064

Closed tscurtu closed 11 months ago

tscurtu commented 11 months ago

Introduces the capability of mounting paths from the underlying node filesystem into the Worker pod.

This is added as a new ResourceRequirements parameter.

The documentation is updated to cover the new capability, and to cover other missing capabilities.
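To make the idea concrete, here is a minimal sketch of how a host path mount request could be translated into the `volumes` and `volumeMounts` fragments of a Kubernetes pod spec. The `HostPathMount` class, its field names, and `to_pod_spec_fragments` are hypothetical and are not taken from this PR; only the `hostPath` / `mountPath` structure of the pod spec is standard Kubernetes.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class HostPathMount:
    """Hypothetical stand-in for a host path mount request (not Sematic's class)."""

    node_path: str       # path on the underlying node's filesystem
    pod_mount_path: str  # where that path appears inside the Worker pod
    name: str            # volume name; must be unique within the pod


def to_pod_spec_fragments(
    mounts: List[HostPathMount],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Build the `volumes` and `volumeMounts` lists for a pod spec."""
    names = [m.name for m in mounts]
    if len(names) != len(set(names)):
        raise ValueError("volume names must be unique within a pod")

    # Each hostPath volume points at a directory on the node...
    volumes = [{"name": m.name, "hostPath": {"path": m.node_path}} for m in mounts]
    # ...and each volumeMount exposes that volume inside the container.
    volume_mounts = [{"name": m.name, "mountPath": m.pod_mount_path} for m in mounts]
    return volumes, volume_mounts
```

The same node path can be mounted at several pod paths, as long as each volume has its own name.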

Testing

The unit tests are updated to cover the new functionality:

$ bazel test //sematic/scheduling/tests:test_kubernetes
$ bazel test //sematic/resolvers/tests:test_resource_requirements

The Testing Pipeline gains a new parameter that tests mounting host paths, and writing to and reading from them inside the Worker pods. The existing ResourceRequirements test, which only covered shared memory mount expansion, is also refactored and extended to check the effects of the expansion.

$ bazel run //sematic/examples/testing_pipeline:__main__ -- \
    --cloud \
    --expand-shared-memory \
    --mount-host-path /tmp /test \
    --mount-host-path /tmp /test2
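A repeatable two-value flag like `--mount-host-path` can be parsed with `argparse` as sketched below. This is an illustration, not the Testing Pipeline's actual argument parser: the `parse_mounts` helper and the `mounts` destination are assumptions, and reading the two values as node path followed by pod mount path is inferred from the logs further down.

```python
import argparse
from typing import List, Optional, Tuple


def parse_mounts(argv: Optional[List[str]] = None) -> List[Tuple[str, str]]:
    """Collect every `--mount-host-path NODE_PATH POD_MOUNT_PATH` occurrence."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mount-host-path",
        nargs=2,          # each occurrence takes exactly two values
        action="append",  # repeated occurrences accumulate into a list
        default=[],
        metavar=("NODE_PATH", "POD_MOUNT_PATH"),
        dest="mounts",
    )
    # Other flags (e.g. --cloud) are ignored in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return [(node_path, pod_path) for node_path, pod_path in args.mounts]
```

With the invocation above, this yields one `(node_path, pod_mount_path)` pair per flag occurrence, so `/tmp` is requested twice under two different pod paths.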

Expanded shared memory Function logs:

[...]
2023-10-04 21:55:25,906 - INFO - sematic.examples.testing_pipeline.pipeline: Executing: add_with_expanded_shared_memory(a=3.0, b=3.0)
2023-10-04 21:55:25,906 - INFO - sematic.examples.testing_pipeline.pipeline: System memory capacity: 7834 MB
2023-10-04 21:55:25,906 - INFO - sematic.examples.testing_pipeline.pipeline: Root partition size: 102387 MB
2023-10-04 21:55:25,906 - INFO - sematic.examples.testing_pipeline.pipeline: Shared memory partition size: 7094 MB
[...]

Manually validated that the shared memory partition size is reported as 64 MB when the flag is not set.
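Figures like the ones in these logs can be gathered with the standard library alone. The sketch below is an assumption about how such a check might look, not the pipeline's code; the helper names are hypothetical, and the rounding may differ from what the logs above use.

```python
import os
import shutil

MB = 1024 * 1024


def partition_size_mb(path: str) -> int:
    """Total size of the filesystem that contains `path`, in MB."""
    return shutil.disk_usage(path).total // MB


def system_memory_mb() -> int:
    """Physical memory visible to the process, in MB (POSIX systems only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // MB


if __name__ == "__main__":
    print(f"System memory capacity: {system_memory_mb()} MB")
    print(f"Root partition size: {partition_size_mb('/')} MB")
    # /dev/shm only exists on Linux, so guard the lookup.
    if os.path.isdir("/dev/shm"):
        print(f"Shared memory partition size: {partition_size_mb('/dev/shm')} MB")
```

Comparing the `/dev/shm` figure with and without the flag is what distinguishes the expanded mount from the 64 MB container default.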

Host path mounts Function logs:

[...]
2023-10-04 21:55:24,685 - INFO - __main__: Executing add_with_host_path_mounts
2023-10-04 21:55:24,685 - INFO - sematic.examples.testing_pipeline.pipeline: Executing: add_with_host_path_mounts(a=3.0, b=3.0, pod_mount_paths=['/test', '/test2'])
2023-10-04 21:55:24,685 - INFO - sematic.examples.testing_pipeline.pipeline: Contents of '/test':
total 4
drwxrwxrwt 8 root root 189 Oct 4 21:55 .
drwxr-xr-x 1 root root 76 Oct 4 21:55 ..
drwxrwxrwt 2 root root 6 Mar 22 2023 .ICE-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .Test-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .X11-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .XIM-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .font-unix
-rw-r--r-- 1 root root 14 Oct 4 21:55 sammy.txt
drwx------ 3 root root 17 Jun 13 14:49 systemd-private-04155cef5fe44407b7976cb03ff52b97-chronyd.service-fBEr7S
2023-10-04 21:55:24,706 - INFO - sematic.examples.testing_pipeline.pipeline: Contents of '/test2':
total 4
drwxrwxrwt 8 root root 189 Oct 4 21:55 .
drwxr-xr-x 1 root root 76 Oct 4 21:55 ..
drwxrwxrwt 2 root root 6 Mar 22 2023 .ICE-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .Test-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .X11-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .XIM-unix
drwxrwxrwt 2 root root 6 Mar 22 2023 .font-unix
-rw-r--r-- 1 root root 14 Oct 4 21:55 sammy.txt
drwx------ 3 root root 17 Jun 13 14:49 systemd-private-04155cef5fe44407b7976cb03ff52b97-chronyd.service-fBEr7S
2023-10-04 21:55:29,714 - INFO - __main__: Finished executing add_with_host_path_mounts
[...]
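
The write, read, and list check seen in these logs can be sketched as follows. This is a hedged illustration rather than the Testing Pipeline's implementation: `exercise_mounts`, the `probe.txt` file name, and the payload are hypothetical.

```python
import os
import tempfile
from typing import List


def exercise_mounts(pod_mount_paths: List[str], payload: str = "hello") -> List[List[str]]:
    """Write a probe file into each mount, read it back, and list the directory."""
    listings = []
    for mount_path in pod_mount_paths:
        probe = os.path.join(mount_path, "probe.txt")  # hypothetical file name
        with open(probe, "w") as f:
            f.write(payload)
        with open(probe) as f:
            if f.read() != payload:
                raise RuntimeError(f"read-back mismatch in {mount_path!r}")
        listings.append(sorted(os.listdir(mount_path)))
    return listings


if __name__ == "__main__":
    # Locally, a temporary directory stands in for a mounted path.
    with tempfile.TemporaryDirectory() as tmp:
        print(exercise_mounts([tmp]))
```

Because `/test` and `/test2` both map to the same node directory in the run above, a file written through one mount also shows up in the listing of the other.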

pwais commented 11 months ago

haha i just did exactly this feature, almost the same way https://github.com/pwais/sematic/blob/e311b10923ae40fa2c070dab5a9cf2f121112c0a/sematic/resolvers/resource_requirements.py#L247

will there be a feature eventually for allowing pods to use persistent volume claims? and/or a CSI to provision storage? for PVCs, that might be a slightly cleaner / more explicit way to use host path volumes. for a CSI, then sematic job pods can start requesting block storage in a more cloud-agnostic way. e.g. there is CSI for NFS, CIFS etc.

also aside: i think a very common solution to the pytorch-needs-more-shm problem is not to expand the container shm but rather just use ipc=host or hostIPC: true in k8s-talk. personally after i added volume support, i was gonna try next to add hostIPC: true as a ResourceRequest or something ...
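
For reference, the two approaches differ only in a few pod spec fields. The sketch below builds both fragments as plain dictionaries; the helper names and the `dshm` volume name are hypothetical, and whether Sematic's shared memory expansion uses exactly the `emptyDir` variant is an assumption.

```python
from typing import Any, Dict


def with_expanded_shm(pod_spec: Dict[str, Any], container_index: int = 0) -> Dict[str, Any]:
    """Mount a memory-backed emptyDir over /dev/shm in one container."""
    spec = dict(pod_spec)
    spec["volumes"] = list(spec.get("volumes", [])) + [
        {"name": "dshm", "emptyDir": {"medium": "Memory"}}
    ]
    containers = [dict(c) for c in spec.get("containers", [])]
    container = containers[container_index]
    container["volumeMounts"] = list(container.get("volumeMounts", [])) + [
        {"name": "dshm", "mountPath": "/dev/shm"}
    ]
    spec["containers"] = containers
    return spec


def with_host_ipc(pod_spec: Dict[str, Any]) -> Dict[str, Any]:
    """Share the node's IPC namespace, including its /dev/shm, with the pod."""
    spec = dict(pod_spec)
    spec["hostIPC"] = True
    return spec
```

The `hostIPC` variant needs no volume at all, but it gives the pod access to the node's IPC namespace, which some clusters restrict by policy.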

pwais commented 11 months ago

Actually i would disagree that a PV / PVC be required. It's super useful to just mount hostpath for a variety of things, and it's part of K8S for good reason.

With PV / PVC you have to create them in the cluster, and that can sometimes be hard to set up right to work with auto-scaling.