guidebooks/store
The home for importable Guidebooks
1 star, 10 forks
issues
fix: remove Items field
#791
Sara-KS
closed
10 months ago
0
feat: Update to mcad v1.34.1 support and torchx 0.6.0
#790
Sara-KS
closed
10 months ago
0
fix: more EOF protection fixes
#789
starpit
closed
1 year ago
0
Update pvc.yaml - add diskfree parameter
#788
ykoyfman
closed
1 year ago
0
fix: ray head init container should print a message when it is done waiting for workers
#787
starpit
closed
1 year ago
0
fix: cpu utilization information may be bogus; switch to cgroup-based stats
#786
starpit
closed
1 year ago
0
fix: increase max log requests for app logs
#785
starpit
closed
1 year ago
0
fix: ray head wait-for-workers initContainer should retry if wait fails
#784
starpit
closed
1 year ago
0
fix: multinic detection was broken; also was hard-wiring name of resource
#783
starpit
closed
1 year ago
0
fix: custodian logs container fails due to unescaped $ in $TAIL
#782
starpit
closed
1 year ago
0
fix: cache ray/torchx helm chart
#781
starpit
closed
1 year ago
0
fix: improve torchx support for running multiple gpus per pod
#780
starpit
closed
1 year ago
0
feat: add some NCCL tweaks
#779
starpit
closed
1 year ago
0
fix: syntax error in multinic for torchx
#778
starpit
closed
1 year ago
0
feat: add multinic support
#777
starpit
closed
1 year ago
0
fix: ray wait for workers initContainer not needed with 0 workers
#776
starpit
closed
1 year ago
0
fix: use initContainer to wait for ray workers
#775
starpit
closed
1 year ago
0
fix: increase ray gcs rpc timeout to 30s
#774
starpit
closed
1 year ago
0
fix: more EOF resiliency fixes for ray and torchx
#773
starpit
closed
1 year ago
0
fix: increase torchx log streaming resilience to network disconnects
#772
starpit
closed
1 year ago
0
fix: wait for ray workers prior to server-side job submit
#771
starpit
closed
1 year ago
0
fix: restore helm delete and increase resilience to network disconnects
#770
starpit
closed
1 year ago
0
fix: avoid helm delete in custodian for now
#769
starpit
closed
1 year ago
0
Revert "fix: avoid use of all-containers in ray log streamer"
#768
starpit
closed
1 year ago
0
fix: all-containers fix should async app logs and sync on ray head logs
#767
starpit
closed
1 year ago
0
Revert "fix: avoid use of all-containers in ray log streamer"
#766
starpit
closed
1 year ago
0
fix: avoid use of all-containers in ray log streamer
#765
starpit
closed
1 year ago
0
fix: increase memory for runtime-env custodian pod
#764
starpit
closed
1 year ago
0
fix: increase memory for ray head logs container
#763
starpit
closed
1 year ago
0
fix: torchx volume mount paths have extra quotes
#762
starpit
closed
1 year ago
0
fix: remove reliance on wget in ray head container
#761
starpit
closed
1 year ago
0
fix: improve custodian memory requests for larger jobs
#760
starpit
closed
1 year ago
0
fix: ignore __pycache__ when bundling up workdir
#759
starpit
closed
1 year ago
0
fix: improve support for pytorch lightning's fsspec[s3] support
#758
starpit
closed
1 year ago
0
fix: do not create gpu custodian container for non-gpu runs
#757
starpit
closed
1 year ago
0
fix: lower memory requests for some of the custodian pods
#756
starpit
closed
1 year ago
0
chore: move custodian to ml/codeflare/custodian
#755
starpit
closed
1 year ago
0
fix: add worker-status to custodian
#754
starpit
closed
1 year ago
0
fix: add runtime-env-setup to custodian
#753
starpit
closed
1 year ago
0
chore: remove old untested 'in-cluster' log aggregator
#752
starpit
closed
1 year ago
0
fix: eliminate newlines from base64
#751
starpit
closed
1 year ago
0
feat: add gpu utilization pod to custodian
#750
starpit
closed
1 year ago
0
feat: add memory utilization pod to custodian
#749
starpit
closed
1 year ago
0
feat: add cpu utilization pod to custodian
#748
starpit
closed
1 year ago
0
fix: use multi-line yaml to improve formatting of logs args
#747
starpit
closed
1 year ago
0
fix: lower custodian logs container 100m/128Mi -> 50m/32Mi
#746
starpit
closed
1 year ago
0
fix: clean up custodian command, and rename container 'logs'
#745
starpit
closed
1 year ago
0
fix: torchx cluster name may end with a dash
#744
starpit
closed
1 year ago
0
fix: owner label default needs to be quoted
#743
starpit
closed
1 year ago
0
fix: add app.kubernetes.io/owner label to pods
#742
starpit
closed
1 year ago
0