kubeflow/mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
417 stars
209 forks
issues
ttlSecondsAfterFinished for MPIJob, not only launcher
#644
hy00nc
opened
1 month ago
6
"cleanPodPolicy: All" does not clean up launcher pod
#643
hy00nc
opened
1 month ago
1
Connection reset
#642
bbenshab
closed
1 month ago
4
how could mpijob of mpi operator worker get the hostname of launcher
#641
Oneal65
closed
1 month ago
2
fix #639 provide NCCL tests example
#640
samos123
opened
2 months ago
1
NCCL tests example
#639
samos123
opened
2 months ago
1
Update image tag with 0.5
#638
tenzen-y
closed
2 months ago
2
Upgrade golang and controller-gen
#637
tenzen-y
closed
2 months ago
2
Upgrade golang and controller-gen
#636
alculquicondor
closed
2 months ago
9
Replace original pointer methods with ptr libs
#635
tenzen-y
closed
2 months ago
6
Introduce resource multiplication
#634
tenzen-y
closed
2 months ago
4
Upgrade K8s dependencies to v1.29
#633
tenzen-y
closed
2 months ago
12
Promote @tenzen-y to approver
#632
terrytangyuan
closed
2 months ago
2
Prepare for release 0.5.0
#631
alculquicondor
closed
2 months ago
5
Remove unnecessary RBAC rule for mpijobs-admin***
#630
vishvajit79
opened
3 months ago
2
Bump google.golang.org/protobuf from 1.31.0 to 1.33.0
#629
dependabot[bot]
closed
3 months ago
2
Fix: no overwrite when run launcher as worker
#628
kuizhiqing
closed
3 months ago
1
Deprecated pointer, use ptr instead
#627
kuizhiqing
closed
4 months ago
2
make namespace parsing and informers pluggable
#626
emsixteeen
opened
4 months ago
9
removing klog.Fatalf in favor of a shutdown request
#625
emsixteeen
closed
4 months ago
6
adding Mac .DS_Store to gitignore
#624
emsixteeen
closed
4 months ago
1
update auto gen file year to verify generate
#623
kuizhiqing
closed
4 months ago
2
Fix: add ns filter to podLister
#622
kuizhiqing
closed
4 months ago
3
Wrong host info in discover_hosts.sh
#621
kuizhiqing
closed
4 months ago
0
Running in a subset of namespaces
#620
emsixteeen
opened
4 months ago
8
Fails mpi-operator early if access to list or watch objects is denied
#619
emsixteeen
closed
4 months ago
8
adding timeout for cache sync
#618
emsixteeen
closed
4 months ago
14
fix the condition
#617
wang-mask
opened
5 months ago
12
change1 mv to cp
#616
wang-mask
closed
5 months ago
3
The operator still creates the launcher when launcherCreationPolicy is "WaitForWorkersReady" and suspend is "true"
#615
wang-mask
opened
5 months ago
0
"make generate" command run failed
#614
wang-mask
closed
5 months ago
0
Replace the plain pod workers with Indexed Job
#613
tenzen-y
opened
5 months ago
4
run worker process in launcher pod
#612
kuizhiqing
closed
4 months ago
31
Work with DeepSpeed for large scale training
#611
kuizhiqing
opened
6 months ago
28
add deepspeed example
#610
kuizhiqing
opened
6 months ago
5
Bump golang.org/x/crypto from 0.14.0 to 0.17.0
#609
dependabot[bot]
closed
6 months ago
2
When WaitForWorkersReady is enabled in MPI operator, MPI operator and gang scheduler are in a deadlock
#608
yzhao-2023
opened
6 months ago
4
the object has been modified; please apply your changes to the latest version and try again
#607
gl-001
opened
7 months ago
8
fix bug about status absence when worker pod spec is invalid
#606
congpeiqing
opened
7 months ago
1
which is the latest mpi job definition between mpi-operator and training operator
#605
sxwl-donggang
closed
7 months ago
4
Can't get mpijob status when pod template is invalid
#604
congpeiqing
opened
7 months ago
9
Bumping opentelemetry libraries
#603
tenzen-y
closed
7 months ago
2
Bump go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from 0.35.0 to 0.46.0
#602
dependabot[bot]
closed
7 months ago
4
Fix invalid link for horovod cpu-only example Dockerfile
#601
lianghao208
closed
7 months ago
2
Fix invalid link for horovod cpu-only example
#600
lianghao208
closed
8 months ago
1
Bump google.golang.org/grpc from 1.53.0 to 1.56.3
#599
dependabot[bot]
closed
7 months ago
2
MPI-Operator run example failed
#598
q443048756
opened
8 months ago
8
Bump go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp from 0.35.1 to 0.44.0
#597
dependabot[bot]
closed
7 months ago
3
Update stale examples
#596
jarulsamy
opened
8 months ago
2
Bump golang.org/x/net from 0.10.0 to 0.17.0
#595
dependabot[bot]
closed
8 months ago
1