google/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0
67 stars
15 forks
issues
Fix debug logging (--enable-debug-logs)
#183
Obliviour
closed
3 days ago
0
Fixes a typo in the base command description
#182
lukebaumann
closed
3 days ago
0
Fix debug logging
#181
Obliviour
closed
3 days ago
1
Add Zarr Flag for Pathways
#180
SujeethJinesh
closed
1 week ago
1
v6e device support
#179
Obliviour
opened
1 week ago
1
Added advanced usage example for a notebook interacting with a Cloud …
#178
nhira
closed
1 week ago
0
Enabling Workload Identity and GCSFuse driver flags added.
#177
sharabiani
closed
2 days ago
1
Pbundyra refactor commands
#176
PBundyra
closed
2 weeks ago
0
Create `commands` package and `core/` modules for NAP, Kueue and Pathways
#175
PBundyra
closed
2 weeks ago
0
Add quotes even to example output to help devs who copy commands from…
#174
nhira
closed
3 weeks ago
0
Add quotes even in example output to help devs who copy commands from the example output comments
#173
nhira
closed
3 weeks ago
0
Update RxDM image version from v1.0.8 to v1.0.9.
#172
yangyuwei
closed
3 weeks ago
0
Update RxDM image version from v1.0.8 to v1.0.9.
#171
yangyuwei
closed
3 weeks ago
1
Move SystemCharacteristics to a separate module
#170
PBundyra
closed
2 weeks ago
0
Allow debug_dump_gcs to be specified with other XLA_FLAGS
#169
jonb377
opened
4 weeks ago
0
Create `parser` package. Move logic from `xpk.py` to `parser` package.
#168
PBundyra
closed
3 weeks ago
1
Fix issue with device check failure
#167
jonb377
closed
4 weeks ago
1
Create `xpk` package with `utils` module
#166
PBundyra
closed
3 weeks ago
3
Create xpk package, utils module and refactor
#165
PBundyra
opened
1 month ago
0
Enabling Workload Identity and GCSFuse driver flags
#164
sharabiani
closed
2 weeks ago
0
Python3.10 fix - use CSV format for gcloud commands to simplify parsing
#163
nhira
closed
1 month ago
1
Allow SIGTERM error code to be returned from XPK
#162
Obliviour
closed
1 month ago
0
Create cluster from several reservations
#161
DwarKapex
opened
1 month ago
1
Fix non-accelerator pools from being part of accelerator node pool cr…
#160
Obliviour
opened
1 month ago
0
Use csv formatting instead in the gcloud command to split the names o…
#159
Obliviour
closed
1 month ago
1
xpk Cluster Queue resource group "cpu" resource quota incorrect for a CPU-only cluster
#158
bernardhan33
opened
1 month ago
11
Correct Suspend/Resume backoffLimit for Pathways
#157
SujeethJinesh
closed
1 month ago
3
Remove flag `pathways_compilation_mode` from xpk.py
#156
norx1991
closed
2 months ago
0
Remove incorrect plural from filter-by-job
#155
Obliviour
closed
2 months ago
0
Update XPK to support topology-aware scheduler for GPU workloads.
#154
yangyuwei
closed
2 months ago
1
Update the CloudDNS check.
#153
lukebaumann
opened
3 months ago
3
Add logic to fail Pathways jobs on user code errors.
#152
RoshaniN
closed
3 months ago
1
Add a check and update existing Pathways clusters to use CloudDNS.
#151
RoshaniN
closed
3 months ago
0
Move all clusters to be RAPID clusters, and verify them using valid_v…
#150
Obliviour
closed
3 months ago
0
Fixing device_type in nightly tests.
#149
RoshaniN
closed
3 months ago
0
Restrict Pathways unified debugging logs to just first worker
#148
SujeethJinesh
closed
3 months ago
0
Fix stacktrace sidecar container yaml
#147
SurbhiJainUSC
closed
3 months ago
0
Don't Kill RM or Proxy on user job failure
#146
SujeethJinesh
opened
3 months ago
0
Enable cluster and workload creation on A3+.
#145
yangyuwei
closed
3 months ago
1
Add Unified Logging View for Pathways on Cloud
#144
SujeethJinesh
closed
3 months ago
1
Update pathways server and proxy server image locations
#143
sadikneipp
closed
4 months ago
0
Fixing misleading message on Validate Docker Image.
#142
RoshaniN
closed
4 months ago
0
Pathways in headless mode.
#141
RoshaniN
closed
3 months ago
0
Allow JAX coordinator to find the JobSet name.
#140
RoshaniN
closed
4 months ago
0
Making exit flow similar to other XPK commands.
#139
RoshaniN
closed
4 months ago
1
Update pip version to 0.5.0
#138
SurbhiJainUSC
closed
4 months ago
1
Update xpk.py
#137
kyle-google
closed
4 months ago
2
Remove user-managed service account and attach default compute engine service account to node pools
#136
SurbhiJainUSC
closed
4 months ago
0
Dynamically determine GKE Version for Cluster and Node Pool Creation
#135
Obliviour
closed
4 months ago
0
Prevent Pathways SIGTERMs from counting against backoffLimit
#134
SujeethJinesh
opened
4 months ago
0