leptonai/gpud
Apache License 2.0
188
stars
11
forks
issues
fix(nvidia/infiniband): do not set unhealthy if infiniband is not supported
#140
gyuho
closed
16 hours ago
0
nits(internal/server): clean up xid/dmesg component dependency logic
#139
gyuho
closed
17 hours ago
0
fix(nvidia/xid): do not error log when no xid happened yet
#138
gyuho
closed
3 days ago
0
fix(nvidia): persistence mode check based on NVML, do not rely on "nvidia-persistenced" binary
#137
gyuho
closed
3 days ago
0
fix(status): fix divide by zero
#136
cardyok
closed
4 days ago
0
nits(nvidia/query): make detect logs debug level
#135
gyuho
closed
4 days ago
0
feat(network/latency): track latency in metrics per region
#134
gyuho
closed
4 days ago
0
fix(nvidia/query/metrics): remove duplicate metric register call
#133
gyuho
closed
6 days ago
0
feat(nvidia): exposing SM core and tensor core metrics in GPUd
#132
photoszzt
closed
1 week ago
1
fix(server): handle "components" URL query, return 404 not found on unknown component queries
#131
gyuho
closed
4 days ago
0
fix(components): do not panic when there's no data collected yet
#130
gyuho
closed
1 week ago
0
fix(containerd): readable query failure error message (When CRI is not set up)
#129
gyuho
closed
1 week ago
0
fix(nvidia-smi/parse): do not parse remapped rows N/A
#128
gyuho
closed
1 week ago
0
fix(nvidia): use NVML + lspci to detect NVIDIA GPUs (without running nvidia-smi)
#127
gyuho
closed
4 days ago
0
nits(server): debug level log for redundant register attempts
#126
gyuho
closed
1 week ago
0
feat(component/network): latency checks to global edge/DERP servers (using tailscale)
#125
gyuho
closed
1 week ago
4
fix(infiniband): simplify ibstat existence when evaluating healthy
#124
gyuho
closed
4 days ago
0
feat(charts): add gpud run helm chart
#123
gyuho
closed
5 days ago
1
feat(gpud): "gpud run --auto-update-exit-code" for daemon set auto update use case (optional)
#122
gyuho
closed
2 weeks ago
0
feat(nvidia): add bad-envs component for `DCGM_FR_BAD_CUDA_ENV` logic in DCGM
#121
gyuho
closed
2 weeks ago
0
feat(nvidia): add persistence-mode (both legacy, persistenced daemon checks), implements `DCGM_FR_PERSISTENCE_MODE` in DCGM
#120
gyuho
closed
2 weeks ago
0
feat(nvidia): inspect process zombie status, bad env vars for CUDA per process (`DCGM_FR_BAD_CUDA_ENV`)
#119
gyuho
closed
2 weeks ago
2
nits(nvidia): fix Xid comments in error descriptions
#118
gyuho
closed
2 weeks ago
0
fix(nvidia): suggest reboot for Xid 45, async nvidia-smi checks to not be stuck
#117
gyuho
closed
2 weeks ago
4
fix(session): close writer goroutine
#116
cardyok
closed
2 weeks ago
0
feat(reboot): support optional delay reboot (reboot immediately by default)
#115
gyuho
closed
2 weeks ago
0
feat(nvidia/nvml): update nvlib to 0.7.0, rename device ID fields
#114
gyuho
closed
2 weeks ago
0
fix(fabric manager, nccl): fix fabric manager regex, add NCCL monitoring using dmesg
#113
gyuho
closed
2 weeks ago
0
feat(nvidia/infiniband): suggest repair hardware for infiniband switch down
#112
gyuho
closed
2 weeks ago
0
feat(nvidia/info): report GPU device count from "/dev" (`DCGM_FR_DEVICE_COUNT_MISMATCH` DCGM)
#111
gyuho
closed
2 weeks ago
0
fix(docker-container): "gpud run --docker-ignore-connection-errors" to ignore docker daemon connection errors (do not ignore by default)
#110
gyuho
closed
2 weeks ago
1
fix(k8s/pod): "gpud run --kubelet-ignore-connection-errors" to not mark unhealthy when read only port is not open (does not ignore by default)
#109
gyuho
closed
2 weeks ago
0
fix(components/accelerator-nvidia-ecc): do not mark unhealthy when driver recovers uncorrectable ecc errors
#108
gyuho
closed
2 weeks ago
0
fix(reboot): sudo typo
#107
cardyok
closed
2 weeks ago
0
fix(install.sh): print download failure debugging info
#106
gyuho
closed
2 weeks ago
0
fix(install.sh): fix install doc links
#105
gyuho
closed
2 weeks ago
0
fix(systemd, nvidia): mark dbus connection not available if not initialized, to avoid nil pointer panic
#104
gyuho
closed
2 weeks ago
0
Running gpud as a daemonset/pod in an EKS cluster
#103
chatter92
closed
1 week ago
6
Unable to install from pkg.gpud.dev
#102
chatter92
closed
1 week ago
3
feat(components/library): periodically check libnvidia/libcuda* (experimental)
#101
gyuho
closed
2 weeks ago
0
feat(nvidia/xid): add check user app and GPU action type, apply "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning by DeepSeek AI"
#100
gyuho
closed
3 weeks ago
0
feat(nvidia/ibstat): check "Physical state" as fallback
#99
gyuho
closed
2 weeks ago
0
feat(session): support reboot method
#98
cardyok
closed
3 weeks ago
0
feat(build, release): support Amazon Linux 2 and 2023 (experimental)
#97
gyuho
closed
3 weeks ago
0
feat(pkg/reboot): initial commit
#96
gyuho
closed
3 weeks ago
0
feat(components): add accelerator detect func, "gpud accelerator" subcommand
#95
gyuho
closed
2 weeks ago
0
feat(server): allow custom uid with cli
#94
cardyok
closed
3 weeks ago
0
fix(components/fd): rename "fd_max_file_exists" to "fd_limit_supported", fix get limit on darwin
#93
gyuho
closed
3 weeks ago
0
feat(gpud): add "file" component that returns healthy when all specified files exist
#92
gyuho
closed
3 weeks ago
0
doc(sxid): add more example events for gpu-operator
#91
gyuho
closed
4 weeks ago
1