As a task for the main story, I would add
So, here is what's done already:
The main cluster used for testing is golem. Zot is also deployed to goat and to grizzly, but on goat it doesn't work yet, and on grizzly it doesn't have any real load. So I'm focusing on golem at the moment.
Zot can be configured as a pull-through cache via the happa UI.
The path to the config in the happa UI:
- Global
- Components
- Containerd
- Container Registries
Add a new entry; the key should be the URL of the registry that you want to be cached. In our case zot is configured to cache images from the gsoci registry.
So add a new key: gsoci.azurecr.io
Add new values:
- `zot.golem.gaws.gigantic.io`
- `gsoci.azurecr.io` # I'm not sure this is the correct way to set it up, but I'm doing it this way; maybe there is no need for the duplication, I'll check later
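For reference, the resulting cluster values should end up looking roughly like this (a sketch only; it mirrors the per-WC example further down, and whether the zot endpoint needs a scheme prefix is an assumption):

global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: zot.golem.gaws.gigantic.io  # assumption: a scheme prefix (https://) may be needed here
          - endpoint: gsoci.azurecr.io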
Zot's grafana dashboard is available here: https://grafana.golem.gaws.gigantic.io/d/2JZeZ6hSk/in-cluster-container-registry-zot?orgId=1
While a new cluster is being created, you should be able to see some activity going on in zot.
I've configured only read access for anonymous users, because caching is considered a read action. So clusters can pull images through zot without credentials, but it also means that private images would become publicly available once they are cached, so currently zot should only be used for public images.
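For context, the anonymous read-only part corresponds to zot's accessControl settings. A sketch of the relevant section, shown as YAML with comments for readability even though the real config file is JSON (the htpasswd path and the user-to-policy mapping are assumptions):

http:
  auth:
    htpasswd:
      path: /secret/htpasswd  # assumption: wherever the chart mounts the htpasswd file
  accessControl:
    repositories:
      "**":
        anonymousPolicy: [read]  # anonymous users may only pull
        policies:
          - users: [prom]        # assumption
            actions: [read]
    adminPolicy:
      users: [admin]
      actions: [read, create, update, delete]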
There are two users defined for Zot:
The prom user is the one that should be used by Prometheus to scrape metrics (assuming the default config is used; you can find it here: https://github.com/giantswarm/management-cluster-bases/blob/main/extras/zot/values.yaml).
The admin user is not supposed to be used at all, because we don't want to use zot as a regular registry. But in case someone needs it, the password can be found in the same k8s secret that is used for the secret Helm values (https://github.com/giantswarm/giantswarmc-management-clusters/blob/main/management-clusters/golem/extras/resources/zot-secret-values.yaml).
The password is added there as a separate value because zot itself needs only an htpasswd file, and the plain-text password can't be recovered from it.
Currently, zot's config consists of three parts.
I've tried to put as much as possible into the base layer, so only the parts that differ per installation are moved out to the per-cluster kustomizations, but there is room for improvement.
Secret values can be generated with a script: https://github.com/giantswarm/management-cluster-bases/blob/main/extras/zot/make_passwords.sh
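For reference, a minimal sketch of what such a generation step can look like (this is not the actual make_passwords.sh; user names, output files, and the htpasswd invocation are assumptions):

#!/usr/bin/env bash
# hypothetical sketch, not the real script from the repo above
set -euo pipefail

PROM_PASS="$(openssl rand -hex 16)"
ADMIN_PASS="$(openssl rand -hex 16)"

# zot itself only consumes the htpasswd file; the plain-text admin password
# is stored as a separate secret value so humans can still log in with it
htpasswd -Bbn prom "$PROM_PASS" > htpasswd
htpasswd -Bbn admin "$ADMIN_PASS" >> htpasswd

echo "admin password: $ADMIN_PASS"
echo "prom password:  $PROM_PASS"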
There are still issues that have to be figured out
I would say that the only showstopper for us is the first problem (storage), because we need to make sure we have enough room for the cache and are able to get alerted. So once it's fixed (or at least worked around), we can proceed.
After those 4 steps are done on the testing clusters, we can check how it's going, and once we have some understanding we can proceed with deploying it to the production clusters.
Issue about metadata being pushed to zot: https://github.com/project-zot/zot/issues/2392
The "storage thing", I don't completely understand yet. But that's what I've found out.
Zot is populating value per repo by walking through directories and getting a size of anything that is is not a dir (blobs I assume). It's doing it for every repository but it's not checking the _trivy
directory that is currently taking about 1.3Gb. Hence we can't fully rely on zot metrics here, I would rather use kubernetes volumes metrics to make sure we're not running out of space.
The zot's one seems to be rather informative than critical. To get the used space, we could use something like that: kubelet_volume_stats_used_bytes
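An alert on top of that metric could look roughly like this (a sketch; the PVC label selector, threshold, and timing are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: zot-cache-storage
spec:
  groups:
    - name: zot-cache-storage
      rules:
        - alert: ZotCacheVolumeAlmostFull
          # ratio of used to total bytes on the zot PVC; the claim name pattern is an assumption
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*zot.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*zot.*"}
              > 0.8
          for: 15m
          annotations:
            description: The zot pull-through cache volume is more than 80% full.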
Snail is configured to use the zot deployed on itself (zot.snail.gaws.gigantic.io) as a pull-through cache.
We've decided to try configuring a kyverno policy to update the containerd configuration per WC, so that they use zot.
Setting up a per-WC cache
Create a new WC with a config like this:
global:
  components:
    containerd:
      containerRegistries:
        gsoci.azurecr.io:
          - endpoint: http://127.0.0.1:32767
          - endpoint: gsoci.azurecr.io
After it's created, install Zot as an App with values like these:
service:
  type: NodePort
  port: 5000
  nodePort: 32767
  # Annotations to add to the service
  annotations: {}
  # Set to a static IP if a static IP is desired, only works when
  # type: ClusterIP
  clusterIP: null
strategy:
  type: Recreate
serviceMonitor:
  enabled: false
persistence: true
pvc:
  create: true
  accessMode: ReadWriteOnce
  storage: 50Gi
policyException:
  enforce: true
global:
  podSecurityStandards:
    enforced: true
configFiles:
  config.json: |-
    {
      "storage": {
        "rootDirectory": "/var/lib/registry",
        "dedupe": true,
        "gc": true,
        "gcDelay": "1h",
        "gcInterval": "24h"
      },
      "http": {
        "address": "0.0.0.0",
        "port": "5000"
      },
      "log": {
        "level": "debug"
      },
      "extensions": {
        "sync": {
          "registries": [
            {
              "urls": [
                "https://gsoci.azurecr.io"
              ],
              "onDemand": true,
              "tlsVerify": true,
              "maxRetries": 3,
              "retryDelay": "5m"
            }
          ]
        },
        "scrub": {
          "enable": true
        },
        "search": {
          "enable": true,
          "cve": {
            "updateInterval": "2h"
          }
        },
        "metrics": {
          "enable": true,
          "prometheus": {
            "path": "/metrics"
          }
        }
      }
    }
After removing some pods and making sure images are pulled again:
❯ curl -s localhost:5000/v2/_catalog | jq
{
"repositories": [
"giantswarm/background-controller",
"giantswarm/cert-exporter",
"giantswarm/coredns",
"giantswarm/kyverno",
"giantswarm/kyvernopre",
"giantswarm/policy-reporter",
"giantswarm/policy-reporter-kyverno-plugin",
"giantswarm/policy-reporter-ui",
"giantswarm/prometheus",
"giantswarm/prometheus-config-reloader",
"giantswarm/reports-controller"
]
}
Here localhost is the port-forwarded zot service in the WC.
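For reference, that port-forward can be set up roughly like this (the namespace and service name are assumptions; adjust to wherever the zot App is installed in the WC):

# namespace and service name are assumptions
kubectl -n zot port-forward svc/zot 5000:5000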
I think we might want to stop using the inline JSON config and use the toJson Helm function instead. Then, instead of providing a huge values blob whose contents are mostly already defined in the default values, users would be able to pass just a list of registries that they want mirrored.
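A rough sketch of what that could look like, as a ConfigMap template in the chart (the values keys zotDefaultConfig and mirroredRegistries are hypothetical, not existing chart values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: zot-config
data:
  {{- /* hypothetical: build the sync registries from a plain list, keep the rest of the default config as-is */}}
  {{- $cfg := deepCopy .Values.zotDefaultConfig }}
  {{- $registries := list }}
  {{- range .Values.mirroredRegistries }}
  {{- $registries = append $registries (dict "urls" (list .) "onDemand" true "tlsVerify" true) }}
  {{- end }}
  {{- $_ := set $cfg.extensions.sync "registries" $registries }}
  config.json: |-
    {{ toJson $cfg }}

With user values then being as simple as:

mirroredRegistries:
  - https://gsoci.azurecr.io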
TODOS:
After the kyverno policy was deployed to the snail cluster, I can see that snail's zot is used by default, so I guess we can deploy it to the other clusters to collect some zot metrics and understand what should be used for alerting.
UPD: I think there is some flaky behaviour; from time to time clusters are not created because the cm is missing, so I need to fix that first.
After https://github.com/giantswarm/management-cluster-bases/pull/134 was merged, it seems to be working fine again.
I've tried creating 10 clusters at the same time, and zot seemed to handle it fine.
I've updated zot's config, and memory consumption has noticeably decreased. It's not what I promised in the PR, but on grizzly it's currently up to about 900 MB.
I've noticed that the storage panels in zot's Grafana dashboards are not working, at least on grizzly and goat, so I need to fix them. After that's done, I'd like to put this ticket on hold and wait a bit to gather metrics and complaints.
@piontec does this issue still need to be watched? Is there a way forward, or can it be closed?
Done, closing
Let's prepare zot for deployment with flux. To get started:
- config.json