redhat-cop / rhel-edge-automation-arch

RHEL for Edge Automation Deployment Architecture

Mirror OStree repository from Stage fails due to min free space being less than 3% in hello-world #214

Open jordigilh opened 2 years ago

jordigilh commented 2 years ago

When running the rfe-oci-publish-content pipeline for the hello-world example, I noticed that the Mirror OStree repository from Stage task fails when it runs this command:

sh-4.4$ ostree --repo=/var/www/html/hello-world/latest pull --mirror hello-world-latest rhel/8/x86_64/edge
Writing objects: 10                                                                                                                                                                                                                                             
error: Writing content object: min-free-space-percent '3%' would be exceeded, at least 4.2 MB requested

There is plenty of space available in the pod:

sh-4.4$ df -h
Filesystem                                                                                                                                                Size  Used Avail Use% Mounted on
overlay                                                                                                                                                    49G   21G   28G  43% /
tmpfs                                                                                                                                                      64M     0   64M   0% /dev
tmpfs                                                                                                                                                      24G     0   24G   0% /sys/fs/cgroup
shm                                                                                                                                                        64M     0   64M   0% /dev/shm
tmpfs                                                                                                                                                      24G   64M   24G   1% /etc/passwd
/dev/vda4                                                                                                                                                  49G   21G   28G  43% /etc/hosts
10.131.64.145:6789,10.130.28.142:6789,10.129.232.195:6789:/volumes/csi/csi-vol-54522f7c-640e-11ec-bf74-0a58ac1e0810/82e6d092-d677-4c13-99e2-f4a3345424c8  100G     0  100G   0% /var/www/html
tmpfs                                                                                                                                                      24G   20K   24G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                                                                                                                      24G     0   24G   0% /proc/acpi
tmpfs                                                                                                                                                      24G     0   24G   0% /proc/scsi
tmpfs                                                                                                                                                      24G     0   24G   0% /sys/firmware

However, if I add the following setting to the [core] section in /var/www/html/config:

min-free-space-percent=0

so that the file looks like this:

[core]
repo_version=1
mode=archive-z2
min-free-space-percent=0

[remote "hello-world-latest"]
url=http://hello-world-latest-httpd.rfe.svc.cluster.local/repo

The command runs successfully:

sh-4.4$ ostree --repo=/var/www/html/hello-world/latest pull --mirror hello-world-latest rhel/8/x86_64/edge
3983 metadata, 28529 content objects fetched; 775294 KiB transferred in 430 seconds; 1.9 GB content written 

This solution might work for this use case, but repos that are bigger than this one will fail even with this configuration change.
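
For reference, the same change can probably be made without editing the file by hand, using the ostree config subcommand (a sketch, not verified in this pipeline; the repo path is the one used above):

sh-4.4$ ostree --repo=/var/www/html/hello-world/latest config set core.min-free-space-percent 0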

ygalblum commented 2 years ago

I've found the root cause of this issue, but I'm not sure how to address it.

When starting a transaction, ostree checks the number of free blocks. It does that by querying the filesystem with fstatfs and dividing the free size by the block size. For each object it downloads, it calculates the number of blocks the object will take (that is, object size / block size + 1). It then checks that enough blocks are available and, if so, reduces the free-block count before handling the next object.

I ran the same sequence once on the root filesystem and once on the PVC and found that while the former has a block size of 4K, the latter's is 4M. Since many objects are small (even less than 4K), on the PVC they are accounted as if each one takes 4M, while on the RootFS the size calculated for them is much closer to their actual size. As a result of the 4M block size, no more than roughly 25K objects can be downloaded to a 100GB drive.
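
To make the arithmetic concrete, here is a rough sketch in shell; the 100GB volume size and 4M block size are the figures reported above, and the 1KiB object size is just an illustrative example:

# accountable blocks on a 100 GB volume with a 4 MB block size
echo $(( 100 * 1024 / 4 ))                 # 25600 blocks, i.e. roughly 25K objects at best
# blocks charged for a single 1 KiB object: object size / block size + 1
echo $(( 1024 / (4 * 1024 * 1024) + 1 ))   # 1 block, so a full 4 MB is reserved for a 1 KiB file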

My guess is that when running on AWS (where the default storage class is AWS-backed) the block size is 4K, which is why this issue is not observed there. However, when using CEPH, the issue occurs every time.

ygalblum commented 2 years ago

Another issue that @jordigilh raised is that this process disregards other processes that might consume the disk. Since the free amount is calculated only once, at the beginning of the process, ostree will not notice if another process is filling up the disk at the same time.

ygalblum commented 2 years ago

Some more findings. It seems that this is not a container issue. I SSHed to the node running the container and ran stat -f on the mount point of the PV (a reproduction sketch follows the list below). There seem to be two issues:

  1. The block size is set to 4M.
  2. The free-blocks counter equals the total-blocks counter even though the directory has content in it (du -ch returns 766M). This also seems to affect df -h, which likewise shows 0 in the Used column.
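
For anyone reproducing the check, something along these lines should print the reported block size and block counts on the PV mount point (a sketch using GNU coreutils stat; the path is the one from this issue):

stat -f -c 'block size: %S, total blocks: %b, free blocks: %f' /var/www/html
du -ch /var/www/html | tail -1
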
jordigilh commented 2 years ago

Can we use min-free-space-size and set it to a reasonable amount, like 500MB? As per the documentation:

if `min-free-space-size` is set to a non-zero value, `min-free-space-percent` is ignored

I like this option better than setting the percentage to 0: 1GB or even 500MB should be sufficient for our long-term purposes (when you have 100GB of storage available :smile:, of course) while still guaranteeing that we don't end up filling the PVC.
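
For illustration, the repo config shown earlier would then look something like this (a sketch; the 500MB figure is just the value discussed here, not a tested recommendation):

[core]
repo_version=1
mode=archive-z2
min-free-space-size=500MB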

Although, come to think of it, since it's a mount point that is not critical to the OS (!= /), filling it up would not be terrible: it won't crash the httpd pod. WDYT? What we do need is a cleanup process, or some other way to make sure the data does not grow beyond the PV's limits.

ygalblum commented 2 years ago

In general, I agree that setting a known size instead of a percentage is better. As you said, it's not the RootFS, so the volume's size should not dictate how much free space we want to keep available. But I do have some comments:

  1. Setting the number will not solve the issue we have with CEPH. The root cause of that issue is that, according to stat, the block size is 4M (amounting to only 25K blocks), while in fact we see that the files are saved in 1K blocks.
  2. Regarding your comment about not crashing httpd, you are right that it will not crash. But what's the point of a running server if the data it serves is incorrect?

As for cleaning, for sure we need a process for it. But that raises additional questions:

  1. Currently, all images are served from a single httpd server. Is this the intended behavior going forward?
  2. How do we scale the httpd service if the mirroring command is executed locally on a ReadWriteOnce PV?

jordigilh commented 2 years ago

Even with such protection in place (3% or a fixed size), mirroring can still fail when two mirroring tasks run concurrently close to the disk allocation limit. I don't think the current approach was designed to support concurrent builds, and even so the amount of disk we are allocating (100Gi) is far more than what is needed in the short term. So the issue is not just that ostree miscalculates the amount of available space because of CEPH, but also that the current design needs improvements to support concurrent builds with dedicated image-builders/httpd servers.

ygalblum commented 2 years ago

I agree with you that ostree disregards anything else that might be using the disk while it is downloading, and I guess we should open a ticket with them. Having said that, we need to understand our own design in order to judge how important (or not) this issue is to us.