jordigilh opened this issue 2 years ago
I've found the root cause of this issue, but I'm not sure how to address it.
When starting a transaction, `ostree` checks the amount of free blocks. It does that by getting filesystem information via `fstatfs` and dividing the free size by the block size. For each object it downloads, it calculates the number of blocks the object will take (that is, the object's size divided by the block size, rounded up to whole blocks).
I ran the same sequence once on the root filesystem and once on the PVC and found that while the former has a block size of 4K, the latter's is 4M. Since many objects are small (even less than 4K), on the PVC they are accounted as if each one takes 4M, while on the RootFS the size calculated for them is much closer to their actual size. As a result of the 4M block size, for a 100GB drive, no more than 25K objects may be downloaded.
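For illustration, this is roughly how the mismatch shows up (a sketch; the PV mount point is assumed to be the repo path mentioned later in this thread):

```shell
# Block size reported by each filesystem (values as observed in this issue)
stat -f -c 'block size: %S, free blocks: %f' /              # RootFS: 4K blocks
stat -f -c 'block size: %S, free blocks: %f' /var/www/html  # CEPH-backed PV: 4M blocks
```

Since every object is charged at least one whole block, a 1K object consumes 4K of the budget on the RootFS but 4M on the PV, which is where the ~25K object ceiling on a 100GB volume comes from.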
My guess is that when running on AWS (where the default storage class is the AWS one) the block size is 4K, and that is why this issue is not observed there. However, when using CEPH, the issue occurs every time.
Another issue that @jordigilh raised is that this process disregards other processes that might consume the disk. Since the free amount is calculated once at the beginning of the process, if more than one process is filling up the disk, this process will not know about it.
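Schematically (a sketch of the behavior described above, not `ostree`'s actual code):

```shell
# The free-space budget is derived from a single snapshot taken at transaction start
free_blocks_at_start=$(stat -f -c %f /var/www/html)   # sampled once, path assumed
# ...every downloaded object is then budgeted against that number...
# a concurrent writer that fills the disk in the meantime is never re-checked
```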
Some more findings. It seems that this is not a container issue. I've SSHed to the node running the container and ran `stat -f` on the mount point of the PV. There seem to be two issues: the reported block size is 4M, and the used space is reported as 0 (while `du -ch` returns 766M). This also seems to affect `df -h`, since it also shows 0 in the Used column.
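For reference, the comparison on the node amounts to something like this (mount point assumed to be the repo path used elsewhere in this thread):

```shell
stat -f /var/www/html            # reports a 4M block size and 0 used blocks
du -ch /var/www/html | tail -n1  # the content actually adds up to ~766M
df -h /var/www/html              # Used column also shows 0
```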
Can we use `min-free-space-size` and set it to a reasonable amount, like 500Mb? As per the documentation:

> if `min-free-space-size` is set to a non-zero value, `min-free-space-percent` is ignored
I like this option better than setting the percentage to 0, and 1Gb or even 500Mb should be sufficient for our long-term purposes (when you have 100Gb of storage available :smile:, of course) while guaranteeing that we don't end up filling the PVC.
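A minimal sketch of that change, assuming the served repo lives at `/var/www/html` (the value is illustrative):

```shell
# Set a fixed reserve instead of a percentage; per the ostree docs, a non-zero
# min-free-space-size causes min-free-space-percent to be ignored.
ostree config --repo=/var/www/html set core.min-free-space-size 500MB
```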
Although, when you come to think of it, since it's a mount point that is not critical to the OS (i.e. not `/`), it won't be terrible if it fills up, since it won't crash the `httpd` pod. WDYT?
What we need to make sure of is that there is a cleaning process, or that we have some other way to guarantee the data does not grow beyond its PV limits.
In general, I agree that setting a known size instead of a percentage is better. As you said, it's not the RootFS, and hence the volume's size should not affect the free space we want to keep available. But I do have some comments:
- According to `stat`, the block size is 4M (amounting to only 25K blocks), while in fact we see that the files are saved in 1K blocks.
- As for `httpd`, you are right that it will not crash. But what's the point of a running server if the data is incorrect?

As for cleaning, for sure we need a process for it (a possible prune step is sketched at the end of this comment). But that raises additional questions:
- Data keeps accumulating on the `httpd` server. Is this the intended behavior going forward?
- What happens to the `httpd` service if the mirroring command is executed locally on a ReadWriteOnce PV?

Even with such protection in place (3% or a fixed size), it can potentially fail when running two mirroring tasks concurrently close to the disk allocation limit. I don't think the current approach has been modeled to support concurrent builds, and even so the amount of disk we are allocating (100Gi) is far more than what is needed in the short term.
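On the cleaning point raised above, one possible shape for such a step (purely a sketch, not something that exists in the current design) would be a periodic prune of the served repo:

```shell
# Drop objects no longer reachable from any ref and trim commit history;
# repo path assumed from this thread, retention policy illustrative.
ostree prune --repo=/var/www/html --refs-only --depth=1
```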
So the issue is not just `ostree` not calculating the correct amount of available space because of CEPH, but also that the current design needs improvements to support concurrent builds with dedicated image-builder/`httpd` servers.
I agree with you that `ostree` disregards anything else that might be using the disk while it is downloading, and I guess we should open a ticket with them. Having said that, we need to understand our design in order to understand how important (or not) this issue is to us.
When running the `rfe-oci-publish-content` pipeline for the `hello-world` example, I noticed that the "Mirror OStree repository from Stage" step fails when running this command:

There is plenty of space available in the pod:
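The exact command and its output were not preserved in this extract; a mirror pull of that kind, together with the space check, would look roughly like the following (remote and ref names are placeholders):

```shell
# Hypothetical reconstruction of the failing step and the space check
ostree --repo=/var/www/html pull --mirror <stage-remote> <ref>
df -h /var/www/html   # reports plenty of free space, yet the pull still fails
```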
However, if I add this configuration to the `[core]` section in `/var/www/html/config`, which looks like this:
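The snippet itself was not preserved in this extract; based on the comments above about setting the percentage to 0, the relevant part of the repo config presumably ends up along these lines:

```shell
# Hypothetical reconstruction of /var/www/html/config after the change
# (only min-free-space-percent=0 is implied by this thread; other keys elided)
cat /var/www/html/config
# [core]
# ...
# min-free-space-percent=0
```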
The command runs successfully:
This solution might work for this use case, but repos that are bigger than this one will fail even with this configuration change.