Open buinauskas opened 2 weeks ago
The spikes you are observing are almost certainly caused by flushing of in-memory data structures to disk, which requires temporary disk usage that is proportional to the memory used by that data structure (in this case presumably a large tensor attribute).
As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.
The automatic feed blocking mechanisms are not currently clever enough to anticipate the impact that future flushes will have based on the already fed data. We should ideally look at the ratio of host memory to disk and automatically derive a reasonable default block threshold based on this—it is clear that the default limits are not appropriate for high memory + low disk setups.
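To make the reasoning above concrete, here is a minimal sketch of the two checks it implies: the 3x disk-to-memory rule of thumb, and a rough projection of whether a future flush would push disk usage past a block threshold. The function names, the `disk_limit` parameter, and `temp_factor` (how much temporary disk a flush needs per byte of attribute memory) are hypothetical illustrations, not part of Vespa's API, and the 1.0 factor is an assumption rather than a measured value.

```python
GIB = 1024 ** 3


def meets_disk_rule(memory_bytes: int, disk_bytes: int, factor: float = 3.0) -> bool:
    """Return True if disk is at least `factor` times memory (the 3x rule of thumb)."""
    return disk_bytes >= factor * memory_bytes


def flush_would_exceed_limit(
    disk_used_bytes: int,
    disk_total_bytes: int,
    attribute_memory_bytes: int,
    disk_limit: float,
    temp_factor: float = 1.0,  # assumption: temporary disk ~= attribute memory size
) -> bool:
    """Project disk usage during a flush and compare it with a block threshold.

    `disk_limit` is a fraction of total disk (for example 0.75); the value is
    supplied by the caller and is not read from any Vespa configuration.
    """
    projected = disk_used_bytes + temp_factor * attribute_memory_bytes
    return projected / disk_total_bytes > disk_limit


# Illustrative numbers for a high memory + low disk node (not from the issue).
rule_ok = meets_disk_rule(memory_bytes=512 * GIB, disk_bytes=256 * GIB)
flush_blocked = flush_would_exceed_limit(
    disk_used_bytes=100 * GIB,
    disk_total_bytes=256 * GIB,
    attribute_memory_bytes=120 * GIB,
    disk_limit=0.75,
)
```

With these illustrative numbers the node fails the 3x rule, and a flush of a 120 GiB attribute would be projected to exceed a 0.75 disk threshold even though current usage is well below it.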
I have to admit that our test hardware is quite unusual, but we have to deal with what we've got. It's good that we discovered it in such circumstances.
> As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.
We'll keep that in mind.
We have now reduced our test dataset size and are happy to know what caused the problem. Should the issue be left open? It does seem like a bug for a rare edge case, and not hugely important given how unlikely it is to happen.
> Should the issue be left open? It does seem like a bug for a rare edge case, and not hugely important given how unlikely it is to happen.
I'm leaving the issue open for now, as it'd be a good thing to detect and at least warn about.
Describe the bug
We have a dedicated single-node Vespa deployment to test features and optimizations; it helps us predict how changes will scale to larger deployments.
This test deployment has more usable memory than usable disk. Can this be an issue for the resource limiter?
To Reproduce
Steps to reproduce the behavior:
That's the relevant embedding schema. We attach multi-dimensional photo clip embeddings, where each dimension is a unique photo ID associated with that document.
Expected behavior
Screenshots
The content_proton_resource_usage_disk_usage_total_max metric was used.
Environment (please complete the following information):
Vespa version 8.363.17
Additional context
These are the interesting logs, in the order they occur:
That's our services.xml file: