
Feeding does not respect resource limits, crashes node #32288

Open buinauskas opened 2 weeks ago

buinauskas commented 2 weeks ago

Describe the bug

We have a dedicated single-node Vespa deployment for testing features and optimizations; it helps us predict how changes will scale to larger deployments.

This test deployment has more usable memory than usable disk. Could this be an issue for the resource limiter?

To Reproduce

Steps to reproduce the behavior:

  1. Seed ~80M documents into Vespa
  2. Update these documents using Vespa's partial update feature to attach embeddings; there are ~3 photo embeddings per document
  3. After some time, disk usage starts spiking and reaches 100%
  4. The machine reports that too many inodes are used
  5. The node goes down

This is the relevant embedding schema; we attach photo CLIP embeddings as a mixed tensor, where each label in the mapped photo_id dimension is the unique photo ID associated with that document.

field photo_embeddings type tensor<bfloat16>(photo_id{}, embedding[512]) {
    indexing: attribute | index
    attribute {
        fast-rank
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 96
        }
    }
}
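
To make step 2 concrete, a single partial update that attaches one photo embedding under the mapped photo_id dimension looks roughly like the sketch below. The document id, namespace, photo id, and values are made up, and the values array is truncated; a real update carries all 512 numbers per photo. create: true is shown because we also track update_with_create in our metrics config.

{
    "update": "id:mynamespace:items::123456789",
    "create": true,
    "fields": {
        "photo_embeddings": {
            "add": {
                "blocks": [
                    {
                        "address": { "photo_id": "987654321" },
                        "values": [0.0131, -0.2084, 0.0457, 0.1198]
                    }
                ]
            }
        }
    }
}

An assign update of the full tensor is the alternative when all photo embeddings for a document are replaced at once; the add form keeps the existing blocks and only inserts the new photo.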

Expected behavior

Feed operations are blocked by the resource limiter before the disk fills up, instead of the node crashing.
Screenshots

(screenshot omitted)

Environment (please complete the following information):

Vespa version 8.363.17

Additional context

These are the relevant log lines, in the sequence they occurred:

Aug 27, 2024 @ 17:16:48.000 what():  Fatal: Writing 2097152 bytes to '/opt/vespa/var/db/vespa/search/cluster.vinted/n2/documents/items/0.ready/attribute/photo_embeddings/snapshot-230031437/photo_embeddings.dat' failed (wrote -1): No space left on device

Aug 27, 2024 @ 17:16:48.000 PC: @     0x7faea85ef52f  (unknown)  raise

Aug 27, 2024 @ 17:16:48.000 terminate called after throwing an instance of 'std::runtime_error'

Aug 27, 2024 @ 17:16:48.000 *** SIGABRT received at time=1724768208 on cpu 64 ***

Aug 27, 2024 @ 17:17:04.000 Write operations are now blocked: 'diskLimitReached: { action: "add more content nodes", reason: "disk used (0.999999) > disk limit (0.9)", stats: { capacity: 475877605376, used: 475877257216, diskUsed: 0.999999, diskLimit: 0.9}}'

Aug 27, 2024 @ 17:17:21.000 Unable to get response from service 'searchnode:2193:RUNNING:vinted/search/cluster.vinted/2': Connect to http://localhost:19107 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused

This is our services.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<services version="1.0">
  <admin version="2.0">
    <slobroks>
      <slobrok hostalias="vespa-scale-readiness-cfg1.infra"/>
    </slobroks>
    <configservers>
      <configserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    </configservers>
    <cluster-controllers>
      <cluster-controller hostalias="vespa-scale-readiness-cfg1.infra"/>
    </cluster-controllers>
    <adminserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    <metrics>
      <consumer id="custom-metrics">
        <metric-set id="default"/>
        <metric id="update_with_create.count"/>
      </consumer>
    </metrics>
  </admin>
  <container id="default" version="1.0">
    <nodes>
      <jvm options="-Xms24g -Xmx24g -XX:+PrintCommandLineFlags -Xlog:disable"/>
      <node hostalias="vespa-scale-readiness-container1.infra"/>
    </nodes>
    <components>
      <include dir="ext/linguistics"/>
      <include dir="ext/clip"/>
    </components>
    <search>
      <include dir="searchers"/>
    </search>
    <document-processing>
      <chain id="default">
        <documentprocessor id="com.search.items.ItemsRankingProcessor" bundle="vespa"/>
        <documentprocessor id="com.search.items.CreatingUpdateTrackingProcessor" bundle="vespa"/>
      </chain>
    </document-processing>
    <model-evaluation/>
    <document-api/>
    <accesslog type="disabled"/>
  </container>
  <content id="vinted" version="1.0">
    <search>
      <coverage>
        <minimum>0.8</minimum>
        <min-wait-after-coverage-factor>0.2</min-wait-after-coverage-factor>
        <max-wait-after-coverage-factor>0.3</max-wait-after-coverage-factor>
      </coverage>
    </search>
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="items" mode="index"/>
      <document type="items_7d" mode="index" selection="items_7d.created_at &gt; now() - 604800"/>
    </documents>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
        <tuning>
          <searchnode>
            <requestthreads>
              <persearch>8</persearch>
              <search>256</search>
              <summary>64</summary>
            </requestthreads>
            <removed-db>
              <prune>
                <age>86400</age>
              </prune>
            </removed-db>
          </searchnode>
        </tuning>
      </proton>
    </engine>
    <group>
      <distribution partitions="1|*"/>
      <group distribution-key="1" name="group1">
        <node distribution-key="2" hostalias="vespa-scale-readiness-data1.infra"/>
      </group>
    </group>
  </content>
</services>
vekterli commented 2 weeks ago

The spikes you are observing are almost certainly caused by flushing of in-memory data structures to disk, which requires temporary disk usage that is proportional to the memory used by that data structure (in this case presumably a large tensor attribute).

As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.

The automatic feed blocking mechanisms are not currently clever enough to anticipate the impact that future flushes will have based on the already fed data. We should ideally look at the ratio of host memory to disk and automatically derive a reasonable default block threshold based on this—it is clear that the default limits are not appropriate for high memory + low disk setups.
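
Until such automatic derivation exists, the block thresholds can be set explicitly per content cluster in services.xml, so feeding is blocked with more headroom left for flushing. A sketch only; the 0.75/0.80 values are illustrative, not a tuned recommendation for your hardware:

<content id="vinted" version="1.0">
  <tuning>
    <resource-limits>
      <!-- external feed is blocked when utilization crosses these fractions -->
      <disk>0.75</disk>
      <memory>0.80</memory>
    </resource-limits>
  </tuning>
  <!-- ... rest of the content cluster config as posted above ... -->
</content>

Note that a lower disk limit only buys headroom for the temporary disk usage during flush and compaction; it does not replace the ~3x disk-to-memory sizing rule above.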

buinauskas commented 2 weeks ago

I have to admit that our test hardware is quite unusual, but we have to work with what we've got. It's good that we discovered this under these circumstances.

As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.

We'll keep that in mind.

We have now reduced our test dataset size, and we're glad to know what caused the problem. Should the issue be left open? It does seem like a bug in a rare edge case, and not hugely important given how unlikely it is to happen.

vekterli commented 1 week ago

Should the issue be left open? It does seem like a bug in a rare edge case, and not hugely important given how unlikely it is to happen.

I'm leaving the issue open for now, as it'd be a good thing to detect and at least warn about.