
Feeding does not respect resource limits, crashes node #32288

Open buinauskas opened 2 weeks ago

buinauskas commented 2 weeks ago

Describe the bug

We have a dedicated single-node Vespa deployment for testing features and optimizations; it helps us predict how changes will scale to larger deployments.

This test deployment has more usable memory than usable disk. Could this be an issue for the resource limiter?

To Reproduce

Steps to reproduce the behavior:

  1. Seed ~80M documents into Vespa
  2. Update these documents using Vespa's partial update feature to attach embeddings; there are ~3 photo embeddings per document
  3. After some time, disk usage starts spiking and reaches 100%
  4. The machine reports that too many inodes are used
  5. The node goes down

This is the relevant embedding schema; we attach photo CLIP embeddings as a mixed tensor, where each label in the mapped photo_id dimension is the unique photo ID associated with that document.

field photo_embeddings type tensor<bfloat16>(photo_id{}, embedding[512]) {
    indexing: attribute | index
    attribute {
        fast-rank
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 96
        }
    }
}
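
To make step 2 concrete, a single partial update that attaches one photo embedding under the mapped photo_id dimension looks roughly like the sketch below. The document id, namespace, photo id, and values are made up, and the values array is truncated; a real update carries all 512 numbers per photo. create: true is shown because we also track update_with_create in our metrics config.

{
    "update": "id:mynamespace:items::123456789",
    "create": true,
    "fields": {
        "photo_embeddings": {
            "add": {
                "blocks": [
                    {
                        "address": { "photo_id": "987654321" },
                        "values": [0.0131, -0.2084, 0.0457, 0.1198]
                    }
                ]
            }
        }
    }
}

An assign update of the full tensor is the alternative when all photo embeddings for a document are replaced at once; the add form keeps the existing blocks and only inserts the new photo.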

Expected behavior

Feed operations are blocked by the resource limiter before the disk fills up, instead of the node crashing.
Screenshots

(screenshot omitted)

Environment (please complete the following information):

Vespa version 8.363.17

Additional context

These are the relevant log lines, in the sequence they occurred:

Aug 27, 2024 @ 17:16:48.000 what():  Fatal: Writing 2097152 bytes to '/opt/vespa/var/db/vespa/search/cluster.vinted/n2/documents/items/0.ready/attribute/photo_embeddings/snapshot-230031437/photo_embeddings.dat' failed (wrote -1): No space left on device

Aug 27, 2024 @ 17:16:48.000 PC: @     0x7faea85ef52f  (unknown)  raise

Aug 27, 2024 @ 17:16:48.000 terminate called after throwing an instance of 'std::runtime_error'

Aug 27, 2024 @ 17:16:48.000 *** SIGABRT received at time=1724768208 on cpu 64 ***

Aug 27, 2024 @ 17:17:04.000 Write operations are now blocked: 'diskLimitReached: { action: "add more content nodes", reason: "disk used (0.999999) > disk limit (0.9)", stats: { capacity: 475877605376, used: 475877257216, diskUsed: 0.999999, diskLimit: 0.9}}'

Aug 27, 2024 @ 17:17:21.000 Unable to get response from service 'searchnode:2193:RUNNING:vinted/search/cluster.vinted/2': Connect to http://localhost:19107 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused

This is our services.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<services version="1.0">
  <admin version="2.0">
    <slobroks>
      <slobrok hostalias="vespa-scale-readiness-cfg1.infra"/>
    </slobroks>
    <configservers>
      <configserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    </configservers>
    <cluster-controllers>
      <cluster-controller hostalias="vespa-scale-readiness-cfg1.infra"/>
    </cluster-controllers>
    <adminserver hostalias="vespa-scale-readiness-cfg1.infra"/>
    <metrics>
      <consumer id="custom-metrics">
        <metric-set id="default"/>
        <metric id="update_with_create.count"/>
      </consumer>
    </metrics>
  </admin>
  <container id="default" version="1.0">
    <nodes>
      <jvm options="-Xms24g -Xmx24g -XX:+PrintCommandLineFlags -Xlog:disable"/>
      <node hostalias="vespa-scale-readiness-container1.infra"/>
    </nodes>
    <components>
      <include dir="ext/linguistics"/>
      <include dir="ext/clip"/>
    </components>
    <search>
      <include dir="searchers"/>
    </search>
    <document-processing>
      <chain id="default">
        <documentprocessor id="com.search.items.ItemsRankingProcessor" bundle="vespa"/>
        <documentprocessor id="com.search.items.CreatingUpdateTrackingProcessor" bundle="vespa"/>
      </chain>
    </document-processing>
    <model-evaluation/>
    <document-api/>
    <accesslog type="disabled"/>
  </container>
  <content id="vinted" version="1.0">
    <search>
      <coverage>
        <minimum>0.8</minimum>
        <min-wait-after-coverage-factor>0.2</min-wait-after-coverage-factor>
        <max-wait-after-coverage-factor>0.3</max-wait-after-coverage-factor>
      </coverage>
    </search>
    <redundancy>1</redundancy>
    <documents garbage-collection="true">
      <document type="items" mode="index"/>
      <document type="items_7d" mode="index" selection="items_7d.created_at &gt; now() - 604800"/>
    </documents>
    <engine>
      <proton>
        <searchable-copies>1</searchable-copies>
        <tuning>
          <searchnode>
            <requestthreads>
              <persearch>8</persearch>
              <search>256</search>
              <summary>64</summary>
            </requestthreads>
            <removed-db>
              <prune>
                <age>86400</age>
              </prune>
            </removed-db>
          </searchnode>
        </tuning>
      </proton>
    </engine>
    <group>
      <distribution partitions="1|*"/>
      <group distribution-key="1" name="group1">
        <node distribution-key="2" hostalias="vespa-scale-readiness-data1.infra"/>
      </group>
    </group>
  </content>
</services>
vekterli commented 2 weeks ago

The spikes you are observing are almost certainly caused by flushing of in-memory data structures to disk, which requires temporary disk usage that is proportional to the memory used by that data structure (in this case presumably a large tensor attribute).

As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.

The automatic feed blocking mechanisms are not currently clever enough to anticipate the impact that future flushes will have based on the already fed data. We should ideally look at the ratio of host memory to disk and automatically derive a reasonable default block threshold based on this—it is clear that the default limits are not appropriate for high memory + low disk setups.
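
Until such automatic derivation exists, the block thresholds can be set explicitly per content cluster in services.xml, so feeding is blocked with more headroom left for flushing. A sketch only; the 0.75/0.80 values are illustrative, not a tuned recommendation for your hardware:

<content id="vinted" version="1.0">
  <tuning>
    <resource-limits>
      <!-- external feed is blocked when utilization crosses these fractions -->
      <disk>0.75</disk>
      <memory>0.80</memory>
    </resource-limits>
  </tuning>
  <!-- ... rest of the content cluster config as posted above ... -->
</content>

Note that a lower disk limit only buys headroom for the temporary disk usage during flush and compaction; it does not replace the ~3x disk-to-memory sizing rule above.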

buinauskas commented 2 weeks ago

I have to admit that our test hardware is quite unusual, but we have to work with what we've got. It's good that we discovered this under these circumstances.

As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.

We'll keep that in mind.

We have now reduced our test dataset size, and we're glad to know what caused the problem. Should the issue be left open? It does seem like a bug in a rare edge case, and not hugely important given how unlikely it is to happen.

vekterli commented 1 week ago

Should the issue be left open? It does seem like a bug in a rare edge case, and not hugely important given how unlikely it is to happen.

I'm leaving the issue open for now, as it'd be a good thing to detect and at least warn about.