vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Reindexing is getting stalled #30513

Open dainiusjocas opened 6 months ago

dainiusjocas commented 6 months ago

Describe the bug When reindexing is triggered and one or more of the synthetic fields in the indexing scripts invoke embed, progress appears to stall.

To Reproduce Steps to reproduce the behavior:

  1. On an existing index with a field like:
    field chunks type array<string> {}
  2. Add a synthetic field:
    field colbert type tensor<int8>(context{}, token{}, v[16]) {
        indexing: input chunks | embed colbert context | attribute
    }
  3. Trigger reindexing.
  4. After some initial progress (see screenshot below), the reindexing stops advancing.
  5. Sometimes the reindexing fails with this status:
    {
      "enabled": true,
      "clusters": {
        "realm": {
          "pending": {},
          "ready": {
            "realm": {
              "readyMillis": 1709661753702,
              "speed": 1.0,
              "cause": "reindexing for an unknown reason",
              "startedMillis": 1709664660006,
              "endedMillis": 1709701153065,
              "message": "PROCESSING_FAILURE: ReturnCode(PROCESSING_FAILURE, [from content node 1] Time is up.)",
              "progress": 0.0,
              "state": "failed"
            }
          }
        }
      }
    }
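
For triage, the status above can be summarized with a short script. This is only a sketch: it assumes the payload has the shape shown in step 5, and that the keys under "ready" are document types (an assumption on my part; here the key happens to match the cluster name). The function name `summarize_reindexing` is made up for illustration. For the payload above it shows that the run lasted a little over ten hours between `startedMillis` and `endedMillis` and still ended with `progress` at 0.0.

```python
import json
from datetime import timedelta

# Trimmed copy of the status payload from step 5 of the report.
STATUS_JSON = """
{
  "enabled": true,
  "clusters": {
    "realm": {
      "pending": {},
      "ready": {
        "realm": {
          "readyMillis": 1709661753702,
          "startedMillis": 1709664660006,
          "endedMillis": 1709701153065,
          "message": "PROCESSING_FAILURE: ReturnCode(PROCESSING_FAILURE, [from content node 1] Time is up.)",
          "progress": 0.0,
          "state": "failed"
        }
      }
    }
  }
}
"""


def summarize_reindexing(status: dict) -> list[dict]:
    """Return one row per (cluster, entry under "ready") with state, progress and elapsed time."""
    rows = []
    for cluster, info in status.get("clusters", {}).items():
        for doc_type, entry in info.get("ready", {}).items():
            started = entry.get("startedMillis")
            ended = entry.get("endedMillis")
            elapsed = None
            # Elapsed time is only known once the run has both started and ended.
            if started is not None and ended is not None:
                elapsed = timedelta(milliseconds=ended - started)
            rows.append({
                "cluster": cluster,
                "type": doc_type,
                "state": entry.get("state"),
                "progress": entry.get("progress"),
                "elapsed": elapsed,
                "message": entry.get("message"),
            })
    return rows


rows = summarize_reindexing(json.loads(STATUS_JSON))
for row in rows:
    print(f'{row["cluster"]}/{row["type"]}: state={row["state"]} '
          f'progress={row["progress"]} elapsed={row["elapsed"]}')
```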

Expected behavior I understand that inference on CPU takes time and that embedding arrays of strings is not the best of ideas. It would be great to have more control over reindexing:

Also, more visibility into progress would be nice, for example a count of documents reindexed so far. Furthermore, it would be great if recalculating embeddings for synthetic fields could be skipped when the input has not changed, for example by comparing hashes of the input.
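
To make the hash idea concrete, here is a minimal sketch of what I have in mind. It is not an existing Vespa feature or API: `content_hash`, `needs_embedding` and the notion of a hash stored next to the document are all hypothetical, and the sketch only illustrates the check itself.

```python
import hashlib
import json
from typing import Optional


def content_hash(chunks: list[str]) -> str:
    """Stable SHA-256 over the input strings (order-sensitive)."""
    # JSON encoding keeps chunk boundaries unambiguous: ["ab", "c"] != ["a", "bc"].
    payload = json.dumps(chunks, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_embedding(chunks: list[str], stored_hash: Optional[str]) -> bool:
    """True when no hash is stored yet or the input changed since the last embed."""
    return stored_hash != content_hash(chunks)


chunks = ["first chunk of text", "second chunk of text"]
stored = content_hash(chunks)  # hypothetical: saved alongside the embedding

print(needs_embedding(chunks, stored))               # unchanged input
print(needs_embedding(chunks + ["a new chunk"], stored))  # changed input
print(needs_embedding(chunks, None))                 # never embedded
```

With a check like this, a reindexing pass would only pay the inference cost for documents whose `chunks` actually changed. A real implementation would also need to re-embed when the embedder model or its configuration changes, which this sketch does not cover.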

Screenshots Added dashboard.

Screenshot 2024-03-08 at 09 38 35

Environment (please complete the following information):

Vespa version 8.307.19

Additional context Slack thread. An interesting discovery: when persearch was reduced from the number of available CPU cores to 1, the reindexing started progressing.

kkraune commented 1 month ago

@jonmv can you look at the timeout value?