vespa-engine / vespa

AI + Data, online. https://vespa.ai

Feed sizing - unable to increase throughput but resources unutilized #22533

Closed. nehajatav closed this issue 2 years ago.

nehajatav commented 2 years ago

Describe the bug
We are running the feed client and tuning the number of feeder client instances. When we increase the number of feeders, the host running the feeders shows no resource contention (~10% CPU and memory used). The container has allocatedMemory set to 90% of 64G; its memory usage is nearly constant, as is its CPU at ~4 of 8 cores, when feeders are added. The content nodes have feeding concurrency 0.8 (8 CPUs available); their utilization is also nearly constant at ~3 of 8 cores and 46/64G memory (feed block at ~60). With only one feeder instance we see ~500 ms average latency for every 2k ingestions. When we bump it up to 3 instances, average latency rises to ~2000 ms while resource utilization stays constant (container memory and CPU rise briefly but remain well within limits and settle back down). Overall throughput remains constant while the resources remain underutilized. Is this expected?
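For context, a minimal sketch of where the two knobs mentioned above live in services.xml. This is not the configuration from the linked comment: the ids, document type, and host aliases are placeholders, and the element paths follow the public services.xml reference, so verify them against your Vespa version.

```xml
<!-- Sketch only: placeholder ids and host aliases, paths per the services.xml reference. -->
<container id="feedcontainer" version="1.0">
    <document-api/>
    <nodes>
        <!-- give ~90% of the node's 64G to the JVM heap -->
        <jvm allocated-memory="90%"/>
        <node hostalias="container1"/>
    </nodes>
</container>

<content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <engine>
        <proton>
            <tuning>
                <searchnode>
                    <feeding>
                        <!-- fraction of cores each content node may use for feeding -->
                        <concurrency>0.8</concurrency>
                    </feeding>
                </searchnode>
            </tuning>
        </proton>
    </engine>
    <documents>
        <document type="mydoc" mode="index"/>
    </documents>
    <nodes>
        <node hostalias="content1" distribution-key="0"/>
    </nodes>
</content>
```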

To Reproduce Steps to reproduce the behavior:

  1. Bring up a multi-node cluster as per the hosts.xml and services.xml in this comment, with container allocatedMemory set to 90%: https://github.com/vespa-engine/vespa/issues/22315#issuecomment-1113074776
  2. Keep only one feeder instance and note the latency.
  3. Bump it up to three feeder instances and note the latency.
  4. Observe that overall throughput remains constant while resources remain underutilized.

Expected behavior The latency should not have gone up, and the overall throughput should have increased.

Environment (please complete the following information):
OS: Docker image vespa:7.559.12
Infrastructure: Kubernetes (Major: "1", Minor: "21", GoVersion: "go1.16.6", Compiler: "gc", Platform: "linux/amd64")

Vespa version 7.559.12 compiled with go1.16.13 on linux/amd64

jobergum commented 2 years ago

You have reached a bottleneck; attempting to push more load only increases latency due to queueing. This is most likely related to IO.

What type of IO are these instances configured with? Indexing is write-heavy: all document operations are written to the transaction log, and each batch is synced for durability. Syncing is known to have a high cost on network-attached storage. This can be toggled with sync-transactionlog.
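For reference, a minimal sketch of where that toggle sits in services.xml; the cluster id and document type below are placeholders, not taken from this issue.

```xml
<!-- Sketch only: skipping the per-batch sync of the transaction log trades some
     durability on abrupt node loss for lower write latency on slow remote storage.
     "mycluster" and "mydoc" are placeholders. -->
<content id="mycluster" version="1.0">
    <redundancy>2</redundancy>
    <engine>
        <proton>
            <!-- default is true; set to false to skip syncing each batch -->
            <sync-transactionlog>false</sync-transactionlog>
        </proton>
    </engine>
    <documents>
        <document type="mydoc" mode="index"/>
    </documents>
    <nodes>
        <node hostalias="content1" distribution-key="0"/>
    </nodes>
</content>
```

Disabling the sync is only worth considering when the transaction log lives on high-latency network-attached storage, per the comment above.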

nehajatav commented 2 years ago

@jobergum this particular one was on a Ceph-backed PVC

jobergum commented 2 years ago

Yes. I'm not familiar with Ceph, but this smells like slow, high-latency remote storage. See my comment above on turning off syncing for remote storage.

nehajatav commented 2 years ago

We tried turning off sync-transactionlog on Ceph, but we didn't get any improvement in feeding. On NAS, we saw a minor improvement.