vespa-engine / vespa

AI + Data, online. https://vespa.ai

Vespa Query and Ingestion Timeout #22276

Closed · 107dipan closed this 2 years ago

107dipan commented 2 years ago

Describe the bug
We are currently running a Vespa cluster with 3 groups, each with 6 content nodes. During load tests we found that many requests were timing out: 428 of 2571 requests. On closer analysis we found that all the nodes of group 0 were down. We want to understand whether these requests are timing out because they are being routed to the unhealthy group. If they are not going to the unhealthy group, could you tell us why they time out, given that we have 2 other healthy groups? In our search requests we pass a timeout of 120s.
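
The queries in the load test pass the timeout roughly like this (a minimal sketch; the endpoint, YQL, and hits count are simplified placeholders rather than our actual query):

```python
import requests

# Sketch of a query with an explicit Vespa timeout; adjust endpoint and YQL
# to the actual application.
VESPA_ENDPOINT = "http://localhost:8080/search/"

response = requests.get(
    VESPA_ENDPOINT,
    params={
        "yql": "select * from sources * where true",  # placeholder query
        "timeout": "120s",   # per-request Vespa timeout, as used in the load test
        "hits": 10,
    },
    timeout=130,  # client-side timeout slightly above the Vespa timeout
)
print(response.status_code, response.elapsed.total_seconds())
```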

Environment:

Vespa version 7.559.12


bratseth commented 2 years ago

The group that is down will not participate in queries. It's hard to tell from this limited information why you're getting timeouts, but you are probably overloading the two remaining groups.

Start with a low load and increase it stepwise, measuring the latency at each load increment. You'll get a hockey-stick-shaped curve, and the knee of the curve is the maximum load you can apply. Here you're probably well above that knee, which is why you see very high response times.
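
A rough sketch of such a ramp, assuming a single container endpoint and a placeholder YQL query; the concurrency steps and request counts are arbitrary, the point is just to record latency at each load level:

```python
import time
import statistics
import concurrent.futures
import requests

ENDPOINT = "http://localhost:8080/search/"            # placeholder
QUERY = {"yql": "select * from sources * where true",  # placeholder
         "timeout": "120s"}

def one_query():
    """Run a single query and return its wall-clock latency in seconds."""
    start = time.monotonic()
    requests.get(ENDPOINT, params=QUERY, timeout=130)
    return time.monotonic() - start

for concurrency in (1, 2, 4, 8, 16, 32):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: one_query(), range(200)))
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    print(f"concurrency={concurrency} "
          f"mean={statistics.mean(latencies):.3f}s p95={p95:.3f}s")
    # The knee is where mean/p95 latency starts rising sharply as
    # concurrency increases; load beyond that point just queues up.
```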

Once you have determined the max load (the knee), investigate metrics to identify which resource you are running out of when you apply that load, e.g. CPU on the content nodes.

Once you know that, you can try to increase capacity: either by adding more of that resource, or by changing your application or queries to use less of it.
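
As a starting point for that investigation, something like this can pull node-level metrics from each content node (hostnames are placeholders; verify the metrics-proxy port and path, here 19092 and /metrics/v2/values, against the monitoring documentation for your Vespa version):

```python
import requests

CONTENT_NODES = ["content0.example.com", "content1.example.com"]  # placeholders

def walk(obj, found):
    """Recursively collect metric name/value pairs from the JSON payload,
    without assuming the exact nesting of the response."""
    if isinstance(obj, dict):
        values = obj.get("values")
        if isinstance(values, dict):
            found.update(values)
        for v in obj.values():
            walk(v, found)
    elif isinstance(obj, list):
        for v in obj:
            walk(v, found)

for host in CONTENT_NODES:
    payload = requests.get(f"http://{host}:19092/metrics/v2/values",
                           timeout=10).json()
    metrics = {}
    walk(payload, metrics)
    # Print anything that looks CPU-, memory- or disk-related.
    for name, value in sorted(metrics.items()):
        if any(k in name for k in ("cpu", "memory", "disk")):
            print(host, name, value)
```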

107dipan commented 2 years ago

Are there any logs or tools we can use to see which group/content node the ingestion requests are routed to? I think we can get this for queries from trace-level logs.
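
For example, something along these lines; with a high enough tracelevel the query trace should show which group/content nodes were involved (exact trace contents vary by version), and I believe the /document/v1 API accepts a similar tracelevel parameter for feeds. Endpoint and YQL below are placeholders:

```python
import json
import requests

response = requests.get(
    "http://localhost:8080/search/",
    params={
        "yql": "select * from sources * where true",  # placeholder query
        "tracelevel": 5,     # higher levels include dispatch details
        "timeout": "120s",
    },
    timeout=130,
)
# The trace is returned under the "trace" key of the result JSON.
print(json.dumps(response.json().get("trace", {}), indent=2))
```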

nehajatav commented 2 years ago

The issue affects both query and ingest: roughly one in every 7-8 requests hangs in an unknown state (eventually leading to a timeout), even when we use a single thread. The only thing unusual about the cluster is that the content nodes of group 0 are down. What diagnostics should we capture to get a clue about what the issue could be?

jobergum commented 2 years ago

I can recommend the following starting points:

These sample apps go through configuration server availability, debugging, and the various health status pages.
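
For example, a quick check of those health status pages could look roughly like this (hostnames are placeholders; 19071 and 8080 are the common default ports for the config server and a container, and content-node state ports vary by deployment, so check your services with vespa-model-inspect):

```python
import requests

NODES = {  # placeholder hostnames
    "configserver": "http://cfg0.example.com:19071/state/v1/health",
    "container": "http://container0.example.com:8080/state/v1/health",
}

for name, url in NODES.items():
    try:
        status = requests.get(url, timeout=5).json()["status"]["code"]
    except Exception as exc:  # connection refused, timeout, unexpected payload
        status = f"unreachable ({exc})"
    print(f"{name}: {status}")
```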

In addition, capture metrics: both Vespa metrics and system-level metrics (memory usage, disk usage, open files, etc.).
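
On the system side, a per-node snapshot along these lines (a sketch using the third-party psutil package, not a full collector) can be captured alongside the Vespa metrics and correlated with the timeouts; run it on the node itself:

```python
import psutil  # third-party; pip install psutil

# Snapshot memory, disk, and open-file usage on this node.
mem = psutil.virtual_memory()
disk = psutil.disk_usage("/")
open_fds = sum(p.info["num_fds"] or 0
               for p in psutil.process_iter(["num_fds"]))

print(f"memory used: {mem.percent}%")
print(f"disk used (/): {disk.percent}%")
print(f"open file descriptors (all processes): {open_fds}")
```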

jobergum commented 2 years ago

I'm resolving this; please see the linked resources.