kiali / kiali

Kiali project, observability for the Istio service mesh
https://www.kiali.io

Performance issues with the Kiali UI #5867

Closed RaiAnandKr closed 1 year ago

RaiAnandKr commented 1 year ago

Hey folks, apologies in advance for the large set of points/questions below (some of them are vague as well), but I wanted to talk about the general slowness of the Kiali UI.

So, I have been exploring Istio + Kiali for our k8s cluster for the last few months and I have grown to like Kiali a lot, especially all the integrations provided via the UI. I am slowly exposing this UI to more developers at my workplace, and their feedback is mostly along the lines of "pretty neat integrations, but the UI is a bit clunky". I might be a bit biased towards Kiali since I have been using it for a few months now and may have gotten used to a bit of slowness in the UI, but after that feedback I wanted to explore this angle more.

Here are a few observations from our Kiali and k8s setup:

  1. Does Kiali not expose any performance metrics? I couldn't find anything in the docs. It would be so useful for monitoring the performance of our Kiali deployment, as well as for sharing those standard metrics here to report any slowness.

  2. To be clear, it's not a huge slowness, and maybe what I am seeing is expected. Loading the app or versioned-app graph, for example, almost always takes 6-8s in the UI. More than half of the time is spent in this one API call: https://kiali.dev.corp.arista.io/kiali/api/namespaces/graph?duration=300s&graphType=app... (to be fair, that's the main API call anyway). Attached is the network activity for one of the graph refreshes.

Screenshot 2023-02-21 at 11 21 20 PM

  3. Loading the list of Applications takes anywhere between 4-6s, and more than half of the time is spent in calculating the health of those apps. Attached is the network activity for that. Screenshot 2023-02-21 at 11 26 49 PM

  4. We have around 1k services and 4k pods in our cluster.

I have already assessed the health of our Prometheus setup via its own UI, and the k8s API server as well, and they all look fine (the native k8s dashboard loads much faster, which suggests there isn't much of an issue with the k8s API server).

So here are a few of my questions:

  1. Are these numbers expected, perhaps because of the cluster size, or do they look totally off to you? The devs must have done some performance benchmarking, but I couldn't find any benchmarks in the Kiali docs. Having those benchmarks available would help set (or reset) users' expectations around performance.

  2. As I said, I don't think the underlying Prometheus and k8s server are contributing much to the slowness. Is there anything else I should be checking that could be contributing? Is there some sort of checklist a user can run through to tune Kiali's performance, or at least to help explain it?

  3. This is more of a personal preference and a very specific opinion, so feel free to discard it, but I think one way to make the UI feel slicker could be to load as little info as possible by default. To give an example, should we be computing health by default when loading the Applications page? A lot of the time I would just click through to the one app I manage, so the health information of all the other apps is not useful to me. Perhaps we only show it if the user clicks some expand button which is expected to be slow. Some workflows might then require two clicks, but that makes the default state much faster without losing much info for most use cases. Maybe this feature could be controlled by a knob so that users can choose whether they want to see health by default or prefer it the other way round.

  4. Is performance/scalability something the developers are focusing on internally "right now", or do we think we are mostly fine? I ask because we are totally open to contributing to Kiali if the devs have identified issues/changes which can improve performance and could use some help with bandwidth :)

Thank you for bearing with me.

As I said, I have been running Kiali for a few months now, so I have tried/tested and observed this across different versions (3, to be precise), and hence I am fairly sure that this is not something version-specific.

jmazzitelli commented 1 year ago
  1. Kiali does expose metrics. Look at the Kiali app inside Kiali itself.. there should be a tab Internal Metrics or something like that. The config is here: https://kiali.io/docs/configuration/kialis.kiali.io/#.spec.server.observability.metrics
  2. A few sprints ago we did concentrate on perf issues, and that work was completed. But of course, if Kiali can be made even more performant, the Kiali team is all ears.
  3. Yes, large environments like yours would likely cause some slowness. How many namespaces? The number of namespaces has a real effect on Kiali performance.
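
For reference on point 1, a minimal sketch of the relevant Kiali CR fields, per the linked doc (the port value here is just illustrative):

    spec:
      server:
        observability:
          metrics:
            enabled: true
            port: 9090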
RaiAnandKr commented 1 year ago
  1. Kiali does expose metrics. Look at the Kiali app inside Kiali itself.. there should be a tab Internal Metrics or something like that. The config is here: https://kiali.io/docs/configuration/kialis.kiali.io/#.spec.server.observability.metrics

neat. Do you think we should have a dedicated section in the docs on this (excuse me if it's already there), rather than needing to dig through the list of configs in the Kiali CR?

3. Yes, large environments like yours would likely cause some slowness. How many namespaces? The number of namespaces has a real effect on Kiali performance.

100+

  1. The devs must have done some performance benchmarking, but I couldn't find any benchmarks in the Kiali docs. Having those benchmarks available would help set (or reset) users' expectations around performance.

What do you think about this?

  • But of course, if Kiali can be made even more performant, the Kiali team is all ears.

I understand that we are doing a lot of computation to load a page which is rich in info. Hence I think one way to go is to be only as info-rich by default as needed, so that the UI doesn't feel clunky, and to display more info only on the basis of knobs. I am saying this mostly in the context of the Applications and Workloads pages: e.g. don't compute/display health by default and have a toggle for it in the UI, given how slow the health computation is and that individual app owners don't need to see the health of all the apps. The second expensive query in loading those pages is https://kiali.dev.corp.arista.io/kiali/api/mesh/tls, and that also feels like info we can do without in the default case. If someone has the required toggles set, we do query this info, but then users are prepared and expect the page load to take 6s rather than just 2-3s.

jmazzitelli commented 1 year ago

neat. Do you think we should have a dedicated section in the docs on this (excuse me if it's already there) rather than needing to grok through the list of configs in Kiali CR?

Could probably be useful to have something here, with the rest of the built-in dashboards (the Kiali metric dashboard is just one of the built-in dashboards).

jshaughn commented 1 year ago

Hi @RaiAnandKr. First, thanks so much for taking the time to write up this feedback. It's rare to get such good feedback, and it's much appreciated. I'll make a first pass at some answers, but I expect there will be a lot of back-and-forth and other people adding comments.

@jmazzitelli already makes good points above, I'll duplicate a little of what he says and try adding a bit more...

First, make sure you're not running one of the recent versions with the perf regression. That's probably not your issue because you have been working with Kiali for a while, but just to make sure (1.62.0, 1.63.0, 1.63.1 are bad).

Does Kiali not expose any performance metrics? I couldn't find anything in the docs. It would be so useful for monitoring the performance of our Kiali deployment, as well as for sharing those standard metrics here to report any slowness.

Kiali does collect some metrics, although it's not super-extensive and only times things in the back-end, so it doesn't really look at issues in the client or interactions between the client and server. If you navigate to the Kiali workload detail, typically in istio-system or the control-plane namespace, you should see an additional tab for Kiali Internal Metrics, which will offer some charts and maybe some insight.

More than half of the time is spent in this one API call: https://kiali.dev.corp.arista.io/kiali/api/namespaces/graph?duration=300s&graphType=app... (to be fair, that's the main API call anyway).

Graph generation is intensive and it's affected by a lot of things. It interacts heavily with Prometheus, k8s and Istio. You've already said your Prometheus and k8s APIs seem to be responding well, but certainly if your Prometheus is overloaded it will affect graph generation (and charts, etc.). Make sure you dedicate enough resources. Also, see https://kiali.io/docs/configuration/p8s-jaeger-grafana/prometheus/#prometheus-tuning for a few tips.

The number of namespaces selected (not necessarily the total number of namespaces you have in your mesh) when generating the graph is important. In short, you end up generating a graph for each and then those graphs get "stitched" together.

Some options will increase the graph generation time because it has to do extra work. Things like response-time edge labels, the Display Security option, and a few other things.

The Traffic dropdown can speed things up by letting you eliminate TCP or HTTP traffic, for example.

The Duration dropdown is also important: the larger the time period, the more data being aggregated.

Of course, sometimes it will just take time. If you have a lot of services and a lot of traffic, there are just a lot of metrics to deal with. Istio reports traffic metrics from both the source and destination proxies, and this reporting has to be queried, merged, etc...

...I think one way to make the UI feel slicker could be to load as little info as possible by default.

We primarily think of Kiali as a tool for identifying issues in your mesh, and a little less as a general console. So we have historically given health info, validations, and misconfigurations a high priority, usually eating that expense. Of course, we are always looking for ways to improve, and we don't always know what's slow for different users; that's what's important about your feedback. Your suggestion is good: we can maybe look at adding more knobs and more ability to set ui-defaults in the configuration. So, for example, the default for a health knob may be On, but for your install perhaps the default could be configured to be Off.

Is performance/scalability something which the developers are focussing on internally "right now" or do we think that we are mostly fine?

To be honest, it's not a current focus. We actually went through a large period of scale and perf enhancements in late 2021 and about halfway into 2022, both in speed and in the ability to visualize large meshes (our internal target is about 100-200 namespaces and 1000-5000 services, although we don't really publish anything about scale goals because there are such huge differences in the way people deploy).

One thing we are considering is moving Kiali from a real-time system to a near-real-time system, basically precomputing almost everything and making the client a fairly thin presentation layer (and likely asking for more resources in the backend). But this is likely not happening in this calendar year.

RaiAnandKr commented 1 year ago

Could probably be useful to have something here, with the rest of the built-in dashboards (the Kiali metric dashboard is just one of the built-in dashboards).

This makes a lot of sense. Should I open a different issue to track that specific change?

Kiali does collect some metrics, although it's not super-extensive and only times things in the back-end, so it doesn't really look at issues in the client or interactions between the client and server. If you navigate to the Kiali workload detail, typically in istio-system or the control-plane namespace, you should see an additional tab for Kiali Internal Metrics, which will offer some charts and maybe some insight.

I wouldn't have been able to guess about this tab, but it sounds very useful (and hence I am advocating for advertising it more in the docs). So I enabled Kiali's own metrics collection and I can see the metrics in my Prometheus UI, but the charts on Kiali show nothing. All the below data are available and have values:

Screenshot 2023-02-22 at 1 30 23 PM

but on Kiali, I see this

Screenshot 2023-02-22 at 3 53 20 PM

The same is true for Kiali's traces as well: I can see the traces for the kiali app in the Jaeger UI but not in the Kiali UI. Note that I can see metrics and traces for all the other apps on Kiali, so I am not sure what's special about Kiali's own metrics/traces. It should be querying the usual Prometheus and Jaeger endpoints configured via external_services, right?

I am running Kiali v1.61

The number of namespaces selected (not necessarily the total number of namespaces you have in your mesh) when generating the graph is important

To be clear, I am trying to generate this for just one namespace, and while we have thousands of services/pods, I have only onboarded ~10 different workloads to Istio. The graph generation still takes 6-8s. Maybe that's expected, or maybe not; I don't know. That's why I am advocating for sharing some benchmarks in the Kiali docs. I know there are a lot of variables in comparing one's data with the benchmarks, but it could still be useful in setting some expectations among users.

Make sure you dedicate enough resources.

Attached is a snapshot of the mem and CPU usage of the Prometheus pod (the usage is well within the requested limit):

Screenshot 2023-02-22 at 3 48 37 PM

Also, see https://kiali.io/docs/configuration/p8s-jaeger-grafana/prometheus/#prometheus-tuning for a few tips.

I already had plans to trim the envoy-level metrics from Istio and keep only the service-level metrics, so this gives me more incentive to do that. However, apart from the Kiali UI, we also like some of the Grafana dashboards around the Istio control plane and the workload/service dashboards (since we can integrate those into the existing per-app Grafana dashboards maintained by owners), so we can't get rid of all the metrics mentioned in the tuning guide, but we can definitely drop some of them. Thanks for the tip.

we can maybe look at adding more knobs and more ability to set ui-defaults in the configuration. So, for example, the default for a health knob may be On, but for your install perhaps the default could be configured to be Off.

Exactly.

One thing we are considering is moving Kiali from a real-time system to a near-real-time system, basically precomputing almost everything and making the client a fairly thin presentation layer (and likely asking for more resources in the backend). But this is likely not happening in this calendar year.

Exciting stuff. One thing I understand very clearly is that each of the Kiali pages is much more information-rich than a general dashboard (e.g. the native k8s dashboard), and hence we are bound to do a lot of computation. So we can either make the default state of these pages less information-dense (like the health info we discussed above) or do the computation not so much in real time. There is only so much we can improve with the former, so the latter is definitely the more exciting prospect. I know it's still far off, but how do I keep track of that feature? Is there any issue I can follow?

jmazzitelli commented 1 year ago

Could probably be useful to have something here, with the rest of the built-in dashboards (the Kiali metric dashboard is just one of the built-in dashboards).

This makes a lot of sense. Should I open a different issue to track that specific change?

Yes. Open a different issue and label it as "doc" and "enhancement".

So I enabled Kiali's own metrics collection and I can see the metrics in my Prometheus UI, but the charts on Kiali show nothing. ... Note that I can see metrics and traces for all the other apps on Kiali, so I am not sure what's special about Kiali's own metrics/traces. It should be querying the usual Prometheus and Jaeger endpoints configured via external_services, right?

Hmm. That is strange. Obviously, you should be seeing the metrics/traces there. Nothing special. I'll take a look at that and see if I can find out if there is a problem.

jmazzitelli commented 1 year ago

I'm seeing the metrics fine, in both the application and workload views (screenshots attached).

RaiAnandKr commented 1 year ago

Did you test this on Kiali v1.61 (just to rule out any version mismatch)?

I have installed Kiali (and the operator too) in istio-system namespace. This is the additional config I added:

server:
  observability:
    metrics:
      enabled: true
      port: 9090
    tracing:
      collector_url: "http://jaeger-collector.jaeger:14268/api/traces"
      enabled: true

I found out the discrepancy w.r.t traces (not sure about the code causing this though). Kiali is querying https://jaeger.xyz.io/search?service=kiali.istio-system&start=1677076955400000&limit=100 whereas when I check the Jaeger UI, the traces are getting ingested with service=kiali (and not kiali.istio-system).

Not sure what's wrong with displaying the metrics though. A sample Kiali metric in our prometheus DB:

kiali_api_processing_duration_seconds_bucket{app="kiali", app_kubernetes_io_instance="kiali", 
app_kubernetes_io_name="kiali", app_kubernetes_io_part_of="kiali", app_kubernetes_io_version="v1.61.0", 
instance="pod_ip:9090", job="kubernetes-pods", kubernetes_namespace="istio-system", 
kubernetes_node_name="node_name", kubernetes_pod_name="kiali-xyz", le="+Inf", pod_template_hash="xyz", 
route="AppList", version="v1.61.0"}
jmazzitelli commented 1 year ago

I found out the discrepancy w.r.t traces (not sure about the code causing this though). Kiali is querying https://jaeger.xyz.io/search?service=kiali.istio-system&start=1677076955400000&limit=100 whereas when I check the Jaeger UI, the traces are getting ingested with service=kiali (and not kiali.istio-system).

I'm having problems seeing traces, too. This might be an issue that needs to be fixed. I'm not sure. I think I configured it all correctly. This warrants a github issue bug report, at least for someone to investigate.

jshaughn commented 1 year ago

I also cannot reproduce the charting issue for Kiali Internal Metrics. Not sure what is happening there, although those charts are not particularly useful for determining where a specific slowness may be happening in graph generation. You could try to see if you can determine more from the Prometheus metrics themselves: kiali_graph_appender_duration_seconds_sum can indicate if a specific appender is taking the bulk of the time, and kiali_graph_generation_duration_seconds_sum / kiali_graph_generation_duration_seconds_count can show you average times for graph generation as a whole.

jmazzitelli commented 1 year ago

Right... as Jay implies, we don't expose every metric that Kiali emits in that Internal Metrics tab in the UI - only the more basic ones. The metrics in Prometheus might show you something - there are graph-generation-related metrics in there as Jay shows (the appender metric is an important one).

RaiAnandKr commented 1 year ago

yeah, I am going to build a grafana dashboard using these metrics. kiali_graph_generation_duration_seconds_sum / kiali_graph_generation_duration_seconds_count is 4.8s for me btw :|

RaiAnandKr commented 1 year ago

Looking at the Kiali logs for a reload of the graph page, these are the queries it seems to execute:

2023-02-24T06:53:59Z TRC [Prom] GetExistingMetricNames: exec time=[74.582923ms], results count=[10543], looking for count=[10], found count=[10]
enablePrometheusMerge: true
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_tcp_sent_bytes_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_tcp_sent_bytes_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:02Z TRC [Prom] fetchRange: (sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default",response_code=~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol) OR sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default",grpc_response_status=~"^[1-9]$|^1[0-6]$",response_code!~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol))
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_tcp_received_bytes_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:02Z TRC [Prom] fetchRange: (sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default",response_code=~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol) OR sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default",grpc_response_status=~"^[1-9]$|^1[0-6]$",response_code!~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol))
2023-02-24T06:54:02Z TRC [Prom] fetchRange: sum(rate(istio_tcp_received_bytes_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
enablePrometheusMerge: true
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_tcp_sent_bytes_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
enablePrometheusMerge: true
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:45Z TRC [Prom] fetchRange: (sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default",response_code=~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol) OR sum(rate(istio_requests_total{reporter="destination",destination_workload_namespace="default",grpc_response_status=~"^[1-9]$|^1[0-6]$",response_code!~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol))
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_tcp_received_bytes_total{reporter="destination",destination_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_tcp_sent_bytes_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)
2023-02-24T06:54:45Z TRC [Prom] fetchRange: (sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default",response_code=~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol) OR sum(rate(istio_requests_total{reporter="source",source_workload_namespace="default",grpc_response_status=~"^[1-9]$|^1[0-6]$",response_code!~"^0$|^[4-5]\\d\\d$"}[120s])) by (request_protocol))
2023-02-24T06:54:45Z TRC [Prom] fetchRange: sum(rate(istio_tcp_received_bytes_total{reporter="source",source_workload_namespace="default"}[120s])) by (request_protocol)

I ran most of these queries by hand in the Prometheus UI multiple times and they seem to take ~250ms. So, assuming a lot of these queries are being done in parallel, it's surprising to see the graph generation taking 4s+.

I am also exploring tuning external_services.prometheus.cache_duration. 10s seems like too low a default for my setup, where we have the Prometheus scrape interval set to 1 min. Also, it's surprising to not find any mention of this cache_duration config in the Prometheus tuning guide at https://kiali.io/docs/configuration/p8s-jaeger-grafana/prometheus/#prometheus-tuning

external_services.prometheus.cache_duration and kubernetes_config.cache_duration feel like configs which can give some respite from the slowness. Going to try them..
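
For reference, that Prometheus knob sits under external_services in the Kiali CR; a minimal sketch, with the value being just what I plan to try:

    spec:
      external_services:
        prometheus:
          cache_duration: 120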

RaiAnandKr commented 1 year ago

I added

kubernetes_config:
  cache_enabled: true
  cache_duration: 120

but any DestinationRule or VirtualService addition/deletion in the k8s cluster is instantly reflected in the Kiali UI, both in the graph (via the icons) and on the Istio Config page. So in these cases at least, no cache is used. Am I interpreting the usage of these configs wrong?

I also toggled the prometheus.cache_duration config to 120, but this query: https://kiali...io/kiali/api/namespaces/default/apps?health=true&rateInterval=300s takes ~2s on multiple reloads, with or without the config (i.e. with the default prometheus.cache_duration = 7). The Kiali docs indicate that the health queries are somewhat cached ("Kiali maintains an internal cache of some Prometheus queries to improve performance (mainly, the queries to calculate Health indicators)"), so I was expecting that, with a longer Prometheus cache_duration, the second reload onwards would be faster for the next 1-2 mins, but that's not happening.

jshaughn commented 1 year ago

Given the Prom query response times I'm guessing the slowness is more related to the interactions with k8s, when appending information. Look at that kiali_graph_appender_duration_seconds_sum metric more carefully, especially the istio appender times.

RaiAnandKr commented 1 year ago

but any DestinationRule or VirtualService addition/deletion in the k8s cluster is instantly reflected in the Kiali UI, both in the graph (via the icons) and on the Istio Config page. So in these cases at least, no cache is used. Am I interpreting the usage of these configs wrong?

So John clarified this in the Istio Slack channel (I had a related thread open there): the cache gets updated when things change, via some "k8s watches". I was doing some benchmarking today after upgrading to Kiali v1.64 and found the Applications/Workloads/Services pages load faster (compared to the v1.61 I had been running for 2 months), and that's probably due to the internal use of the k8s cache? I just didn't know that it is used internally by default, so I was setting/unsetting kubernetes_config.cache_enabled, seeing the same speed, and getting confused :)

RaiAnandKr commented 1 year ago

Given the Prom query response times I'm guessing the slowness is more related to the interactions with k8s, when appending information. Look at that kiali_graph_appender_duration_seconds_sum metric more carefully, especially the istio appender times.

Here are a few metrics:

  1. kiali_prometheus_processing_duration_seconds_sum / kiali_prometheus_processing_duration_seconds_count
  • query_group="Graph-Generation": 0.0075
  • query_group="Metrics-GetRequestRates": 0.0061

This surely confirms that Prometheus is fine.

  2. kiali_graph_appender_duration_seconds_sum / kiali_graph_appender_duration_seconds_count
  • appender="deadNode": 0.00001
  • appender="istio": 1.96
  • appender="serviceEntry": 0.0097
  • appender="sidecarsCheck": 0.056
  • appender="workloadEntry": 0.001

  3. kiali_graph_generation_duration_seconds_sum / kiali_graph_generation_duration_seconds_count: 4.34

  4. kiali_graph_marshal_duration_seconds_sum / kiali_graph_marshal_duration_seconds_count: 0.0008

  5. kiali_api_processing_duration_seconds_sum / kiali_api_processing_duration_seconds_count (adding the expensive ones which could be relevant to a graph page reload)
  • route="GraphNamespaces": 4.1
  • route="NamespaceTls": 1.12
  • route="NamespaceValidationSummary": 1.15
  • route="IstioCerts": 0.3
  • route="IstioStatus": 0.55
  • route="AppList": 1.7

  6. Two other expensive queries which popped up: kiali_checker_processing_duration_seconds_sum / kiali_checker_processing_duration_seconds_count
  • checker="checkers.TelemetryChecker": 5.2
  • checker="checkers.WasmPluginChecker": 5.5
  (all other checkers take less than 0.001s)

I have a few questions:

  1. (the most obvious one) - what conclusions are you drawing?
  2. Graph generation is taking ~4s, and ~2s of it is spent in the appenders (the istio appender takes most of it, basically). Is the remaining ~2s, spent on whatever else graph generation does, expected?
  3. What all does appender="istio" include?
  4. Apart from the API to get the graph, the APIs for route="NamespaceTls" and route="NamespaceValidationSummary" are expensive as well. Is that expected?
  5. Can we do anything about these especially from a k8s POV? Does tuning the k8s cache configs help here?
RaiAnandKr commented 1 year ago

cc @jshaughn if you could guide me further based on these stats.

jshaughn commented 1 year ago

I'm not exactly sure what to recommend at the moment. It seems true that the istio appender is one reason for the slowness. This appender queries k8s but the information is typically cached and it should not result in a major hold up most of the time (unless we have a caching issue). For what it's worth, this appender decorates the graph with a variety of information that we get from the configuration, it's what detects virtual services, circuit breakers, gateways, etc. I'm not exactly sure where the other 2s is spent. I'm a little curious what you would see if you took the appenders out of the equation.

In your browser or from a command line try (this may work as-is only if you are using anonymous auth):

This one has the appenders:

http://YOURHOST:YOURPORT/kiali/api/namespaces/graph?duration=600s&graphType=workload&appenders=deadNode,istio,serviceEntry,sidecarsCheck,workloadEntry,health&namespaces=YOURNAMESPACE

This one does not:

http://YOURHOST:YOURPORT/kiali/api/namespaces/graph?duration=600s&graphType=workload&appenders=&namespaces=YOURNAMESPACE

The second one should only interact with prometheus.

jshaughn commented 1 year ago

Hi @RaiAnandKr , any more feedback?

jshaughn commented 1 year ago

This is a fairly broad issue that touches on several points. Initially I'm just going to look at possibly offering a few "knobs" for controlling whether the list views optionally pull some of the information that can be time-intensive. This would at least allow users that mainly want to use the list views for "inventory", and not necessarily as a dashboard for, say, health or validations, to do so efficiently, deferring to the detail pages for more complete information.

nrfox commented 1 year ago

@RaiAnandKr it looks like you've got tracing enabled for the Kiali server. When you look at the Kiali app in the Jaeger UI, do you see any particular function calls that are taking the most time? You should be able to filter by endpoint e.g. /graph or /mesh/tls. IMO looking at Kiali's traces is one of the best ways to diagnose exactly what is causing the slowdown on the backend. The traces are not plumbed everywhere but generally they should give a good picture of what is slow. We should document this more and maybe more generally document how to identify/diagnose perf issues with Kiali.

There's been other reports of the /mesh/tls endpoint being very slow going all the way back to v1.36: https://github.com/kiali/kiali/issues/4224 so probably that area alone would be good to investigate.

With respect to the kubernetes cache, Kiali actually just recreates the entire cache any time you do an update through the UI. I'm curious if you see any slowness in Kiali loading after a write operation? I don't think adjusting the kubernetes cache options will help much with performance. The kubernetes_config.cache_duration option sets the resync interval on the informers, which might show you more or less stale kube config, but it probably wouldn't improve perf at all since the resync happens in the background.

RaiAnandKr commented 1 year ago

Hey folks, very sorry for leaving this thread abruptly without much response. I got distracted by a bigger issue w.r.t. our Istio setup and couldn't continue on this one. I will pick it up again this week and get back.

RaiAnandKr commented 1 year ago

Since @jshaughn is already working on changes to make the list views of Applications/Workloads configurable (and there is nothing more to do there, thank you @jshaughn), I will keep this focused on the delay in generating the graph.

This one has the appenders:

http://YOURHOST:YOURPORT/kiali/api/namespaces/graph?duration=600s&graphType=workload&appenders=deadNode,istio,serviceEntry,sidecarsCheck,workloadEntry,health&namespaces=YOURNAMESPACE

This one does not:

http://YOURHOST:YOURPORT/kiali/api/namespaces/graph?duration=600s&graphType=workload&appenders=&namespaces=YOURNAMESPACE

The second one should only interact with prometheus.

The one without the appenders takes 250-300ms. The one with appenders takes 2.5-3s. If we were querying k8s every time, then I could expect a delay of 2s, but if we are really using the k8s cache, should it be taking this long? I tried multiple consecutive refreshes of the one with appenders and it always takes 2.5-3s.

I looked at the traces as @nrfox suggested, to see all the API calls made while loading the graph for a namespace. While it didn't reveal anything we don't already know from the stats or just observation, it did make me even more sure about some of the slow API calls apart from the main one above:

  1. api/mesh/tls always takes 1-2s and it returns just

    {"status":"MTLS_NOT_ENABLED","autoMTLSEnabled":true,"minTLS":"N/A"}

    in my case. Do we not cache this one? We make this slow API call for every graph I try to load, for any namespace. But the response looks like something we could cache for a much longer duration if we wanted, since those global values will rarely change.

  2. api/istio/status always takes 0.5-1s and in my case returns

    [{"name":"istio-egressgateway","status":"NotFound","is_core":false},{"name":"istio-ingressgateway","status":"NotFound","is_core":true},{"name":"istiod-1-16-3-6b7f84749d-dr78k","status":"Healthy","is_core":true},{"name":"grafana","status":"Unreachable","is_core":false}]

These 2 calls and the api/namespaces/graph?... call combine to give me a load time of ~6s for generating the graph for a namespace with just 3-4 workloads.

jshaughn commented 1 year ago

@RaiAnandKr , sorry for the slow response; it doesn't mean we're not interested, just busy. We will look more closely into the calls you have identified and see if we can better understand those slow times.

On the bright side, I just merged the changes for the configurable lists, hopefully it helps with your issue. It will be in v1.67, and in the Kiali CR you will be able to configure things under spec.kiali_feature_flags.ui_defaults.list, toggling off various data/columns.
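
As a rough illustration, configuring that in the Kiali CR might look something like the sketch below; the exact field names under list are assumptions here, so check the Kiali CR reference for v1.67 for the real ones:

    spec:
      kiali_feature_flags:
        ui_defaults:
          list:
            include_health: false          # assumed knob: skip health computation in list views
            include_istio_resources: false # assumed knob
            include_validations: false     # assumed knob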

That merge closed this issue, so it would be best to continue in a new issue. Would you mind creating a new one (it can refer back to this one) with the content of your most recent comment about the slow calls?

jshaughn commented 1 year ago

@RaiAnandKr , Not sure if you are still active with Kiali but a couple of comments about the two API calls you singled out above:

api/mesh/tls always takes 1-2s and it returns just

There have been a variety of changes in the logic that may have affected/improved performance.

api/istio/status always takes 0.5-1s and in my case returns

This call can take time and takes longer if a component is actually down, or is slow to respond. A current workaround is to disable those component checks by setting in the Kiali CR:

.spec.external_services.istio.component_status.enabled: false
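
In Kiali CR YAML form, that corresponds to roughly:

    spec:
      external_services:
        istio:
          component_status:
            enabled: false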

But both of these API calls support information in the masthead and are made on any refresh, a graph update, or almost anything else. I think we can make some improvements here, but I don't think the response times of these API calls should actually be affecting your graph update; those results should arrive independently. We may actually be removing that information from the masthead in the near future and moving it to a richer "Mesh" page that details your control plane(s), clusters, etc.

If you are still using Kiali, and can try a recent version, it would be great if you could again baseline some of your perf numbers, as there have been large changes in the code, including caching, as we have restructured things for multi-cluster support. Thanks!

hhovsepy commented 4 months ago

@RaiAnandKr if you are still using Kiali, significant performance improvements have been made starting with Kiali v1.85. For more details, visit: Kiali Performance FAQ