prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.96k stars 5.35k forks source link

Presto worker aware user interface for co-ordinator. #21278

Open amitkdutta opened 11 months ago

amitkdutta commented 11 months ago

Presto coordinator UI is a great tool to debug presto cluster's health, running queries, worker status and so on. Presto now supports native worker, which acts as a drop on replacement of the java worker. It will be great if we have a slight hint in coordinator UI that tells what kind of workers corresponding presto clusters have. The change in UI can be something that is easily noticeable, but not a dramatic change. We can dynamically make the change in the UI based on worker type where java worker keep the current design, while native worker gets a slight different one.

This change will allow developers to immediately figure out what workers are running in the cluster, instead of searching specific stats in a query or seeing worker logs.

How does the coordinator knows the worker type? Currently the easiest option is if native-execution-enabled=true, the cluster is running native worker. Otherwise it's running java workers. While the procedure might change how coordinator discovers it, coordinator UI can depend on an endpoint in the coordinator that will tell what kind of worker is running. Workers can also publish there types via announcement and coordinator can keep track of it.

The main ask here is to add the UI customization based on coordinators response to worker type endpoint.

CC: @aditi-pandit @tdcmeehan

mbasmanova commented 11 months ago

Add powered by velox with a velox logo at top

I like this idea. I think we have space before or after CLUSTER OVERVIEW. @amitkdutta Would you like to prototype something and share screenshots here to help us decide?

Screenshot 2023-10-31 at 7 10 40 AM
tdcmeehan commented 11 months ago

How about adding a tag, where the tag is something that is generic? The tag could have a similar UX to Github tags that we see on this issue.

Just thinking this through, while doing something special for Velox will make it easy to differentiate native and Java clusters, the same need could be had in other instances:

"Powered by Velox" is cool and accurate for native deployments, but seemingly equally accurate could be "powered by Iceberg" for those that use Iceberg to help with transactions, indexes and time travel, or "powered by Spark" if you're using Presto on Spark to power your batch jobs. I think it makes sense to just keep it generic and let the tag be anything the person who is deploying Presto wants it to be.

spershin commented 5 months ago

Continuing this topic...

I see we might have different flavors of "Powered by". I also see that we might have more than one of these per installation. For example: we have have native workers and Iceberg, right? So, two or more icons might be needed.

I'm looking at the 'NodeVersion' and seeing that it is being used in two structures on the worker: NodeStatus and ServerInfo. I assume that one of these messages is being used by the UI to pull the info from the Coordinator as well.

The worker is also composing announcement message, which has this 'NodeVersion'.

Alonf with this field there are "String environment" and "bool coordinator".

So, how about we upgrade all these structures and add a field that would be populated by the appropriate data? It could be a simple comma-separated string or whatever is more convenient.

With this new field:

  1. Presto Java Coordinator would always add "java" to it.
  2. Native worker would always add "native" to it.
  3. Presto Java Coordinator which runs with Native Workers would always add "native" to it.
  4. As example, if Iceberg is used, Presto Java Coordinator would also add "iceberg" to it.
  5. Etc.

Then, on UI we can parse the field and add icons accrodingly.

WDYT?

@tdcmeehan @mbasmanova @amitkdutta

tdcmeehan commented 5 months ago

@spershin that would require connectors to be associated with icons, which seems like an onerous requirement for plugin authors. If we want icons in the UI, can't we just allow people to self-specify what icons they want in a resource folder available to the coordinator?

spershin commented 5 months ago

@spershin that would require connectors to be associated with icons, which seems like an onerous requirement for plugin authors. If we want icons in the UI, can't we just allow people to self-specify what icons they want in a resource folder available to the coordinator?

Icons can be optional/hardcoded, etc. We can only have icons for "java" and "native", the rest can be just a text.

I, honestly, don't want to complicate this and would simply add the bool isNative to NodeStatus and ServerInfo and then modify UI to use them. It is becoming quite important for development.

Connectors - not sure we really need to expose that at all.

Going to work on some simple prototype.

tdcmeehan commented 5 months ago

@spershin wouldn't it be even less complicated to just allow people to add a tag as suggested above? And wouldn't it be even more flexible to account for all of the ways we don't know how people use Presto, as described above?

tdcmeehan commented 5 months ago

So for example, it's important to you that you know if a cluster is Java or C++, because you're migrating lots of clusters, and there is cognitive overhead by looking this up offline.

We add a field in our server properties:

coordinator.ui.tags=adhoc,cpp

Add the tags to ClusterStatsResource. Write a little UI code to display it, and we're done. It not only solves this problem, but it also makes it clear for other scenarios I listed above, for example batch vs. interactive vs. adhoc, or data application vs. lakehouse query, etc.

spershin commented 5 months ago

@tdcmeehan

wouldn't it be even less complicated to just allow people to add a tag as suggested above? And wouldn't it be even more flexible to account for all of the ways we don't know how people use Presto, as described above?

I believe, it will be more complicated.

  1. People will have to specify the tag. Where? In the config?
  2. What if they make mistake in the config? Then the system runs java and people see 'native'. This can cause a lot of damage.
  3. What do we do with the tags?
  4. Show them in some "tags" section of UI?
  5. Do we need to take care of the multiple tags at once?

It seems to me you are proposing a completely different feature for UI, which should be done separately.

Here we are talking about a fundamental property of Presto - it is running Java engine or Velox/Native one. It will be either one or the other and the distinction can be easily incorporated on the UI.

I suggest not to mix two proposals.

tdcmeehan commented 5 months ago

@spershin

People will have to specify the tag. Where? In the config?

Yes, in the config.

What if they make mistake in the config? Then the system runs java and people see 'native'. This can cause a lot of damage.

This is the same config that specifies native or Java currently. Production deployments will almost certainly use orchestration tools and scripts to automate this.

What do we do with the tags?

Show them in the UI, which is the outcome you want.

Show them in some "tags" section of UI?

Probably in the top panel, not hard.

Do we need to take care of the multiple tags at once?

I don't understand what the question is, it's a list that you can display.

I went through your objections and disagree it's a completely different feature, I think it solves the issue as written above and also accomplishes the outcome you want. My point, once again, is there are many "fundamental properties", and I don't think you can condense that down into one value, so I think it's a better idea for people to self-specify what's important. It's definitely not a different proposal.

spershin commented 5 months ago

@tdcmeehan

You can, of course, disagree with my arguments.

The property one specifies in "coordinator.ui.tags" (that you proposed) can be different from the one we already have to determine if we are running Presto Native ("native-execution-enabled").

Having two of them detached can have confusing effect on developers and users.

Because one of them is actually controlling functionality and cannot be wrong ("native-execution-enabled") and another one is controlling the UI and can be wrong.

Keeping both in sync is the one point of failure and is a chore.

tdcmeehan commented 5 months ago

@spershin, let’s take a step back and enumerate the goal we’d like to achieve and the way we would like to reach that goal.

The goal is we want a way to differentiate the UI between various deployments of Presto. For people who deploy large fleets of Presto, this can be useful because you want a way to understand at a glance what are certain important details of the cluster without looking it up elsewhere. While we do have the version string, which may be arbitrarily differentiated, we want something more obvious in the UI.

Your proposal is to wire this into the UI and provide this information to all users of Presto. My hesitation with this proposal is there isn’t a need for this differentiation to be broadcast to the entire Presto user base—people are running in Java, and there is no public plan for deprecating the Java eval engine and adding first class support for the C++ eval engine. Before we expose lots of differentiation to users, I think it’s important to understand what the plan is for deprecation, what is the plan and timeline is for moving the community over to the new eval engine.

My counter-proposal is this: whereas it’s important for you to understand the underlying eval engine for your current cluster by cluster migration effort, this similar sort of thing actually could be useful for the community at large in a more generic way. If a user separates Presto cluster for any reason, it would be useful to get an at a glance understanding of what is the difference. I described examples above, but in a similar way that operators would want to easily know that an eval engine is different between two deployments within a fleet, they would also want to know the purpose of the cluster between two deployments within a fleet (such as interactive vs. batch), as even within the same eval engine there could be significant changes in how Presto is operated.

It seems that the objection is that by making this differentiation generic, we would need a new config flag. That is unavoidable, as only the operator of Presto can summarize the customizations to a configuration and condense it into a single word or tag that describes it. I would argue this accounts better for the diversity of usage within the broader community.

Regarding the specific concern that adding a new flag introduces room for error, I agree that this is a very big problem currently. Right now in C++, there are flags that must be set in order to use Prestissimo, such as use-alternative-function-signatures, optimizer.optimize-hash-generation, Regex library must be RE2J, decimal literals must be parsed as doubles, etc. These actually will cause failures in Prestissimo clusters if they’re misconfigured, and they’re not documented as being required for support.

I view this as being less important than the above, because the information in the tags are not critical for the operations of the cluster, and there are other sources of visibility in order to understand the cluster’s configuration. For really mission critical systems, orchestration frameworks are used which will guarantee consistency for when it’s important. As an example, consider the Helm chart for the coordinator:

presto-helm-charts/charts/presto/templates/configmap-coordinator.yaml at main · prestodb/presto-helm-charts

Keeping it in sync would be as simple as defining in the Helm chart which eval engine to use, and then writing something along these lines in the coordinator config (see last two lines):

config.properties: |-
    coordinator=true
    http-server.http.port={{ .Values.service.port }}
    discovery.uri=http://{{ .Release.Name }}-discovery:{{ .Values.service.port }}
    discovery-server.enabled={{ or (eq .Values.mode "single") (eq .Values.mode "cluster") }}
    node-scheduler.include-coordinator={{ eq .Values.mode "single" }}
    native-execution-enabled={{ .Values.nativeEval }}
    coordinator.ui.tags={{- if eq .Values.nativeEval }}"Powered by Velox"{{ end }}

I imagine there would be a similar tool for your internal deployments? Configuration management is extremely important, which is why any sophisticated deployment would be expected to check in configurations and also provide some minimal amount of orchestration to ensure correct values. While minimizing configurations is a good goal, it shouldn’t come at the cost of altering what we build significantly.

spershin commented 5 months ago

@tdcmeehan

The goal we want to achieve here is very simple: from UI to quickly identify the flavor of Presto: Native or Java. That is it. Nothing else.

Your idea to make it even broader is quite good too, I see its benefit, especially for the interactive vs batch vs something else. It can (and should) be implemented via the tag you mentioned. However, it has nothing to do with what we want to achieve here - to quickly understand the biggest differentiator of Presto installations: Native/Java.

Thus, I propose to go ahead with the separate approach for Native/Java and any possible tags folks want to imbue their clusters with.

Since you agreed this is a big problem already (meaning room for error due to need of different settings to have synchronized values), then I suggest we do not add to this problem. Saying that there is already a mess in the room and using it as excuse to add even more is not the best path forward.

You believe that potential discrepancy is not being important and we can understand the nature of the cluster by other means. I completely disagree here - the whole proposal is just to solve that exact problem and indicates that it is quite important. Moreover, our experience working with Native & Java for few years gave people this idea because it is a problem.

tdcmeehan commented 5 months ago

@spershin let's sync up in the working group meeting, or schedule a 1:1 with me, if you'd like to discuss further, my points don't seem to be coming across and I think it would be easier to discuss in person. Thanks!

spershin commented 5 months ago

@tdcmeehan

Let's discuss this on 1:1. Does slack support something like this or just a chat?