trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Expose physicalInputDataSize in Presto CLI/JDBC client #4863

Open hashhar opened 4 years ago

hashhar commented 4 years ago

The WebUI shows both physicalInputDataSize and inputDataSize today.

This issue is to add similar functionality to the Presto CLI (and the JDBC QueryStats/StageStats classes too).

~As part of this change we can also explore whether inputDataSize makes sense to expose to clients, given its limited usefulness.~

cc: @sopel39 @electrum

sopel39 commented 4 years ago

inputDataSize is still useful, as it is the basis for CBO decisions.

Damans227 commented 2 years ago

Hi @hashhar, I started working on this issue. Just to be sure, the queries highlighted in the screenshot below are the ones we want to show via the CLI too?

[Screenshot: Screen Shot 2021-10-27 at 8 58 44 PM]

I am currently reviewing the frontend ReactJS files, like core/trino-main/src/main/resources/webapp/src/components/QueryDetail.jsx, to figure out where the property data is coming from. My guess is that the web UI is backed by some kind of API endpoint that supplies these properties.

Do you know what the source of data for the CLI would be? I have yet to explore the trino/client/ dir. When you have a minute, would you be able to share the Java file names I should start with to approach this issue? I am trying to understand how properties make it to the CLI in the first place, so any help to get me started quickly would be appreciated!

Thanks!

hashhar commented 2 years ago

@Damans227 Thanks for picking this up.

To get started you should take a look at the StatusPrinter class in the Trino CLI. That class accesses data from a few "models" like StatementStats, QueryStats and StageStats (from io.trino.client).

You can see who is responsible for adding data to these models by looking at the callers of their constructors (which are limited in number so shouldn't be too hard).

For the JDBC driver there are matching classes (StatementStats, QueryStats and StageStats) in io.trino.jdbc which are exposed to users of the JDBC client as methods. So for JDBC it should be good enough to ensure that information from the io.trino.client version of these classes ends up inside the io.trino.jdbc version of these classes too.
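To illustrate the JDBC part, here is a minimal sketch of copying a new field from a client-side model into its JDBC-facing counterpart. The classes ClientStats and JdbcStats are hypothetical stand-ins, not the real io.trino.client or io.trino.jdbc classes, and the field name physicalInputBytes is an assumption made for illustration only.

```java
public final class StatsMappingSketch {
    // Hypothetical stand-in for a client-side model (not the real io.trino.client class).
    static final class ClientStats {
        private final long processedBytes;
        private final long physicalInputBytes; // assumed name for the new field

        ClientStats(long processedBytes, long physicalInputBytes) {
            this.processedBytes = processedBytes;
            this.physicalInputBytes = physicalInputBytes;
        }

        long getProcessedBytes() { return processedBytes; }
        long getPhysicalInputBytes() { return physicalInputBytes; }
    }

    // Hypothetical stand-in for the JDBC-facing model (not the real io.trino.jdbc class).
    static final class JdbcStats {
        private final long processedBytes;
        private final long physicalInputBytes;

        private JdbcStats(long processedBytes, long physicalInputBytes) {
            this.processedBytes = processedBytes;
            this.physicalInputBytes = physicalInputBytes;
        }

        // Copies every value, including the new one, from the client-side model.
        static JdbcStats create(ClientStats stats) {
            return new JdbcStats(stats.getProcessedBytes(), stats.getPhysicalInputBytes());
        }

        long getProcessedBytes() { return processedBytes; }
        long getPhysicalInputBytes() { return physicalInputBytes; }
    }

    public static void main(String[] args) {
        ClientStats client = new ClientStats(2048L, 512L);
        JdbcStats jdbc = JdbcStats.create(client);
        System.out.println(jdbc.getProcessedBytes() + " " + jdbc.getPhysicalInputBytes());
    }
}
```

The point is only the shape of the change: the new value has to be carried through the factory method so it ends up on the class that JDBC users see.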

Feel free to ask more if you have questions, either here or on Slack at #dev.

Damans227 commented 2 years ago

This is very helpful. Thanks as always!

Damans227 commented 2 years ago

@hashhar Hi! I tried wrapping my head around the StatusPrinter class today. It seems this class prints the query info at two different stages of query execution: RUNNING and FINISHED. Another key finding was that detailed query information, like the example below, is only printed when the --debug flag is set on the trino command:


Query 20211103_011509_00016_k8hi7, FINISHED, 1 node
http://localhost:8080/ui/query.html?20211103_011509_00016_k8hi7
Splits: 20 total, 20 done (100.00%)
CPU Time: 0.0s total, 6.25K rows/s,     0B/s, 80% active
Per Node: 0.0 parallelism,   101 rows/s,     0B/s
Parallelism: 0.0
Peak Memory: 0B
0.25 [25 rows, 0B] [101 rows/s, 0B/s]

So before I dive deeper, it would be helpful to know when we want to expose the physicalInputDataSize property. Is it during the FINISHED state or the RUNNING state? Also, do we want to show it only when --debug is set?

Thanks!

hashhar commented 2 years ago

When I created this issue the intent was to print it alongside processedBytes, i.e. whenever processedBytes is shown, physicalInputDataSize should also be shown. IIRC they get printed both with and without debug, during progress and at the end of the query too.

I believe you can search for processedBytes in the StatusPrinter class to find all places where it gets printed.

cc: @electrum In the CLI should we:

  1. Replace the processed bytes (and rate) with physical input data size?
  2. Add physical input data size alongside processed bytes (and rate)?
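For reference, option 2 could look roughly like the sketch below. This is not the actual StatusPrinter code: the class StatusLineSketch, the helper formatBytes, the method summaryLine and the output layout are all hypothetical, chosen only to show both values side by side.

```java
import java.util.Locale;

public final class StatusLineSketch {
    private static final String[] UNITS = {"B", "kB", "MB", "GB", "TB"};

    // Hypothetical helper: renders a byte count with a binary (1024) unit step.
    static String formatBytes(long bytes) {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        if (unit == 0) {
            return bytes + "B";
        }
        return String.format(Locale.ROOT, "%.2f%s", value, UNITS[unit]);
    }

    // Hypothetical layout: processed bytes and physical input shown together.
    static String summaryLine(long rows, long processedBytes, long physicalInputBytes) {
        return String.format(
                Locale.ROOT,
                "[%d rows, %s processed, %s physical input]",
                rows,
                formatBytes(processedBytes),
                formatBytes(physicalInputBytes));
    }

    public static void main(String[] args) {
        System.out.println(summaryLine(25L, 1536L, 0L));
    }
}
```

Option 1 would be the same line with the processed value dropped, so the choice mostly affects how wide the status line gets.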

@Damans227 I think it'd be smarter to start with the JDBC driver change since there are no such decisions to be made there. Sorry for not anticipating it earlier.