trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.39k stars 2.99k forks source link

Trino CLI reports data with wrong units #13054

Closed metadaddy closed 1 month ago

metadaddy commented 2 years ago

Using the Trino CLI to query CSV data via the Hive connector, I noticed that the CLI is reporting the amount of data transferred incorrectly. Querying a 72.8 MB CSV file, the CLI reported that it had transferred only 69.4 MB of data.

It seems that the discrepancy arises from the fact that macOS, correctly, reports file sizes in kilobytes, megabytes, gigabytes, etc, based on powers of 1,000, with the abbreviations KB, MB, GB, while the Trino CLI is reporting the number of kibibytes, mebibytes, gibibytes etc, based on powers of 1,024. but using the KB/MB/GB abbreviations rather than the correct KiB/MiB/GiB. The math confirms this - (69.4 1,024 1,024) / (1,000 * 1,000) = 72.8.

See this wikipedia page for an explanation of the correct terms to use.

The Trino CLI should either use the KiB/MiB/GiB abbreviations or, preferably, report the number of kilobytes/megabytes/gigabytes correctly.

nineinchnick commented 2 years ago

I noticed this when working on https://github.com/trinodb/trino/pull/11160 and also reported https://github.com/airlift/units/issues/26. CLI uses the https://github.com/trinodb/trino/blob/fe608f2723842037ff620d612a706900e79c52c8/testing/trino-benchmark/src/main/java/io/trino/benchmark/FormatUtils.java#L75

So this issue is about different parts of the code, but the same problem. I believe the response in the Airlift issue would also apply here.

metadaddy commented 2 years ago

@nineinchnick I'm preparing a PR specifically in the CLI - since the output is reported to the user, rather than written to logs, so the blast radius should be smaller.

nineinchnick commented 2 years ago

@dain wdyt?

mosabua commented 2 years ago

If the implementation actually lives in the CLI, we can fix it there. If it however completely lives in airlift I think we should not add some shim in the CLI, but instead fix it in airlift. In either case we can help with reviews and input though @metadaddy

Also unrelated question .. if you are using Trino .. would you be interested to be added to https://trino.io/users.html ?

metadaddy commented 2 years ago

Thanks, @mosabua - I'm submitting a PR. As far as using Trino - stay tuned...

metadaddy commented 2 years ago

Seems like there's no desire to fix this: https://github.com/trinodb/trino/pull/13095#issuecomment-1175692277

martint commented 2 years ago

These values in the CLI are only meant to be interpreted by a human, so I think it's reasonable to change the behavior if it will make it more intuitive. It would be a different discussion if we were talking about changes to values exposed via APIs or persisted in some way that needs to preserve backward compatibility.