trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0

Trino dev ex, education, and small local uses design proposal #23344

Open bitsondatadev opened 1 month ago

bitsondatadev commented 1 month ago

Tools like DuckDB and PySpark have had a lot of success thanks to their fast onboarding experiences. The Python and Rust communities have further simplified installation, or even let you skip it entirely, which makes local execution and embedded tutorials easier and reduces dependencies. I'd like to propose a design to justify and develop MVP features that make Trino much simpler and more portable, improving the beginner, local, and onboarding experience.

Proposed features

MVP features

  1. "Native" single-node with CLI client installation. Similar to the default Docker image but without requiring Docker. It's also similar to Spark's standalone installation scripts. I put "native" in quotes because the experience should feel native, but it will still require the next MVP feature.

    Why?

    Having a CLI that comes with a small single-node Trino cluster is lightweight and gets users to understand the value of Trino faster. This opens up opportunities for local database use cases and catches Trino up with recent, incredibly popular smaller analytics databases like DuckDB, which took off due to its simplicity, portability, and features aimed at researchers, academics, and analysts who prefer pragmatic or "friendly" SQL, integrations with Python, R, and Arrow, and perhaps most importantly WebAssembly support for running in browsers. While it's safe to say we don't want to create arbitrary SQL standards, there are plenty of ways to incorporate those features, such as importing local files through [table functions](https://trino.io/docs/current/functions/table.html) and integrating well with Python, ADBC, and other libraries, while keeping to the ANSI specification.

  2. JRE install/management automation

    • Similar to what PySpark does, except I think we should have a self-contained JVM that doesn't depend on the environment Trino is running in.
    • This will likely be added in the launcher.py script.
    Why?

    Rather than wasting effort on searching for and depending on some other JRE with god-knows-what settings, we should just pull down a minimal instance depending on the user's needs.
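
To make the JRE management idea more concrete, here is a minimal Python sketch of the first decision such a step would have to make: which JRE build matches the host, and where a private copy would live. The function names, the `jre/` directory layout, and the OS/architecture labels are all hypothetical, and the Java version is left as a parameter because the proposal does not fix one. Nothing is downloaded here.

```python
import platform
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical labels for the platforms this sketch supports.
_OS_NAMES = {"Linux": "linux", "Darwin": "mac"}
_ARCH_NAMES = {
    "x86_64": "x64",
    "amd64": "x64",
    "arm64": "aarch64",
    "aarch64": "aarch64",
}


def jre_target(system: Optional[str] = None,
               machine: Optional[str] = None) -> Tuple[str, str]:
    """Map the host (or the given values) to an (os, arch) pair.

    Raises RuntimeError for anything other than Linux or macOS.
    """
    system = system or platform.system()
    machine = (machine or platform.machine()).lower()
    if system not in _OS_NAMES:
        raise RuntimeError(f"unsupported operating system: {system}")
    if machine not in _ARCH_NAMES:
        raise RuntimeError(f"unsupported architecture: {machine}")
    return _OS_NAMES[system], _ARCH_NAMES[machine]


def jre_home(install_dir: Path, java_version: int,
             system: Optional[str] = None,
             machine: Optional[str] = None) -> Path:
    """Return the directory where a self-contained JRE would be unpacked."""
    os_name, arch = jre_target(system, machine)
    return install_dir / "jre" / f"{java_version}-{os_name}-{arch}"
```

Keeping the JRE under the install directory, rather than consulting `JAVA_HOME`, is what would make the JVM independent of the surrounding environment.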

  3. Lightweight proxy management service

    • Spark offers start/stop scripts, which we will want for Trino too. Like the launcher scripts, they should have minimal documentation and be written for use by clients and service utilities rather than as the intended way for users to interact with a cluster.
    • Create the config directory relative to the launcher script location and expose fewer options to lower the user burden, with great defaults.
    Why?

    spark-submit, which calls spark-class, is a common way to interface with Spark without needing upfront knowledge. This, however, requires Spark to spin up and spin down on every operation unless the user already has a cluster running, which is what Trino currently requires. There are options to keep the Spark session alive longer, but they're buried under Spark's abundant config and feature choices. I prefer the developer experience to be closer to DuckDB's connection, where your only choice is between in-memory and specifying the location of your database. Trino's architecture still requires some of what Spark has, so I would instead propose we install a thin proxy service that hides the complexity but has a DuckDB look and feel.
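
As a rough illustration of the "thin proxy" idea, the following Python sketch wraps a server process so that a `connect()` call starts it on first use and reuses it afterwards. Everything here is hypothetical: `LocalDaemon`, `connect`, and `PLACEHOLDER_COMMAND` are invented names, and the placeholder command is a sleeping Python process standing in for whatever would actually launch Trino.

```python
import subprocess
import sys
from typing import List, Optional

# Stand-in for the real server command, which this sketch does not define.
PLACEHOLDER_COMMAND = [sys.executable, "-c", "import time; time.sleep(30)"]


class LocalDaemon:
    """Hypothetical wrapper that owns one background server process."""

    def __init__(self, command: List[str]) -> None:
        self.command = command
        self._process: Optional[subprocess.Popen] = None

    def is_running(self) -> bool:
        return self._process is not None and self._process.poll() is None

    def start(self) -> bool:
        """Start the process if needed; return True only if one was spawned."""
        if self.is_running():
            return False
        self._process = subprocess.Popen(self.command)
        return True

    def stop(self, timeout: float = 10.0) -> None:
        """Terminate the process, escalating to kill if it does not exit."""
        if self._process is None:
            return
        if self._process.poll() is None:
            self._process.terminate()
            try:
                self._process.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self._process.kill()
                self._process.wait()
        self._process = None


def connect(daemon: LocalDaemon) -> LocalDaemon:
    """Entry point in the DuckDB spirit: callers never start the server themselves."""
    daemon.start()
    return daemon
```

The point of the design is that the user only ever calls `connect()`; whether a process had to be spawned is an implementation detail hidden behind it.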

V1 Features

  1. WebAssembly (WASM) compatibility:

    Note: Full WASM compatibility should be handled in another issue, but I want to list it here to discuss some temporary workarounds that might work until WASM adds more features.

    Why?

    WASM lets systems such as DuckDB run within the browser. For Trino, it would enable a vast amount of education and docs embedding, running in most recent browsers that support WASM. This is commonly done by building Rust implementations with Python bindings, both of which have popular WASM tooling, such as wasm-pack for Rust. WASM is missing certain features that Trino needs, such as threads, reflection, and garbage collection, among others, which may prevent full use of Trino on WASM. One option to consider would be a simpler code path for Trino that, if passed a certain flag, avoids features like threading, since the single-node option of Trino doesn't actually require much. This is something we could maintain while waiting for more WASM capabilities to arrive.

    WASM details for future issue

    The WASM team offers a compiler framework tool called [binaryen](https://github.com/WebAssembly/binaryen), which aims to provide some standardization for compiler builders and is valuable to be aware of. Here is a list of current Java to WASM tools:

    • [J2CL](https://github.com/google/j2cl) (Apache v2) is an active Google Java to JavaScript/WASM tool used and primarily contributed to by the Chromium team. It uses WASM's binaryen tool and supports a lot of core Java 8 features. It of course doesn't support some features because of WASM limitations, [mainly reflection](https://github.com/google/j2cl/blob/master/docs/limitations.md), but threads may work.
    • [TeaVM](https://teavm.org) has an interesting and rebellious approach to compiling Java to WASM. It re-implements the [JVM classes itself](https://teavm.org/docs/runtime/java-classes.html), which feels pragmatic but also dangerous for an adopter until the project has enough momentum. They have even implemented some portions of the [Java reflection classes](https://teavm.org/jcl-report/recent/packages/java.lang.reflect.html). I wonder if there would be a way to combine both methods of utilizing a fully-implemented JVM and T
    • [JWebAssembly](https://github.com/i-net-software/JWebAssembly) transpiles the core Java 8 features and is under a permissive license, but it doesn't use the WASM binaryen tool and also doesn't implement reflection.
    • [cheerpj](https://cheerpj.com/docs/overview) seems to have a fully rewritten JVM similar to TeaVM that covers a lot of our needs. It is closed source, but probably doesn't cost money to use [(see ambiguous licensing)](https://github.com/konsoletyper/teavm/tree/master/classlib/src/main/java/org/teavm/classlib/java). Maybe the Trino Software Foundation could use it.
    • The [Wasmer Java package](https://github.com/wasmerio/wasmer-java) is interesting but in a rough state, as it depends on [wasmer-jni](https://github.com/Salpadding/wasmer-jni), and neither has had development since late 2022. Plus, Wasmer is an entire framework that doesn't seem to have many Java users.

    Note: You may find [GraalVM WASM](https://www.graalvm.org/latest/reference-manual/wasm/), but it is focused on embedding a WASM engine in Java, not compiling Java to WASM.

  2. WSL2 compatibility for small Trino installs on Windows.

Local Developer Experience Features

  1. Dynamically CREATE, DROP, and otherwise manage catalogs without needing to restart Trino.
  2. Read/Write text/binary files from local storage as is done in trino-storage. This could also be done using table functions.
  3. Read/write locally partitioned data, perhaps using the Iceberg standard on a local filesystem. Perhaps make the default persistent storage Iceberg, with a JDBC catalog pointing to SQLite and a local filesystem for Parquet.
  4. Dynamically infer schema from column values as done in DuckDB and Spark.
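
To illustrate what inferring a schema from column values could mean in practice, here is a small Python sketch that picks one SQL type name per column from raw text values. This is not how DuckDB or Spark implement inference; the type names, the widening rule (BIGINT mixed with DOUBLE becomes DOUBLE, any other mix becomes VARCHAR), and the function names are assumptions made for the example.

```python
import re
from datetime import date
from typing import Iterable, Optional

_INT_RE = re.compile(r"[+-]?\d+")
_FLOAT_RE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _value_type(text: str) -> str:
    """Classify a single non-empty text value."""
    if text.lower() in ("true", "false"):
        return "BOOLEAN"
    if _INT_RE.fullmatch(text):
        return "BIGINT"
    if _FLOAT_RE.fullmatch(text):
        return "DOUBLE"
    if _DATE_RE.fullmatch(text):
        try:
            date.fromisoformat(text)  # rejects impossible dates like 2024-02-30
            return "DATE"
        except ValueError:
            return "VARCHAR"
    return "VARCHAR"


def infer_column_type(values: Iterable[Optional[str]]) -> str:
    """Infer one type for a column; None and empty strings count as NULL."""
    inferred: Optional[str] = None
    for value in values:
        if value is None or value == "":
            continue
        current = _value_type(value)
        if inferred is None or inferred == current:
            inferred = current
        elif {inferred, current} == {"BIGINT", "DOUBLE"}:
            inferred = "DOUBLE"  # widen integers to floating point
        else:
            return "VARCHAR"  # incompatible mix, fall back to text
    return inferred or "VARCHAR"
```

A real implementation would also have to decide how many rows to sample and how to handle locale-specific formats, which this sketch ignores.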

UX in different environments

Core local install (Behind the scenes used by each environment)

  1. Download the latest Trino binary.
  2. Untar it and output to a directory called ./trinod.
  3. Download a JRE version compatible with the Trino server on macOS or Linux, and eventually WSL2, but initially throw an error when running on Windows.
  4. Run the JRE with a maximum heap value set, and use G1GC settings to release memory back to the OS when not in use (see video).
  5. Do an eager run of Trino to finalize the installation.
  6. Run a single node as both coordinator and worker; maybe allow a local cluster at some stage if that's worthwhile as a model.
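
Steps 4 and 6 above could be sketched as generated config files. The Python below renders a `jvm.config` with a heap cap and G1 flags, and a `config.properties` for a node that acts as both coordinator and worker. The helper names, the default heap size and port, and the specific `G1PeriodicGCInterval` value are assumptions chosen for illustration, and this is not a complete Trino configuration (other files such as `node.properties` are omitted).

```python
from pathlib import Path
from typing import Dict


def jvm_config(max_heap_gb: int) -> str:
    """Render JVM flags, one per line, as expected in etc/jvm.config."""
    flags = [
        f"-Xmx{max_heap_gb}G",  # maximum heap value
        "-XX:+UseG1GC",
        # Periodic G1 collections (JDK 12+) let idle heap be returned to
        # the OS; the 30 second interval is an illustrative guess.
        "-XX:G1PeriodicGCInterval=30000",
    ]
    return "\n".join(flags) + "\n"


def config_properties(port: int) -> str:
    """Render properties for a single node that is coordinator and worker."""
    props = {
        "coordinator": "true",
        "node-scheduler.include-coordinator": "true",
        "http-server.http.port": str(port),
        "discovery.uri": f"http://localhost:{port}",
    }
    return "".join(f"{key}={value}\n" for key, value in props.items())


def write_etc(install_dir: Path, max_heap_gb: int = 2,
              port: int = 8080) -> Dict[str, Path]:
    """Write both files under <install_dir>/etc and return their paths."""
    etc = install_dir / "etc"
    etc.mkdir(parents=True, exist_ok=True)
    files = {
        "jvm.config": jvm_config(max_heap_gb),
        "config.properties": config_properties(port),
    }
    written = {}
    for name, content in files.items():
        path = etc / name
        path.write_text(content)
        written[name] = path
    return written
```

Exposing only `max_heap_gb` and `port` reflects the goal of fewer options with good defaults; anything else would stay fixed unless a user edits the files directly.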

The lifecycle will likely rely on a check of whether or not to keep the daemon running, based on how long it has been since the daemon last ran and what the configured daemon-ttl is set to.
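
The time-to-live decision described here reduces to a single comparison. A minimal Python sketch, assuming the daemon records a timestamp of its last activity and that `daemon-ttl` is available as a duration (the function name and signature are hypothetical):

```python
from datetime import datetime, timedelta


def should_keep_running(last_activity: datetime, now: datetime,
                        daemon_ttl: timedelta) -> bool:
    """Return True while the idle time is still below the configured TTL."""
    return (now - last_activity) < daemon_ttl
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test.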

Configuration

Catalogs

Do we add some out of the box automatically, like memory, tpch, tpcds, and local, where local is an Iceberg catalog?

Python

Local Install

CLI

Local Install

Open questions

mosabua commented 1 month ago

Also related to the packaging improvements plans in #22597

huw0 commented 1 month ago

JRE install/management automation

Possibly another option might be a Trino build using GraalVM's native-image, so that no JVM is needed? Given GraalVM is being used for WebAssembly, this is hopefully low-effort. By doing reflection at build time, this will also make Trino startup almost instant, which will provide a much better local experience.

A native-image for trino-cli has other obvious benefits, such as enabling CLI deployment without a separate JRE.

How to handle port management in the proxy service?

One option might be to enable Trino to listen on a Unix socket and avoid port management entirely? Unix sockets now work on recent versions of Windows too (since 2017).
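
For readers unfamiliar with the idea, this is roughly what communicating over a Unix domain socket instead of a TCP port looks like. The sketch uses plain Python sockets and a toy echo server; it says nothing about whether or how Trino's HTTP server could listen on such a socket, and the function names and the `trino.sock` file name are made up for the example. It assumes a platform where Python exposes `socket.AF_UNIX`.

```python
import os
import socket
import tempfile
import threading


def serve_once(socket_path: str, ready: threading.Event) -> None:
    """Accept one connection on a filesystem socket and echo the request."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(socket_path)  # a path on disk, no port number involved
        server.listen(1)
        ready.set()
        conn, _ = server.accept()
        with conn:
            request = conn.recv(1024)
            conn.sendall(b"echo:" + request)


def request_over_unix_socket(socket_path: str, payload: bytes) -> bytes:
    """Connect to the socket file, send a payload, and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(socket_path)
        client.sendall(payload)
        return client.recv(1024)


def demo() -> bytes:
    """Run the toy server and one client request inside a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "trino.sock")
        ready = threading.Event()
        server = threading.Thread(target=serve_once, args=(path, ready))
        server.start()
        ready.wait(timeout=5)
        reply = request_over_unix_socket(path, b"SELECT 1")
        server.join(timeout=5)
        return reply
```

Because the address is a file path, two local instances cannot collide on a port, and file permissions can restrict who may connect.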

bitsondatadev commented 1 month ago

> JRE install/management automation
>
> Possibly another option might be a Trino build using GraalVM's native-image, so that no JVM is needed? Given GraalVM is being used for WebAssembly, this is hopefully low-effort. By doing reflection at build time, this will also make Trino startup almost instant, which will provide a much better local experience.
>
> A native-image for trino-cli has other obvious benefits, such as enabling CLI deployment without a separate JRE.

WASM itself doesn't support a lot of the Java 8 features specifically needed to run a Trino server, namely threads (shared everything?), garbage collection, and reflection types/relaxed dead code.

So until WebAssembly comes further along, there aren't many foolproof and open solutions to do this. As I mention in some of the issue notes (it's buried), TeaVM has the most potential for developing a hack in the meantime. The hack is that it re-implements the JVM classes itself, which feels pragmatic but also dangerous for an adopter until they have enough momentum. They have even implemented some portions of the Java reflection classes.

I don't think it will be worth doing the hack over just downloading and managing our own JRE. The main reason I want a WebAssembly compiler option is that it would also enable us to run Trino servers in the browser. If WebAssembly moves very slowly on the features we would need and we all really want the in-browser option, I'd say it's something worth exploring.

Until then, local + JDK management is the way IMO.

> How to handle port management in the proxy service?
>
> One option might be to enable Trino to listen on a Unix socket and avoid port management entirely? Unix sockets now work on recent versions of Windows too (since 2017).

This would be something I'd want to ask @dain or @electrum about. I don't know enough about unix sockets to have an opinion, or know if there are use cases where this could bite us.

Update: We seem to use Unix domain sockets in the HDFS storage settings, but adding an option for Unix sockets in place of the HTTP(S) setting to connect to Trino nodes would muddy up the code for this part. I'd rather keep the code using the same path IMO. Still, let's see if Dain/David have another opinion.