Open bitsondatadev opened 1 month ago
Also related to the packaging improvements plans in #22597
JRE install/management automation
Another option might be a Trino build using GraalVM's native-image, so that no JVM is needed? Given that GraalVM is being used for WebAssembly, this is hopefully low-effort. By resolving reflection at build time, this would also make Trino startup almost instant, which would provide a much better local experience.
A native-image for trino-cli has other obvious benefits, such as enabling CLI deployment without a separate JRE.
How to handle port management in the proxy service?
One option might be to enable Trino to listen on a Unix socket and avoid port management entirely? Unix sockets now work on recent versions of Windows too (since 2017).
> JRE install/management automation
> Possibly another option might be a trino build using GraalVM's native-image, so that no JVM is needed? Given GraalVM is being used for Webassembly, this is hopefully low-effort. By doing reflection at build time, this will also make Trino startup almost instant which will provide a much better local experience.
> A native-image for trino-cli has other obvious benefits such as enabling cli deployment without separate JRE.
WASM itself doesn't support a lot of the Java 8 features specifically needed to run a Trino server, namely threads (shared everything?), garbage collection, and reflection types/relaxed dead code.
So until WASM comes further along, there aren't many foolproof and open solutions for this. As I mention in some of the issue notes (it's buried), TeaVM has the most potential for developing a hack in the meantime. The hack is that it re-implements the JVM classes itself, which feels both pragmatic and dangerous as an adopter until they have enough momentum. They have even implemented some portions of the Java reflection classes.
I don't think the hack will be worth doing over just downloading and managing our own JRE. The main reason I want a WebAssembly compiler option is that it would also enable us to run Trino servers in the browser. If WASM moves very slowly on all the features we would need and we all really want the in-browser option, I'd say it's something worth exploring.
Until then, local + JDK management is the way IMO.
> How to handle port management in the proxy service?
> One option might be to enable Trino to listen on a Unix socket and avoid port management entirely? Unix sockets now work on recent versions of Windows too (since 2017).
This would be something I'd want to ask @dain or @electrum about. I don't know enough about unix sockets to have an opinion, or know if there are use cases where this could bite us.
Update: We seem to use Unix sockets in the HDFS storage settings, but adding an option for Unix sockets in place of the HTTP(S) setting to connect to Trino nodes would muddy up the code for this part. I'd rather keep the code on the same path, IMO. Still, let's see if Dain/David have another opinion.
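For anyone else who hasn't worked with Unix sockets, here is a minimal Python sketch of what "listen on a socket file instead of a port" looks like. It is only an illustration of the idea under discussion, not anything Trino does: the `trino.sock` file name, the function names, and the echo behaviour are all made up for the example.

```python
import os
import socket
import tempfile
import threading


def serve_once(sock_path: str, ready: threading.Event) -> None:
    """Accept one connection on a Unix socket and echo the request back."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(sock_path)  # a filesystem path, so there is no port to pick
        server.listen(1)
        ready.set()  # tell the client the socket file now exists
        conn, _ = server.accept()
        with conn:
            conn.sendall(b"echo:" + conn.recv(1024))


def ask(sock_path: str, payload: bytes) -> bytes:
    """Connect to the socket file, send one request, and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(payload)
        return client.recv(1024)


def demo() -> bytes:
    """Run the toy server and client against a socket in a temp directory."""
    with tempfile.TemporaryDirectory() as run_dir:
        sock_path = os.path.join(run_dir, "trino.sock")  # hypothetical name
        ready = threading.Event()
        worker = threading.Thread(target=serve_once, args=(sock_path, ready))
        worker.start()
        ready.wait(timeout=5)
        reply = ask(sock_path, b"SELECT 1")
        worker.join(timeout=5)
        return reply
```

The point is that the client and server only have to agree on a file path, so two local installs cannot collide on a port number. Whether that is worth a second connection path in the code is the open question above.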
Tools like DuckDB and PySpark have had a lot of success thanks to their fast onboarding experiences. The Python and Rust communities have further simplified the ways in which you can quickly install, or even entirely skip, the installation process, making local execution and embedded tutorials easier with fewer dependencies. I'd like to propose a design to justify and develop MVP features that will make Trino much simpler and more portable, to improve the beginner, local, and onboarding Trino experience.
Proposed features
MVP features
"Native" single-node with CLI client installation
Similar to the default Docker image but without Docker required. It's also similar to Spark's standalone installation scripts. I put "native" in quotes because the experience should feel native, but will still require the next MVP feature.
Why?
A CLI that comes with a little single-node Trino cluster is lightweight and gets users to understand the value of Trino faster. This opens up opportunities for local database use cases and catches Trino up with recent, incredibly popular smaller analytics databases like DuckDB, which took off due to its simplicity, portability, and features aimed at researchers, academics, and analysts who prefer pragmatic or "friendly" SQL, integrations with Python, R, and Arrow, and, perhaps most importantly, WebAssembly to integrate into browsers. While it's safe to say we don't want to create arbitrary SQL standards, there are plenty of ways we can incorporate those features, like importing local files through [table functions](https://trino.io/docs/current/functions/table.html) and integrating well with Python, ADBC, and other libraries, while keeping to the ANSI specification.
JRE install/management automation
Why?
Rather than wasting effort on searching and depending on some other JRE with god-knows-what-settings, we should just pull down a minimal instance depending on the user's needs.
Lightweight proxy management service
Why?
`spark-submit`, which calls `spark-class`, is a common way to interface with Spark without needing upfront knowledge. However, this requires Spark to spin up and spin down on every operation unless the user already has a cluster running, and an already-running cluster is what Trino currently requires. There are options to keep the Spark session alive longer, but they're buried under Spark's abundant config and feature choices. I prefer the developer experience to be closer to DuckDB's connection, where your only choice is between in-memory and specifying the location of your database. Trino's architecture still requires some of what Spark has, so I would instead propose we aim to install a thin proxy service that hides the complexity but has a DuckDB look and feel.
V1 Features
WebAssembly (WASM) compatibility:
Note: This full version of WASM compatibility should happen in another issue, but I want to have this here to discuss some temporary workarounds that might work until WASM adds more features.
Why?
WASM enables us to run systems within the browser, as DuckDB does, and would enable a vast amount of education and docs embedding for Trino, running in most recent browsers that support WASM. This is commonly done by building Rust implementations with Python bindings, both of which have popular compilers like wasm-pack. WASM is missing certain features that Trino needs, such as threads, reflection, and garbage collection, among others, which may stifle full use of Trino on WASM. One option to consider would be a simpler code path for Trino that, if passed a certain flag, would avoid the use of certain features like threading; the single-node option of Trino doesn't actually require much. This would be something we could maintain while waiting for more capabilities of WASM to unfold.
WASM details for future issue
The WASM team offers a compiler framework tool called [binaryen](https://github.com/WebAssembly/binaryen), which aims to provide some standardization for compiler builders and is valuable to be aware of. Here is a list of current Java-to-WASM tools:

- [J2CL](https://github.com/google/j2cl) (Apache v2) is an active Google Java-to-JavaScript/WASM tool, used and primarily contributed to by the Chromium team. It uses WASM's binaryen tool and supports a lot of core Java 8 features. Some features are unsupported because of WASM's limitations, [mainly reflection](https://github.com/google/j2cl/blob/master/docs/limitations.md), but threads may work.
- [TeaVM](https://teavm.org) has an interesting and rebellious approach to compiling Java to WASM. It re-implements the [JVM classes itself](https://teavm.org/docs/runtime/java-classes.html), which feels both pragmatic and dangerous as an adopter until they have enough momentum. They have even implemented some portions of the [Java reflection classes](https://teavm.org/jcl-report/recent/packages/java.lang.reflect.html). I wonder if there would be a way to combine both methods of utilizing a fully-implemented JVM and T
- [JWebAssembly](https://github.com/i-net-software/JWebAssembly) transpiles the core Java 8 features and is under a permissive license, but it doesn't use WASM's binaryen tool and also doesn't implement reflection.
- There is [cheerpj](https://cheerpj.com/docs/overview), which seems to have a fully rewritten JVM similar to TeaVM with a lot of what we need. It is closed-source, but it probably doesn't cost money to use [(see ambiguous licensing)](https://github.com/konsoletyper/teavm/tree/master/classlib/src/main/java/org/teavm/classlib/java). Maybe the Trino Software Foundation could use it.
- The [Wasmer Java package](https://github.com/wasmerio/wasmer-java) is interesting but in a rough state, as it depends on [wasmer-jni](https://github.com/Salpadding/wasmer-jni), and neither has had development since late 2022. Plus, Wasmer is an entire framework that doesn't seem to have many Java users.

Note: You may find [GraalVM WASM](https://www.graalvm.org/latest/reference-manual/wasm/), but this is focused on embedding a WASM engine in Java, not compiling Java to WASM.
WSL2 compatibility for small Trino installs on Windows installations.
Local Developer Experience Features
UX in different environments
Core local install (Behind the scenes used by each environment)
The lifecycle will likely rely on a to check whether or not to keep the daemon running, based on how long since the last daemon ran and what the configured `daemon-ttl` is set for.

Configuration

- `trino-version`, default: `latest`
- `daemon-ttl`, default: `10m`
- `./trino-daemon`
- `<daemon-run-dir>/etc/`
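To make the `daemon-ttl` idea concrete, here is a rough Python sketch of the keep-alive decision. It assumes the TTL is measured from the last time the daemon was used and that values look like `10m`. The accepted units and the names `parse_ttl` and `should_keep_running` are hypothetical, not part of any existing Trino tooling.

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit suffixes; the proposal only shows "10m".
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_ttl(value: str) -> timedelta:
    """Parse a duration such as '10m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(f"unsupported daemon-ttl value: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def should_keep_running(last_used: datetime, now: datetime, ttl: str = "10m") -> bool:
    """Keep the daemon alive while its idle time is below the configured TTL."""
    return now - last_used < parse_ttl(ttl)
```

With the default of `10m`, a daemon last used nine minutes ago stays up, and one idle for ten minutes or more is shut down. Whatever ends up doing the check would only need to record the last-used time somewhere under the daemon run directory.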
Catalogs
Do we add some out of the box automatically? Like `memory`, `tpch`, `tpcds`, `local` - where `local` is an Iceberg catalog.
Python
Local Install
- `pip install trino[server]`
- `pip install trino[server]==426`
- `<pip-root>/site-packages/trino/`
CLI
Local Install
- `./trino --server local`
- `./trino --server local=426`
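As a rough sketch of how a client might interpret the proposed `--server` values, the Python below treats `local` as "use the default version", `local=<version>` as a pinned local install, and anything else as a regular server URL. The `ServerTarget` type and `parse_server` function are hypothetical names for illustration, and the `latest` default is taken from the `trino-version` setting above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServerTarget:
    local: bool
    version: Optional[str] = None  # only meaningful for local targets
    url: Optional[str] = None      # only meaningful for remote targets


def parse_server(value: str, default_version: str = "latest") -> ServerTarget:
    """Interpret a --server value as a local install or a remote URL."""
    if value == "local":
        return ServerTarget(local=True, version=default_version)
    if value.startswith("local="):
        version = value[len("local="):]
        if not version:
            raise ValueError("expected a version after 'local='")
        return ServerTarget(local=True, version=version)
    # Anything else is passed through untouched as a server address.
    return ServerTarget(local=False, url=value)
```

For example, `parse_server("local=426")` yields a local target pinned to version `426`, while an ordinary URL falls through to the remote case, so existing `--server` usage would keep working.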
Open questions