trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.49k stars 3.02k forks source link

User-Defined JSON Schemas for MongoDB Collections #24106

Closed ErikTheBerik closed 1 week ago

ErikTheBerik commented 1 week ago

I am working with Trino to query data from a MongoDB database, and I’ve encountered some challenges around schema handling.

Issue

The documents in my MongoDB collections are complex, with many possible document types within a collection. Currently, Trino appears to generate schemas by inspecting the first document it encounters. This approach results in incomplete or incorrect schemas because Trino.

I already have schema definitions for these collections written in TypeScript, using Typebox. I can convert these Typebox definitions into JSON Schemas, but currently, from what I've found, there doesn’t seem to be a way to feed these JSON Schemas into Trinos MongoDB connector to ensure consistent schema interpretation.

Potential Feature Request

Would it be possible to allow user-defined JSON Schemas to be applied to MongoDB collections within Trino? Or is there an existing approach to the problem I'm facing? Ideally, we could supply a JSON Schema configuration that Trino would use to parse each MongoDB collection, bypassing the automatic schema inference or enhancing it. This feature would greatly simplify working with diverse document structures and eliminate the need for extensive manual schema correction.

ebyhr commented 1 week ago

You can directly modify _schema collection likes https://trino.io/docs/current/connector/mongodb.html#table-definition when the schema inference doesn't work well.

I would recommend writing a small script for translating the Typebox definitions into MongoDB connector definitions.

Habeeb556 commented 1 week ago

Hello @ebyhr,

Thank you for your help with this! I have read the documentation you referenced on Trino's MongoDB type mapping and am trying to adjust the mapping for ObjectId fields to be read directly as VARCHAR. As suggested, I'd like the _id field (and other ObjectIds) to render as readable values without needing a cast to VARCHAR, making it more compatible with Superset, which I’m using for data visualization.

I’m integrating Superset to visualize data from Trino, and here’s how it reads the data types:

I’ve managed to get it working by manually updating the schema:

db.getCollection("_schema").updateMany(
  { "fields.type": { $in: ["ObjectId", "array(ObjectId)"] } }, 
  { $set: { "fields.$[elem].type": "varchar" } }, 
  { arrayFilters: [{ "elem.type": { $in: ["ObjectId", "array(ObjectId)"] } }] }
);

This works fine, but is there an automated way to apply this conversion? I tried adding the parameter mongodb.object-id-as-varchar=true in mongodb.properties, but it seems invalid.

ebyhr commented 1 week ago

is there an automated way to apply this conversion?

Nope.