MongoDB connector no parallelism

trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

https://trino.io

Apache License 2.0

9.88k stars 2.86k forks source link

MongoDB connector no parallelism #4077

Open abhisheksahani opened 4 years ago

abhisheksahani commented 4 years ago

we tested querying mongo collection via Trino and we observe that only a single worker is getting the data from Mongo collection and applying operators aggregation.

Can't we achieve parallelism via Trino Mongo connector?.

findepi commented 4 years ago

Correct. currently the Mongo Connector is "single-threaded", it creates single "split" to read data.

For anyone wanting to improve this, the code is here https://github.com/starburstdata/presto/blob/d37eddd6c20e9c285324cbd6532a33cc52db6425/presto-mongodb/src/main/java/io/prestosql/plugin/mongodb/MongoSplitManager.java#L50

academy-codex commented 2 years ago

@findepi any reason you pointed to starburst fork? The implementation is same as current Trino code. Both have a single split if I’m not wrong.

Just asking as I want to pick this up next and was looking for some idea around. Thanks.

findepi commented 2 years ago

any reason you pointed to starburst fork?

that was a mistake

ChandrasekharPo-Kore commented 1 year ago

We have similar problem. Since all our data was time series, as a work around , we split our fetch query into multiple sub queries and joined them with "union all". In majority of the cases we pushed the aggregates to these sub queries.

Posting it here to understand how other teams are handling this limitation. cc @findepi @abhisheksahani @academy-codex @joshk

abhisheksahani commented 1 year ago

@ChandrasekharPo-Kore Currently I am dumping mongo data to the presto in-memory table and every 20 minutes we are appending the in-memory table with delta/latest-updated data from mongo. We are running queries over the in-memory table.

ChandrasekharPo-Kore commented 1 year ago

@abhisheksahani our datasize is huge, i doubt in memory is an option. Any improvements made ?