slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
299 stars 16 forks source link

Trino SQL queries timeout #330

Open Evicence opened 3 days ago

Evicence commented 3 days ago

Issue Description

Trying to move data from Trino to Postgres using Python Sling library with the following code:

from sling import Sling
import os

os.environ["TRINO"] = '{ type: trino, http_url: "https://user:pass@host:443?catalog=iceberg" }'

config = {
  'source': {
    'conn': 'TRINO',
    'stream': """
      select * from iceberg.data.opinions
    """,
  },
  'target': {
    'conn': 'postgresql://user:pass@host:port/name',
    'object':  'tmp.sling'
  },
  'mode': 'full-refresh'
}

Sling(**config).run()

After executing the code, I saw a notification that execution succeeded. But when I checked the amount of transferred data, I saw that the amount of data in the source does not correspond to the amount of data in the target.

I began to investigate why this is happening and found in the Trino logs that the SQL queries sent by Sling to Trino are executed for exactly 1 minute, after which they go to the USER CANCELED state. After the query goes into the USER CANCELED state, Sling transfers only that piece of data that it managed to receive during that minute.

Considering that in my Trino claster the setting query.max-run-time has a value of 100 days, and the state of SQL queries is USER CANCELED, I assume that there is some problem with the timeout on the Sling side.

I also noticed that if I run the code and during its execution I kill SQL Query through the Trino UI, then Sling will transfer only the amount of data that it managed to receive during the Query execution, after which it reports on successful execution, although the query that was getting the data was killed.

To reproduce the problem locally, you need to complete two steps:

  1. Download and Run this trino-docker-compose.zip (This docker compose contains 2 services Trino and Postgres.)
  2. Execute the following Python code:
    
    from sling import Sling
    import os

os.environ["TRINO"] = '{ type: trino, http_url: "http://admin@localhost:8080?catalog=postgres" }'

config = { 'source': { 'conn': 'TRINO', 'stream': """ select * from table(postgres.system.query(query=>' select md5(random()::text) from generate_series(1, 10000000) -- generating 10 million rows here, generate more if needed ')) """, }, 'target': { 'conn': 'postgresql://postgres:postgres@localhost:5433/postgres?sslmode=disable', 'object': 'public.sling' }, 'mode': 'full-refresh' }

Sling(**config).run()



If you are not familiar with Trino, it has UI where you can view SQL queries and information about them, this may be useful.
The UI can be accessed at - http://localhost:8080/ui (Username - `admin`)
To see all SQL queries in the UI, make sure that you have the `Finished` and `Failed -> User Error` states checked in filter.