slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
301 stars 16 forks source link

DuckDB to StarRocks testing with TPC DS data #216

Closed alberttwong closed 3 months ago

alberttwong commented 3 months ago

Using https://atwong.medium.com/easiest-way-to-load-tpc-ds-data-into-postgresql-1ebd83871a07 to generate TPC DS data in Duckdb.

atwong@Albert-CelerData ~ % duckdb duckdb.db
v0.10.0 20b1486d11
Enter ".help" for usage hints.
D INSTALL tpcds;
D LOAD tpcds;
D SELECT * FROM dsdgen(sf=1);
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
D
D show all tables;
┌──────────┬─────────┬──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────┬───────────┐
│ database │ schema  │         name         │     column_names     │                               column_types                               │ temporary │
│ varchar  │ varchar │       varchar        │      varchar[]       │                                varchar[]                                 │  boolean  │
├──────────┼─────────┼──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────┼───────────┤
│ duckdb   │ main    │ call_center          │ [cc_call_center_sk…  │ [INTEGER, VARCHAR, DATE, DATE, INTEGER, INTEGER, VARCHAR, VARCHAR, INT…  │ false     │
│ duckdb   │ main    │ catalog_page         │ [cp_catalog_page_s…  │ [INTEGER, VARCHAR, INTEGER, INTEGER, VARCHAR, INTEGER, INTEGER, VARCHA…  │ false     │
│ duckdb   │ main    │ catalog_returns      │ [cr_returned_date_…  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ catalog_sales        │ [cs_sold_date_sk, …  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ customer             │ [c_customer_sk, c_…  │ [INTEGER, VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, VARCHA…  │ false     │
│ duckdb   │ main    │ customer_address     │ [ca_address_sk, ca…  │ [INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHA…  │ false     │
│ duckdb   │ main    │ customer_demograph…  │ [cd_demo_sk, cd_ge…  │ [INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, VARCHAR, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ date_dim             │ [d_date_sk, d_date…  │ [INTEGER, VARCHAR, DATE, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, …  │ false     │
│ duckdb   │ main    │ household_demograp…  │ [hd_demo_sk, hd_in…  │ [INTEGER, INTEGER, VARCHAR, INTEGER, INTEGER]                            │ false     │
│ duckdb   │ main    │ income_band          │ [ib_income_band_sk…  │ [INTEGER, INTEGER, INTEGER]                                              │ false     │
│ duckdb   │ main    │ inventory            │ [inv_date_sk, inv_…  │ [INTEGER, INTEGER, INTEGER, INTEGER]                                     │ false     │
│ duckdb   │ main    │ item                 │ [i_item_sk, i_item…  │ [INTEGER, VARCHAR, DATE, DATE, VARCHAR, DECIMAL(7,2), DECIMAL(7,2), IN…  │ false     │
│ duckdb   │ main    │ promotion            │ [p_promo_sk, p_pro…  │ [INTEGER, VARCHAR, INTEGER, INTEGER, INTEGER, DECIMAL(15,2), INTEGER, …  │ false     │
│ duckdb   │ main    │ reason               │ [r_reason_sk, r_re…  │ [INTEGER, VARCHAR, VARCHAR]                                              │ false     │
│ duckdb   │ main    │ ship_mode            │ [sm_ship_mode_sk, …  │ [INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR]                   │ false     │
│ duckdb   │ main    │ store                │ [s_store_sk, s_sto…  │ [INTEGER, VARCHAR, DATE, DATE, INTEGER, VARCHAR, INTEGER, INTEGER, VAR…  │ false     │
│ duckdb   │ main    │ store_returns        │ [sr_returned_date_…  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ store_sales          │ [ss_sold_date_sk, …  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ time_dim             │ [t_time_sk, t_time…  │ [INTEGER, VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, VARCHAR, VARCHA…  │ false     │
│ duckdb   │ main    │ warehouse            │ [w_warehouse_sk, w…  │ [INTEGER, VARCHAR, VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHA…  │ false     │
│ duckdb   │ main    │ web_page             │ [wp_web_page_sk, w…  │ [INTEGER, VARCHAR, DATE, DATE, INTEGER, INTEGER, VARCHAR, INTEGER, VAR…  │ false     │
│ duckdb   │ main    │ web_returns          │ [wr_returned_date_…  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ web_sales            │ [ws_sold_date_sk, …  │ [INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, INTEGE…  │ false     │
│ duckdb   │ main    │ web_site             │ [web_site_sk, web_…  │ [INTEGER, VARCHAR, DATE, DATE, VARCHAR, INTEGER, INTEGER, VARCHAR, VAR…  │ false     │
├──────────┴─────────┴──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────┴───────────┤
│ 24 rows                                                                                                                                       6 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D
sling conns set DUCKDB type=duckdb instance=/Users/atwong/file.db
docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
atwong@Albert-CelerData ~ % sling conns test DUCKDB
9:19AM INF success!
atwong@Albert-CelerData ~ % sling run --src-conn DUCKDB  --src-stream 'duckdb.main.call_center' --tgt-conn STARROCKSLOCAL --tgt-object 'testing' --mode full-refresh
9:19AM WRN Could not successfully get format values. Blank values for: object_schema
9:19AM INF connecting to source database (duckdb)
9:19AM INF connecting to target database (starrocks)
9:19AM INF reading from source database
9:19AM WRN Could not successfully get format values. Blank values for: object_schema
9:19AM INF execution failed
fatal:
~ Could not get source columns
did not find any columns for "main"."call_center". Perhaps it does not exists, or user does not have read permission.
atwong@Albert-CelerData ~ % sling conns set DUCKDB type=duckdb instance=/Users/atwong/duckdb.db
9:21AM INF connection `DUCKDB` has been set in /Users/atwong/.sling/env.yaml. Please test with `sling conns test DUCKDB`
atwong@Albert-CelerData ~ % sling conns test DUCKDB
9:21AM INF success!
atwong@Albert-CelerData ~ % sling run --src-conn DUCKDB  --src-stream 'duckdb.main.call_center' --tgt-conn STARROCKSLOCAL --tgt-object 'testing' --mode full-refresh
9:21AM WRN Could not successfully get format values. Blank values for: object_schema
9:21AM INF connecting to source database (duckdb)
9:21AM INF connecting to target database (starrocks)
9:21AM INF reading from source database
9:21AM WRN Could not successfully get format values. Blank values for: object_schema
9:21AM INF execution failed
fatal:
~ Error with StreamRows
Error: unable to open database "/Users/atwong/duckdb.db": IO Error: Could not set lock on file "/Users/atwong/duckdb.db": Conflicting lock is held in /opt/homebrew/Cellar/duckdb/0.10.0/bin/duckdb (PID 98202) by user atwong. See also https://duckdb.org/docs/connect/concurrency
alberttwong commented 3 months ago

control-d out of duckdb (exit duckdb)

atwong@Albert-CelerData ~ % sling run --src-conn DUCKDB  --src-stream 'duckdb.main.call_center' --tgt-conn STARROCKSLOCAL --tgt-object 'testing.call_center' --mode full-refresh
9:25AM INF connecting to source database (duckdb)
9:25AM INF connecting to target database (starrocks)
9:25AM INF reading from source database
9:25AM INF writing to target database [mode: full-refresh]
9:25AM INF streaming data
9:25AM INF importing into StarRocks via stream load
9:25AM WRN StarRocks redirected the API call to 'http://localhost:8040'. Please use that as your FE url.
9:25AM INF created table `testing`.`call_center`
9:25AM INF inserted 6 rows into `testing`.`call_center` in 1 secs [5 r/s]
9:25AM INF execution succeeded
flarco commented 3 months ago

So your first sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center', the duckdb file doesn't exist, correct? (even though it says success).

The second run sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center' says Could not set lock on file cause you have it open with another client

alberttwong commented 3 months ago

So your first sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center', the duckdb file doesn't exist, correct? (even though it says success).

Yes... opened an issue on this. https://github.com/slingdata-io/sling-cli/issues/217

The second run sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center' says Could not set lock on file cause you have it open with another client

Yes. I had to quit. I was documenting my experience.

alberttwong commented 3 months ago
atwong@Albert-CelerData ~ % cat replication.yaml
source: DUCKDB
target: STARROCKSLOCAL

# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: new_schema.{stream_schema}_{stream_table}

streams:
  main.*:
atwong@Albert-CelerData ~ % sling run -r ./replication.yaml
9:39AM INF Sling Replication [24 streams] | DUCKDB -> STARROCKSLOCAL

9:39AM INF [1 / 24] running stream "main"."call_center"
9:39AM INF connecting to source database (duckdb)
9:39AM INF connecting to target database (starrocks)
9:39AM INF reading from source database
9:39AM INF writing to target database [mode: full-refresh]
9:39AM INF streaming data
9:39AM INF importing into StarRocks via stream load
9:39AM WRN StarRocks redirected the API call to 'http://localhost:8040'. Please use that as your FE url.
9:39AM INF created table `new_schema`.`main_call_center`
9:39AM INF inserted 6 rows into `new_schema`.`main_call_center` in 0 secs [7 r/s]
9:39AM INF execution succeeded

9:39AM INF [2 / 24] running stream "main"."catalog_page"
9:39AM INF connecting to source database (duckdb)
9:39AM INF connecting to target database (starrocks)
9:39AM INF reading from source database

stuck on the 2nd stream.

alberttwong commented 3 months ago
StarRocks > show databases;
+--------------------+
| Database           |
+--------------------+
| _statistics_       |
| information_schema |
| new_schema         |
| sys                |
| testing            |
+--------------------+
5 rows in set (0.00 sec)

StarRocks > use new_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
StarRocks > show tables;
+----------------------+
| Tables_in_new_schema |
+----------------------+
| main_call_center     |
+----------------------+
1 row in set (0.00 sec)

StarRocks >

hmm... I have to change the replication.

alberttwong commented 3 months ago

new replication.yaml

atwong@Albert-CelerData ~ % cat replication.yaml
source: DUCKDB
target: STARROCKSLOCAL

# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: main.{stream_table}

streams:
  main.*:
alberttwong commented 3 months ago

stuck on 2nd replication. https://github.com/slingdata-io/sling-cli/issues/222

flarco commented 3 months ago

Yea I think something is off with duckdb reading, some lock.

flarco commented 3 months ago

Fixed by:

Running the following replication (with limit: 50000):

source: duckdb
target: starrocks

defaults:
  mode: full-refresh
  object: 'duckdb.{stream_table}'
  source_options:
    limit: 50000

streams:
  main.*:
2024-03-13 10:24:48 INF Sling Replication [24 streams] | duckdb -> starrocks

2024-03-13 10:24:48 INF [1 / 24] running stream "main"."call_center"
2024-03-13 10:24:48 DBG Sling version: dev (darwin arm64)
2024-03-13 10:24:48 DBG type is db-db
2024-03-13 10:24:48 DBG using source options: {"empty_as_null":true,"null_if":"NULL","datetime_format":"AUTO","max_decimals":-1,"limit":50000,"columns":{}}
2024-03-13 10:24:48 DBG using target options: {"datetime_format":"auto","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"column_casing":"source"}
2024-03-13 10:24:48 INF connecting to source database (duckdb)
2024-03-13 10:24:48 INF connecting to target database (starrocks)
2024-03-13 10:24:48 INF reading from source database
........
2024-03-13 10:25:29 INF Sling Replication Completed in 41 secs | duckdb -> starrocks | 24 Successes | 0 Failures

Closing.

alberttwong commented 3 months ago

so limit makes it so much faster than not having limit. why?

flarco commented 3 months ago

limit: 50000 limits the each stream to 50k rows.

alberttwong commented 3 months ago

I'll also try with larger limits. Ideally for starrocks it's 100mb of data which is about 8 million rows.

flarco commented 3 months ago

For sure. You can just remove the limit part, it was just to demo.

On Sun, Mar 17, 2024, 12:19 AM Albert T. Wong @.***> wrote:

I'll also try with larger limits. Ideally for starrocks it's 100mb of data which is about 8 million rows.

— Reply to this email directly, view it on GitHub https://github.com/slingdata-io/sling-cli/issues/216#issuecomment-2002291589, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB2QZYXV3BUXIFO4AYA4EODYYUDUVAVCNFSM6AAAAABESR76MOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMBSGI4TCNJYHE . You are receiving this because you modified the open/close state.Message ID: @.***>