Closed by alberttwong 3 months ago
Ctrl-D out of duckdb (exit the duckdb shell)
atwong@Albert-CelerData ~ % sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center' --tgt-conn STARROCKSLOCAL --tgt-object 'testing.call_center' --mode full-refresh
9:25AM INF connecting to source database (duckdb)
9:25AM INF connecting to target database (starrocks)
9:25AM INF reading from source database
9:25AM INF writing to target database [mode: full-refresh]
9:25AM INF streaming data
9:25AM INF importing into StarRocks via stream load
9:25AM WRN StarRocks redirected the API call to 'http://localhost:8040'. Please use that as your FE url.
9:25AM INF created table `testing`.`call_center`
9:25AM INF inserted 6 rows into `testing`.`call_center` in 1 secs [5 r/s]
9:25AM INF execution succeeded
So on your first sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center', the duckdb file doesn't exist, correct? (even though it says success).
The second run of sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center' says Could not set lock on file because you have it open with another client.
> So your first sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center', the duckdb file doesn't exist, correct? (even though it says success).

Yes... opened an issue on this: https://github.com/slingdata-io/sling-cli/issues/217
> The second run sling run --src-conn DUCKDB --src-stream 'duckdb.main.call_center' says Could not set lock on file because you have it open with another client.

Yes. I had to quit; I was documenting my experience.
atwong@Albert-CelerData ~ % cat replication.yaml
source: DUCKDB
target: STARROCKSLOCAL
# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: new_schema.{stream_schema}_{stream_table}
streams:
  main.*:
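For context, the defaults above fan out over every table matched by the main.* wildcard, and the object template names each target table. A sketch of how the template resolves (table names taken from the run output below; shown as YAML comments for illustration only):

```yaml
# object: new_schema.{stream_schema}_{stream_table}
#   stream "main"."call_center"  -> target table new_schema.main_call_center
#   stream "main"."catalog_page" -> target table new_schema.main_catalog_page
```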
atwong@Albert-CelerData ~ % sling run -r ./replication.yaml
9:39AM INF Sling Replication [24 streams] | DUCKDB -> STARROCKSLOCAL
9:39AM INF [1 / 24] running stream "main"."call_center"
9:39AM INF connecting to source database (duckdb)
9:39AM INF connecting to target database (starrocks)
9:39AM INF reading from source database
9:39AM INF writing to target database [mode: full-refresh]
9:39AM INF streaming data
9:39AM INF importing into StarRocks via stream load
9:39AM WRN StarRocks redirected the API call to 'http://localhost:8040'. Please use that as your FE url.
9:39AM INF created table `new_schema`.`main_call_center`
9:39AM INF inserted 6 rows into `new_schema`.`main_call_center` in 0 secs [7 r/s]
9:39AM INF execution succeeded
9:39AM INF [2 / 24] running stream "main"."catalog_page"
9:39AM INF connecting to source database (duckdb)
9:39AM INF connecting to target database (starrocks)
9:39AM INF reading from source database
stuck on the 2nd stream.
StarRocks > show databases;
+--------------------+
| Database |
+--------------------+
| _statistics_ |
| information_schema |
| new_schema |
| sys |
| testing |
+--------------------+
5 rows in set (0.00 sec)
StarRocks > use new_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
StarRocks > show tables;
+----------------------+
| Tables_in_new_schema |
+----------------------+
| main_call_center |
+----------------------+
1 row in set (0.00 sec)
StarRocks >
Hmm... I have to change the replication. New replication.yaml:
atwong@Albert-CelerData ~ % cat replication.yaml
source: DUCKDB
target: STARROCKSLOCAL
# default config options which apply to all streams
defaults:
  mode: full-refresh
  object: main.{stream_table}
streams:
  main.*:
stuck on 2nd replication. https://github.com/slingdata-io/sling-cli/issues/222
Yeah, I think something is off with the duckdb reading, some lock.
Fixed by:
Running the following replication (with limit: 50000):
source: duckdb
target: starrocks
defaults:
  mode: full-refresh
  object: 'duckdb.{stream_table}'
  source_options:
    limit: 50000
streams:
  main.*:
2024-03-13 10:24:48 INF Sling Replication [24 streams] | duckdb -> starrocks
2024-03-13 10:24:48 INF [1 / 24] running stream "main"."call_center"
2024-03-13 10:24:48 DBG Sling version: dev (darwin arm64)
2024-03-13 10:24:48 DBG type is db-db
2024-03-13 10:24:48 DBG using source options: {"empty_as_null":true,"null_if":"NULL","datetime_format":"AUTO","max_decimals":-1,"limit":50000,"columns":{}}
2024-03-13 10:24:48 DBG using target options: {"datetime_format":"auto","max_decimals":-1,"use_bulk":true,"add_new_columns":true,"column_casing":"source"}
2024-03-13 10:24:48 INF connecting to source database (duckdb)
2024-03-13 10:24:48 INF connecting to target database (starrocks)
2024-03-13 10:24:48 INF reading from source database
........
2024-03-13 10:25:29 INF Sling Replication Completed in 41 secs | duckdb -> starrocks | 24 Successes | 0 Failures
Closing.
So limit makes it so much faster than not having a limit. Why?
limit: 50000 limits each stream to 50k rows.
I'll also try with larger limits. Ideally for StarRocks it's 100 MB of data, which is about 8 million rows.
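A sketch of that experiment, assuming the same replication as above. The 8,000,000 figure is only the ~100 MB / ~8M-row estimate from this thread, and deleting the source_options block removes the cap entirely:

```yaml
source: duckdb
target: starrocks

defaults:
  mode: full-refresh
  object: 'duckdb.{stream_table}'
  source_options:
    limit: 8000000   # larger cap to test; remove source_options to load all rows

streams:
  main.*:
```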
For sure. You can just remove the limit part, it was just to demo.
Using https://atwong.medium.com/easiest-way-to-load-tpc-ds-data-into-postgresql-1ebd83871a07 to generate TPC-DS data in DuckDB.
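For reference, DuckDB can also generate the TPC-DS tables directly through its tpcds extension; a minimal sketch, assuming scale factor 1 (this produces the tables in schema main that the wildcard replications above pick up):

```sql
-- Inside the duckdb shell, attached to the target database file:
INSTALL tpcds;
LOAD tpcds;
CALL dsdgen(sf = 1);  -- populates the TPC-DS tables in schema main
SHOW TABLES;          -- verify the generated tables
```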