slingdata-io / sling-cli

Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database.
https://docs.slingdata.io
GNU General Public License v3.0
398 stars 27 forks source link

Sling stalls when large file target for StarRocks #229

Closed alberttwong closed 5 months ago

alberttwong commented 6 months ago

related to https://github.com/slingdata-io/sling-cli/issues/204. It just stops loading more rows.

atwong@Albert-CelerData ~ % sling run --src-stream 'file:///Users/atwong/Downloads/user_behavior_sample_data.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'testing.user_behavior' --mode full-refresh
10:03AM INF connecting to target database (starrocks)
10:03AM INF reading from source file system (file)
10:03AM INF writing to target database [mode: full-refresh]
10:03AM INF streaming data
10:03AM INF importing into StarRocks via stream load
10:03AM WRN StarRocks redirected the API call to 'http://localhost:8040'. Please use that as your FE url.
6m47s 20,250,001 0 r/s
alberttwong commented 6 months ago

it seems to be doing 250k rows at a time.

alberttwong commented 6 months ago

it just stops.

2m47s 26,760,893 153620 r/s  | 0 B/s | 81% MEM | 27% CPU 2024-03-15 10:15:21 DBG loading /var/folders/xn/_h8gfpfs3vv_m6gkxm8t0l380000gn/T/starrocks/testing/user_behavior_tmp/2024-03-15T101233.285/part.01.0108.csv [21 MB] ds.1710522751246.ooT-0
2024-03-15 10:15:21 DBG stream-load completed for /var/folders/xn/_h8gfpfs3vv_m6gkxm8t0l380000gn/T/starrocks/testing/user_behavior_tmp/2024-03-15T101233.285/part.01.0108.csv => {
    "TxnId": 629,
    "Label": "5b7e4e91-8cd6-418c-bd6c-3524d8136fec",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 250000,
    "NumberLoadedRows": 250000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 21160895,
    "LoadTimeMs": 251,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 1,
    "ReadDataTimeMs": 96,
    "WriteDataTimeMs": 233,
    "CommitAndPublishTimeMs": 15
}
2m49s 27,083,971 154568 r/s  | 0 B/s | 81% MEM | 32% CPU 2024-03-15 10:15:23 DBG loading /var/folders/xn/_h8gfpfs3vv_m6gkxm8t0l380000gn/T/starrocks/testing/user_behavior_tmp/2024-03-15T101233.285/part.01.0109.csv [21 MB] ds.1710522751246.ooT-0
2024-03-15 10:15:23 DBG stream-load completed for /var/folders/xn/_h8gfpfs3vv_m6gkxm8t0l380000gn/T/starrocks/testing/user_behavior_tmp/2024-03-15T101233.285/part.01.0109.csv => {
    "TxnId": 630,
    "Label": "d9f284b7-b4dc-4331-be59-9bfb249655eb",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 250000,
    "NumberLoadedRows": 250000,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 21160259,
    "LoadTimeMs": 251,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 109,
    "WriteDataTimeMs": 235,
    "CommitAndPublishTimeMs": 15
}
2m50s 27,235,258 154357 r/s  | 0 B/s | 81% MEM | 33% CPU 2024-03-15 10:15:24 DBG loading /var/folders/xn/_h8gfpfs3vv_m6gkxm8t0l380000gn/T/starrocks/testing/user_behavior_tmp/2024-03-15T101233.285/part.01.0110.csv [21 MB] ds.1710522751246.ooT-0
3m44s 32,000,001 29977 r/s  | 0 B/s | 81% MEM | 13% CPU
alberttwong commented 6 months ago

everything looks fine from the starrocks side. what is weird is that the txn id is 643 but it stopped on 630 on the sling side.

2024-03-15 17:17:39,990 INFO (AutoStatistic|27) [StatisticAutoCollector.runAfterCatalogReady():70] auto collect full statistic on all databases start
2024-03-15 17:17:39,991 INFO (AutoStatistic|27) [StatisticAutoCollector.runAfterCatalogReady():88] auto collect full statistic on all databases end
2024-03-15 17:17:40,156 INFO (tablet scheduler|35) [ClusterLoadStatistic.classifyBackendByLoad():163] classify backend by load. medium: HDD, avg load score: 1.0
, low/mid/high: 0/1/0
2024-03-15 17:17:40,156 INFO (tablet scheduler|35) [TabletScheduler.updateClusterLoadStatistic():481] update cluster load statistic:
    be id: 10004, is available: true, mediums: [{medium: HDD, replica: 144, used: 2102149191, total: 141.5GB, score: 1.0},{medium: SSD, replica: 0, used: 0, tot
al: 0B, score: NaN},], paths: [{path: /data/deploy/starrocks/be/storage, path hash: -3235560536975967549, be: 10004, medium: HDD, used: 2102149191, total: 15202
2007879},]

2024-03-15 17:17:41,697 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (71ms) Module Backup - BackupHandler estimated 44.1MB of memory. Con
tains BackupOrRestoreJob with 0 object(s). 
2024-03-15 17:17:41,697 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Coordinator - QeProcessorImpl estimated 224B of memory.
 Contains QueryInfo with 0 object(s). 
2024-03-15 17:17:41,697 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Delete - DeleteMgr estimated 11.7KB of memory. Contains
 DeleteInfo with 29 object(s). 
2024-03-15 17:17:41,697 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Dict - CacheDictManager estimated 1.7KB of memory. Cont
ains ColumnDict with 0 object(s). 
2024-03-15 17:17:41,697 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Export - ExportMgr estimated 248B of memory. Contains E
xportJob with 0 object(s). 
2024-03-15 17:17:41,739 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (42ms) Module Load - LoadMgr estimated 44.2MB of memory. Contains Lo
adJob with 111 object(s). 
2024-03-15 17:17:41,739 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Load - RoutineLoadMgr estimated 680B of memory. Contain
s RoutineLoad with 0 object(s). 
2024-03-15 17:17:41,739 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Load - StreamLoadMgr estimated 440B of memory. Contains
 StreamLoad with 0 object(s). 
2024-03-15 17:17:41,739 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module LocalCatalog - InternalCatalogMemoryTracker estimated 8
1.3KB of memory. Contains Partition with 29 object(s). 
2024-03-15 17:17:41,739 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Profile - ProfileManager estimated 112B of memory. Cont
ains Profile with 0 object(s). LoadProfile with 0 object(s). 
2024-03-15 17:17:41,758 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (19ms) Module Report - ReportHandler estimated 15.1MB of memory. Con
tains PendingTask with 0 object(s). ReportQueue with 0 object(s). 
2024-03-15 17:17:41,759 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (1ms) Module Tablet - TabletInvertedIndex estimated 93.1KB of memory
. Contains TabletMeta with 144 object(s). 
2024-03-15 17:17:41,759 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (0ms) Module Task - TaskManager estimated 80B of memory. Contains Ta
sk with 0 object(s). 
2024-03-15 17:17:41,802 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (43ms) Module Task - TaskRunManager estimated 44.3MB of memory. Cont
ains PendingTaskRun with 0 object(s). RunningTaskRun with 0 object(s). HistoryTaskRun with 0 object(s). 
2024-03-15 17:17:41,835 INFO (MemoryUsageTracker|50) [MemoryUsageTracker.trackMemory():119] (33ms) Module Transaction - GlobalTransactionMgr estimated 44.3MB of
 memory. Contains Transaction with 340 object(s). 
2024-03-15 17:17:43,546 INFO (statistics meta manager|26) [StatisticExecutor.executeDML():338] statistic execute DML | QueryId [ef55262b-e2ef-11ee-b916-0242ac11
0002] | SQL: DELETE FROM table_statistic_v1 WHERE TABLE_ID = 10672
2024-03-15 17:17:43,547 INFO (statistics meta manager|26) [DatabaseTransactionMgr.beginTransaction():315] begin transaction: txn_id: 632 with label delete_78dd7
a52-5911-49ac-bb7d-e13e02a51ddb from coordinator FE: 127.0.0.1, listner id: 10682
2024-03-15 17:17:43,548 INFO (statistics meta manager|26) [TxnStateCallbackFactory.addCallback():37] add callback of txn state : 10682. current callback size: 1
2024-03-15 17:17:43,548 INFO (statistics meta manager|26) [OlapDeleteJob.run():189] countDownLatch count: 10
2024-03-15 17:17:43,548 INFO (statistics meta manager|26) [OlapDeleteJob.run():192] waiting delete Job finish, signature: 632, timeout: 30000
2024-03-15 17:17:43,550 INFO (thrift-server-pool-0|130) [PushTask.countDownLatch():218] pushTask current latch count: 9. backend: 10004, tablet:10025
2024-03-15 17:17:43,551 INFO (thrift-server-pool-5|140) [PushTask.countDownLatch():218] pushTask current latch count: 8. backend: 10004, tablet:10027
2024-03-15 17:17:43,551 INFO (thrift-server-pool-4|134) [PushTask.countDownLatch():218] pushTask current latch count: 7. backend: 10004, tablet:10023
2024-03-15 17:17:43,551 INFO (thrift-server-pool-1|131) [PushTask.countDownLatch():218] pushTask current latch count: 6. backend: 10004, tablet:10021
2024-03-15 17:17:43,551 INFO (thrift-server-pool-2|132) [PushTask.countDownLatch():218] pushTask current latch count: 5. backend: 10004, tablet:10019
2024-03-15 17:17:43,551 INFO (thrift-server-pool-0|130) [PushTask.countDownLatch():218] pushTask current latch count: 4. backend: 10004, tablet:10017
2024-03-15 17:17:43,552 INFO (thrift-server-pool-5|140) [PushTask.countDownLatch():218] pushTask current latch count: 3. backend: 10004, tablet:10015
2024-03-15 17:17:43,552 INFO (thrift-server-pool-4|134) [PushTask.countDownLatch():218] pushTask current latch count: 2. backend: 10004, tablet:10013
2024-03-15 17:17:43,552 INFO (thrift-server-pool-1|131) [PushTask.countDownLatch():218] pushTask current latch count: 1. backend: 10004, tablet:10011
2024-03-15 17:17:43,552 INFO (thrift-server-pool-2|132) [PushTask.countDownLatch():218] pushTask current latch count: 0. backend: 10004, tablet:10009
2024-03-15 17:17:43,552 INFO (statistics meta manager|26) [OlapDeleteJob.run():215] delete job finish, countDownLatch count: 0
2024-03-15 17:17:43,552 WARN (statistics meta manager|26) [OlapDeleteJob.run():227] 
2024-03-15 17:17:43,552 INFO (statistics meta manager|26) [OlapDeleteJob.checkAndUpdateQuorum():303] check delete job quorum, txn_id: 632, total tablets: 10, qu
orum tablets: 10,
2024-03-15 17:17:43,554 INFO (statistics meta manager|26) [DatabaseTransactionMgr.commitTransaction():461] transaction:[TransactionState. txn_id: 632, label: de
lete_78dd7a52-5911-49ac-bb7d-e13e02a51ddb, db id: 10005, table id list: 10007, callback id: 10682, coordinator: FE: 127.0.0.1, transaction status: COMMITTED, er
ror replicas num: 0, replica ids: , prepare time: 1710523063547, commit time: 1710523063552, finish time: -1, write cost: 5ms, reason:  attachment: com.starrock
s.transaction.InsertTxnCommitAttachment@2aa3b57b tabletCommitInfos size: 10] successfully committed
2024-03-15 17:17:43,556 INFO (PUBLISH_VERSION|24) [PublishVersionDaemon.publishVersionForOlapTable():237] send publish tasks for txn_id: 632
2024-03-15 17:17:43,568 INFO (PUBLISH_VERSION|24) [DeleteMgr.recordFinishedJob():419] record finished deleteJob, transactionId 632, dbId 10005
2024-03-15 17:17:43,568 INFO (PUBLISH_VERSION|24) [TxnStateCallbackFactory.removeCallback():44] remove callback of txn state : 10682. current callback size: 0
2024-03-15 17:17:43,568 INFO (statistics meta manager|26) [StatisticExecutor.executeDML():338] statistic execute DML | QueryId [ef58818c-e2ef-11ee-b916-0242ac11
0002] | SQL: DELETE FROM column_statistics WHERE TABLE_ID = 10672
2024-03-15 17:17:43,569 INFO (PUBLISH_VERSION|24) [DatabaseTransactionMgr.finishTransaction():1004] finish transaction TransactionState. txn_id: 632, label: del
ete_78dd7a52-5911-49ac-bb7d-e13e02a51ddb, db id: 10005, table id list: 10007, callback id: 10682, coordinator: FE: 127.0.0.1, transaction status: VISIBLE, error
 replicas num: 0, replica ids: , prepare time: 1710523063547, commit time: 1710523063552, finish time: 1710523063567, write cost: 5ms, wait for publish cost: 4m
s, publish rpc cost: 3ms, finish txn cost: 8ms, publish total cost: 15ms, total cost: 20ms, reason:  attachment: com.starrocks.transaction.InsertTxnCommitAttach
ment@2aa3b57b tabletCommitInfos size: 10 successfully
2024-03-15 17:17:43,571 INFO (statistics meta manager|26) [DatabaseTransactionMgr.beginTransaction():315] begin transaction: txn_id: 643 with label delete_ef588
18c-e2ef-11ee-b916-0242ac110002 from coordinator FE: 127.0.0.1, listner id: -1
2024-03-15 17:17:43,572 INFO (statistics meta manager|26) [QeProcessorImpl.registerQuery():103] register query id = ef58818c-e2ef-11ee-b916-0242ac110002
2024-03-15 17:17:43,572 INFO (statistics meta manager|26) [Coordinator.prepareResultSink():708] dispatch load job: ef58818c-e2ef-11ee-b916-0242ac110002 to [TNet
workAddress(hostname:127.0.0.1, port:9060)]
2024-03-15 17:17:43,605 INFO (statistics meta manager|26) [DatabaseTransactionMgr.commitTransaction():461] transaction:[TransactionState. txn_id: 643, label: de
lete_ef58818c-e2ef-11ee-b916-0242ac110002, db id: 10005, table id list: 10030, callback id: -1, coordinator: FE: 127.0.0.1, transaction status: COMMITTED, error
 replicas num: 0, replica ids: , prepare time: 1710523063571, commit time: 1710523063604, finish time: -1, write cost: 33ms, reason:  attachment: com.starrocks.
transaction.InsertTxnCommitAttachment@4c5e5f16 tabletCommitInfos size: 10] successfully committed
2024-03-15 17:17:43,613 INFO (PUBLISH_VERSION|24) [PublishVersionDaemon.publishVersionForOlapTable():237] send publish tasks for txn_id: 643
2024-03-15 17:17:43,668 INFO (statistics meta manager|26) [QeProcessorImpl.unregisterQuery():142] deregister query id = ef58818c-e2ef-11ee-b916-0242ac110002
2024-03-15 17:17:43,668 INFO (PUBLISH_VERSION|24) [DatabaseTransactionMgr.finishTransaction():1004] finish transaction TransactionState. txn_id: 643, label: del
ete_ef58818c-e2ef-11ee-b916-0242ac110002, db id: 10005, table id list: 10030, callback id: -1, coordinator: FE: 127.0.0.1, transaction status: VISIBLE, error re
plicas num: 0, replica ids: , prepare time: 1710523063571, commit time: 1710523063604, finish time: 1710523063667, write cost: 33ms, wait for publish cost: 9ms,
 publish rpc cost: 48ms, finish txn cost: 6ms, publish total cost: 63ms, total cost: 96ms, reason:  attachment: com.starrocks.transaction.InsertTxnCommitAttachm
ent@4c5e5f16 tabletCommitInfos size: 10 successfully
2024-03-15 17:17:45,494 INFO (ReportHandler|120) [ReportHandler.tabletReport():404] backend[10004] reports 144 tablet(s). report version: 17105215500450
2024-03-15 17:17:45,495 INFO (ReportHandler|120) [TabletInvertedIndex.tabletReport():303] finished to do tablet diff with backend[10004]. sync: 0. metaDel: 0. f
oundValid: 144. foundInvalid: 0. migration: 0. found invalid transactions 0. found republish transactions 0  cost: 0 ms
alberttwong commented 6 months ago

just ran it again... so eventually sling's updates stop and then updates in in SR stop. So there is a lag on sling's console.

flarco commented 6 months ago

weird, will look closer, thanks

flarco commented 6 months ago

Hey, did some testing on, and I believe this is happening on the SR side. Possibly a bug in the STREAM LOAD interface?

See attached sr_test.zip file with 2 files: README.md and strea_load.sh.

Basically, we're having sling convert the parquet file into many CSV files, and then attempting to stream load via curl commands. It's getting stuck as well...

alberttwong commented 5 months ago

It looks like we have to pass another parameter to starrocks stream load. https://github.com/StarRocks/starrocks/issues/43359#issuecomment-2033694374

flarco commented 5 months ago

Thanks, will add. Although, won't the timeout in the Header fail the job (if reached)? I'll try some test.

flarco commented 5 months ago

v1.2.3 released with changes

flarco commented 5 months ago

Closing this, feel free to re-open.