timeplus-io / proton

A streaming SQL engine, a fast and lightweight alternative to ksqlDB and Apache Flink, 🚀 powered by ClickHouse.
https://timeplus.com
Apache License 2.0

crash: when a materialized view writes to a stream from a random stream. #735

Open yokofly opened 5 months ago

yokofly commented 5 months ago

Describe what's wrong

2024.05.17 03:20:18.194438 [ 33 ] {cfacb0d4-a21d-4fa5-b0b7-f0d10d20ec7e} <Information> StorageMaterializedView (default.mv): Took 2 ms to wait for built background pipeline during matierialized view 'default.mv' startup
2024.05.17 03:20:18.194499 [ 333 ] {.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61} <Information> PipelineExecutor: Using 20 threads to execute pipeline for query_id=.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61
2024.05.17 03:20:18.194816 [ 333 ] {.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61} <Information> LocalFileSystemCheckpoint: Took 1 ms to checkpoint to /var/lib/proton/checkpoint/.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61/dag.ckpt, compressed_size=1144, uncompressed_size=925
2024.05.17 03:20:18.194878 [ 333 ] {.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61} <Information> LocalFileSystemCheckpoint: Took 0 ms to checkpoint to /var/lib/proton/checkpoint/.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61/query.ckpt, compressed_size=68, uncompressed_size=41
2024.05.17 03:20:18.194903 [ 333 ] {.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61} <Information> CheckpointCoordinator: Register query=.inner.query-id.from-408480d3-2064-48e7-91fe-fcfe70397a61 with 900 seconds checkpoint interval, source_node_descriptions=0-Random, ack_node_descriptions=5-EmptySink
^C2024.05.17 03:20:22.071332 [ 32 ] {} <Information> Application: Received termination signal (Interrupt)
2024.05.17 03:20:22.198153 [ 340 ] {} <Information> default.v (3ecab0be-ae17-4676-ae6d-c54eed230692): Committed sn=399 for shard=0 to local file system
2024.05.17 03:20:22.881160 [ 320 ] {} <Information> system.metric_log (7d372865-d0b9-42ae-8025-589ff5051941): Found 0 parts for disk 'default' to load
2024.05.17 03:20:22.881208 [ 312 ] {} <Information> system.query_log (967fa75b-c7d4-41b3-a980-0d1213a07348): Found 0 parts for disk 'default' to load
2024.05.17 03:20:22.881255 [ 316 ] {} <Information> system.trace_log (e6f6d424-4670-4bcf-a4e1-374861335755): Found 0 parts for disk 'default' to load
2024.05.17 03:20:22.881330 [ 313 ] {} <Information> system.part_log (2dc4f295-87fd-4cee-9db3-d313afef3309): Found 0 parts for disk 'default' to load
2024.05.17 03:20:22.881683 [ 314 ] {} <Information> system.query_thread_log (b3d80e57-2e37-41ae-823e-759f78391946): Found 0 parts for disk 'default' to load
2024.05.17 03:20:22.940554 [ 1 ] {} <Information> Application: Closed all listening sockets. Waiting for 1 outstanding connections.
2024.05.17 03:20:22.940586 [ 1 ] {} <Information> CheckpointCoordinator: Trigger last checkpoint and flush begin
2024.05.17 03:20:22.940894 [ 382 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2024.05.17 03:20:22.940927 [ 382 ] {} <Fatal> BaseDaemon: (version 1.5.8, build id: E8B09E6E8FB8A2EEEB1DFA6F88518F1D7A4B9E96, git hash: 26b4810034decd3fcb91508d672856b04ff536b1) (from thread 1) Received signal 11
2024.05.17 03:20:22.940934 [ 382 ] {} <Fatal> BaseDaemon: Signal description: Segmentation fault
2024.05.17 03:20:22.940941 [ 382 ] {} <Fatal> BaseDaemon: Address: 0x1b8 Access: read. Address not mapped to object.
2024.05.17 03:20:22.940952 [ 382 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000000168f1625 0x000000001a5e100e 0x000000001a5e34d5 0x00000000101874b8 0x0000000010181c2c 0x000000001a93cb46 0x0000000010173c79 0x000000001a94fe33 0x00000000101719fa 0x000000000afcaadd 0x00007f607b3f0083 0x000000000afca02e
2024.05.17 03:20:22.940957 [ 382 ] {} <Fatal> BaseDaemon: ########################################
2024.05.17 03:20:22.940962 [ 382 ] {} <Fatal> BaseDaemon: (version 1.5.8, build id: E8B09E6E8FB8A2EEEB1DFA6F88518F1D7A4B9E96, git hash: 26b4810034decd3fcb91508d672856b04ff536b1) (from thread 1) (no query) Received signal Segmentation fault (11)
2024.05.17 03:20:22.940966 [ 382 ] {} <Fatal> BaseDaemon: Address: 0x1b8 Access: read. Address not mapped to object.
2024.05.17 03:20:22.940970 [ 382 ] {} <Fatal> BaseDaemon: Stack trace: 0x00000000168f1625 0x000000001a5e100e 0x000000001a5e34d5 0x00000000101874b8 0x0000000010181c2c 0x000000001a93cb46 0x0000000010173c79 0x000000001a94fe33 0x00000000101719fa 0x000000000afcaadd 0x00007f607b3f0083 0x000000000afca02e
2024.05.17 03:20:22.940999 [ 382 ] {} <Fatal> BaseDaemon: 3. DB::ExecutingGraph::hasProcessedNewDataSinceLastCheckpoint() const @ 0x00000000168f1625 in /usr/bin/proton
2024.05.17 03:20:22.941010 [ 382 ] {} <Fatal> BaseDaemon: 4. DB::CheckpointCoordinator::doTriggerCheckpoint(std::__1::weak_ptr<DB::PipelineExecutor> const&, std::__1::shared_ptr<DB::CheckpointContext const>) @ 0x000000001a5e100e in /usr/bin/proton
2024.05.17 03:20:22.941017 [ 382 ] {} <Fatal> BaseDaemon: 5. DB::CheckpointCoordinator::triggerLastCheckpointAndFlush() @ 0x000000001a5e34d5 in /usr/bin/proton
2024.05.17 03:20:22.941029 [ 382 ] {} <Fatal> BaseDaemon: 6. BasicScopeGuard<DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&)::$_7>::~BasicScopeGuard() @ 0x00000000101874b8 in /usr/bin/proton
2024.05.17 03:20:22.941038 [ 382 ] {} <Fatal> BaseDaemon: 7. DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>> const&) @ 0x0000000010181c2c in /usr/bin/proton
2024.05.17 03:20:22.941052 [ 382 ] {} <Fatal> BaseDaemon: 8. Poco::Util::Application::run() @ 0x000000001a93cb46 in /usr/bin/proton
2024.05.17 03:20:22.941057 [ 382 ] {} <Fatal> BaseDaemon: 9. DB::Server::run() @ 0x0000000010173c79 in /usr/bin/proton
2024.05.17 03:20:22.941064 [ 382 ] {} <Fatal> BaseDaemon: 10. Poco::Util::ServerApplication::run(int, char**) @ 0x000000001a94fe33 in /usr/bin/proton
2024.05.17 03:20:22.941070 [ 382 ] {} <Fatal> BaseDaemon: 11. mainServer(int, char**) @ 0x00000000101719fa in /usr/bin/proton
2024.05.17 03:20:22.941079 [ 382 ] {} <Fatal> BaseDaemon: 12. main @ 0x000000000afcaadd in /usr/bin/proton
2024.05.17 03:20:22.941086 [ 382 ] {} <Fatal> BaseDaemon: 13. __libc_start_main @ 0x00007f607b3f0083 in ?
2024.05.17 03:20:22.941091 [ 382 ] {} <Fatal> BaseDaemon: 14. _start @ 0x000000000afca02e in /usr/bin/proton
2024.05.17 03:20:22.941097 [ 382 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2024.05.17 03:20:23.380744 [ 319 ] {} <Information> system.crash_log (2018f2aa-b7b8-4c9c-8a3b-dfab4207fdc0): Found 0 parts for disk 'default' to load

How to reproduce

create stream v(id int);
create random stream v_rand(id int default rand()%100);
create materialized view mv into v as select * from v_rand;

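Then shut down Proton (for example with Ctrl+C, as in the log above). The server segfaults while triggering the last checkpoint during shutdown.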

yokofly commented 5 months ago

As a temporary workaround: skip the materialized view. Inserting directly into the stream works as expected.
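A minimal sketch of the direct-insert workaround, reusing the stream `v` from the reproduction steps. The literal values are made up for illustration, and the column list is given explicitly on the assumption that other columns of the stream are filled in by the server:

```sql
-- Same target stream as in the reproduction steps.
create stream v(id int);

-- Write rows directly instead of routing them through a materialized view.
-- The values are illustrative only.
insert into v(id) values (1), (2), (3);
```

With no materialized view defined, there is no background pipeline for the checkpoint coordinator to checkpoint at shutdown, which is presumably why this path does not hit the crash.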

yokofly commented 5 months ago

Well, this crash only happens when Proton exits, because the exit triggers the last checkpoint. If we want to avoid it, we can drop the materialized view once the necessary data has been ingested, before shutting down.
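A sketch of that sequence, using the names from the reproduction steps and assuming `drop view` is the statement Proton uses to remove a materialized view (check the Proton documentation for your version):

```sql
-- Materialized view from the reproduction steps, writing into v.
create materialized view mv into v as select * from v_rand;

-- ... let it ingest the data that is needed ...

-- Drop the materialized view before stopping the server, so that no
-- materialized-view pipeline is left when the last checkpoint is triggered.
drop view mv;
```

This only sidesteps the crash. It does not fix the underlying segfault in `DB::ExecutingGraph::hasProcessedNewDataSinceLastCheckpoint()` shown in the stack trace.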