heterodb / pg-strom

PG-Strom - Master development repository
http://heterodb.github.io/pg-strom/

[vtj-jp] pg2arrow: Print worker 0's query as well during parallel execution #750

Closed sakaik closed 2 months ago

sakaik commented 2 months ago

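Summary and request

When pg2arrow runs in parallel with the -n option and --progress is specified, the per-worker information it prints does not include worker 0. This makes it hard to be confident that the query was partitioned correctly, so I would like worker 0 to be printed as well.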

$ pg2arrow -u postgres -d mydb -c 'SELECT * FROM ghevents_i WHERE id % $(N_WORKERS) = $(WORKER_ID)' -n5 -o out.arrow --progress
worker:3 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 3]
worker:4 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 4]
worker:1 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 1]
worker:2 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 2]
2024-04-17 03:10:03 RecordBatch[0]: offset=1400 length=268436680 (meta=1160, body=268435520) nitems=1533817 by worker:0
:

↑ I would like the following line to appear here as well:

worker:0 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 0]
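For illustration, the placeholder expansion that the command above relies on can be sketched as follows. This is not pg2arrow's actual implementation (which is written in C); the function name `expand_worker_sql` is hypothetical, and the sketch only assumes that `$(N_WORKERS)` and `$(WORKER_ID)` are replaced textually, with worker IDs running from 0 to N-1, as the progress output suggests.

```python
def expand_worker_sql(template: str, n_workers: int) -> list[str]:
    """Return one SQL string per worker (hypothetical helper, not pg2arrow code)."""
    queries = []
    # Worker IDs start at 0, so worker 0 gets a query like every other worker.
    for worker_id in range(n_workers):
        sql = template.replace("$(N_WORKERS)", str(n_workers))
        sql = sql.replace("$(WORKER_ID)", str(worker_id))
        queries.append(sql)
    return queries


if __name__ == "__main__":
    template = "SELECT * FROM ghevents_i WHERE id % $(N_WORKERS) = $(WORKER_ID)"
    for worker_id, sql in enumerate(expand_worker_sql(template, 5)):
        print(f"worker:{worker_id} SQL=[{sql}]")
```

With `-n5`, the five predicates `id % 5 = 0` through `id % 5 = 4` together cover every non-negative `id` exactly once, which is why a missing worker 0 line makes the partitioning hard to verify by eye.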

Proposed improvement

kaigai commented 2 months ago

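Fixed in d769913d45c8477b5c13c5a1412a0db89e0ed072. I also changed the point at which the SQL command is printed: it is now emitted immediately before the query is sent to PostgreSQL. If an error is caused by the SQL command, it is more helpful to see the query as it looks after rewriting.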

$ ./pg2arrow -d postgres -n 6 -t lineorder_mytest -o /dev/shm/hoge.arrow --progress
worker:0 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 0]
worker:5 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 5]
worker:1 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 1]
worker:3 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 3]
worker:4 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 4]
worker:2 SQL=[SELECT * FROM lineorder_mytest WHERE hashtid(ctid) % 6 = 2]
2024-04-17 15:00:06 RecordBatch[0]: offset=1688 length=268436376 (meta=920, body=268435456) nitems=1303083 by worker:0
2024-04-17 15:00:08 RecordBatch[1]: offset=268438064 length=268436376 (meta=920, body=268435456) nitems=1303083 by worker:5
2024-04-17 15:00:08 RecordBatch[2]: offset=536874440 length=268436376 (meta=920, body=268435456) nitems=1303083 by worker:1
2024-04-17 15:00:08 RecordBatch[3]: offset=805310816 length=268436376 (meta=920, body=268435456) nitems=1303083 by worker:4
2024-04-17 15:00:08 RecordBatch[4]: offset=1073747192 length=268436376 (meta=920, body=268435456) nitems=1303083 by worker:3
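The "print just before sending" ordering can be sketched as below. This is a minimal illustration under assumptions, not the code from the commit: `build_table_query` and `run_worker` are hypothetical names, the `execute` callable stands in for whatever sends the query to PostgreSQL, and writing to stderr by default is an assumption. The query shape for `-t` is copied from the progress output above.

```python
import sys
from typing import Callable, Iterable


def build_table_query(table: str, n_workers: int, worker_id: int) -> str:
    # Query shape taken from the progress output for the -t option.
    return f"SELECT * FROM {table} WHERE hashtid(ctid) % {n_workers} = {worker_id}"


def run_worker(worker_id: int, sql: str,
               execute: Callable[[str], Iterable[tuple]],
               log=sys.stderr) -> list[tuple]:
    # Log the final, rewritten SQL immediately before it is sent, so the
    # message is already visible if the query itself raises an error.
    print(f"worker:{worker_id} SQL=[{sql}]", file=log)
    return list(execute(sql))


if __name__ == "__main__":
    def failing_execute(sql: str) -> Iterable[tuple]:
        # Stand-in for a database call that fails because of the SQL.
        raise RuntimeError("query failed")

    try:
        run_worker(0, build_table_query("lineorder_mytest", 6, 0), failing_execute)
    except RuntimeError as exc:
        print(f"error: {exc}", file=sys.stderr)
```

Because the log line is written before `execute` is called, the rewritten query appears in the output even when the call fails.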

sakaik commented 2 months ago

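I confirmed that worker 0 is now printed as well. In my example no query rewriting took place, so the same SQL as before was printed, but being able to see the query that was actually executed is a nice behavior.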

$ pg2arrow -u postgres -d mydb -c 'SELECT * FROM ghevents_i WHERE id % $(N_WORKERS) = $(WORKER_ID)' -n5 -o out.arrow --progress

worker:0 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 0]
worker:1 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 1]
worker:2 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 2]
worker:3 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 3]
worker:4 SQL=[SELECT * FROM ghevents_i WHERE id % 5 = 4]
2024-04-17 06:15:50 RecordBatch[0]: offset=1400 length=268436616 (meta=1160, body=268435456) nitems=1529399 by worker:0
:
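A check like the one done by eye above could also be scripted. The following is a small sketch, assuming the `worker:N SQL=[...]` line format shown in the progress output; the helper name `missing_workers` is hypothetical and not part of pg2arrow.

```python
import re

# Matches lines such as: worker:3 SQL=[SELECT ... WHERE id % 5 = 3]
# RecordBatch lines end with "by worker:N" and are intentionally not matched.
WORKER_LINE = re.compile(r"^worker:(\d+) SQL=\[(.*)\]$")


def missing_workers(progress_output: str, n_workers: int) -> list[int]:
    """Return the worker IDs in 0..n_workers-1 that have no SQL line."""
    seen = set()
    for line in progress_output.splitlines():
        match = WORKER_LINE.match(line.strip())
        if match:
            seen.add(int(match.group(1)))
    return [w for w in range(n_workers) if w not in seen]


if __name__ == "__main__":
    output = "\n".join(
        f"worker:{w} SQL=[SELECT * FROM ghevents_i WHERE id % 5 = {w}]"
        for w in range(5)
    )
    print(missing_workers(output, 5))
```

An empty list means every worker from 0 to N-1 reported its query.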
