Open MarcusGDaniels opened 3 years ago
Hi,
I never faced an issue like that; I expect some inconsistencies when using floats, especially on averages, but a SELECT DISTINCT is really unlikely to give you inconsistent values through different hardware. Can you share the logs of the server with the --verbose parameter switched to true while running the queries?
Have you tried to run the query in CPU mode? You just have to add a cpu_mode hint into the query using the usual Oracle-style syntax:
SELECT /*+ cpu_mde */ DISTINCT etc.
Candido.
Is there some way to confirm it is going to the CPU (short of figuring out how to instrument the source code)? I usually see 1 or 2 percent on the GPUs from nvidia-smi, so I'm not sure if that is normal idling or if it is ignoring the setting. Btw, I assume you mean "/*+ cpu_mode */"? (Missing "o" in your example?)
yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.
There are several ways to determine on which device a query is being (or will be) executed.
The simplest and most immediate is using EXPLAIN: just type explain SELECT DISTINCT, etc.
GPU
omnisql> explain select distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m
..> ;
Explanation
IR for the GPU:
[cut]
CPU
omnisql> explain select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;
Explanation
IR for the CPU:
===============
[cut]
Probably the K80 is missing some capabilities, so the query is falling back to CPU already?
To have a better idea of where and how a query is running, I suggest you turn on the --enable-debug-timer switch and look into the INFO logs; you would get something like this.
GPU (just 1 GPU)
2021-03-23T18:30:56.385935 I 6079 4 RelAlgExecutor.cpp:239 DEBUG_TIMER thread_id(4)
70ms total duration for executeRelAlgQuery
70ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
70ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
70ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
70ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
70ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
New thread(14)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
65ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
65ms start(0ms) launchGpuCode QueryExecutionContext.cpp:218
0ms start(66ms) getRowSet QueryExecutionContext.cpp:154
0ms start(66ms) reduceMultiDeviceResults Execute.cpp:901
0ms start(66ms) reduceMultiDeviceResultSets Execute.cpp:947
End thread(14)
0ms start(70ms) collectAllDeviceResults Execute.cpp:1841
0ms start(70ms) reduceMultiDeviceResults Execute.cpp:901
0ms start(70ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-23T18:30:56.386124 I 6079 1 DBHandler.cpp:1215 DEBUG_TIMER thread_id(1)
84ms total duration for sql_execute
2021-03-23T18:30:56.386192 I 6079 1 DBHandler.cpp:1212 stdlog sql_execute 7 84 omnisci admin 914-GWtY {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;","tcp:localhost:56130","","83","84"}
CPU (4 Cores)
611ms total duration for executeRelAlgQuery
611ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
611ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
611ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
611ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
611ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
2ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
New thread(9)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
586ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
586ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
0ms start(586ms) getRowSet QueryExecutionContext.cpp:154
End thread(9)
New thread(10)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
608ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
608ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
0ms start(609ms) getRowSet QueryExecutionContext.cpp:154
End thread(10)
New thread(11)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
587ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
587ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
0ms start(587ms) getRowSet QueryExecutionContext.cpp:154
End thread(11)
New thread(12)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
321ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
321ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
0ms start(321ms) getRowSet QueryExecutionContext.cpp:154
End thread(12)
0ms start(611ms) collectAllDeviceResults Execute.cpp:1841
0ms start(611ms) reduceMultiDeviceResults Execute.cpp:901
0ms start(611ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-23T18:29:21.048137 I 6079 1 DBHandler.cpp:1215 DEBUG_TIMER thread_id(1)
627ms total duration for sql_execute
2021-03-23T18:29:21.048207 I 6079 1 DBHandler.cpp:1212 stdlog sql_execute 5 627 omnisci admin 914-GWtY {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;","tcp:localhost:56130","","626","627"}
yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.
Well, we used to have PPC builds, but we stopped because no one was asking for them; that's bad, because there are systems with NVLink-attached GPUs.
Here's how things end in the INFO log for the cpu_mode run on PPC. (It works fine on x86_64.) I guess I'm back to a debug build. I guess there is no way to get line-of-code context for the kernel in the debugger, since it is JITed?
2021-03-23T13:46:32.684623 I 100855 1034 measure.h:74 Timer start lauchCpuCode launchCpuCode: 578
2021-03-23T13:46:33.578112 1 100855 1 MapDServer.cpp:271 heartbeat thread exiting
2021-03-23T13:46:33.578246 I 100855 1 MapDServer.cpp:276 Interrupt signal (11) received.
The debug build is slower than a release one.
It looks like a failure of the LLVM-compiled code... hmm... The Intel guys introduced some performance enhancements in the LLVM compile phase, but if I remember correctly they were just switches, so I am not sure whether this could pose problems on the PPC arch.
yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.
Well, we used to have PPC builds, but we stopped because no one was asking for them; that's bad, because there are systems with NVLink-attached GPUs.
I've got access to such a system, but I'm not using it because of the network it is on and the sensitivity of my data. Would OmniSci have better performance there (e.g. with V100s with NVLink) than on an Intel-based system? (Just trying to justify the advocacy and work required to move, and whether this is worth debugging.)
The debug build is slower than a release one.
It looks a failure of llvm compiled code...hummm... Intel guys introduced some performance enhancement in the llvm compile phase, but if I remember correctly they were just switches, so I am not sure of this could pose problems on ppc arch
Here's the segfault, although the debug build didn't really help pin down the line:
0x00002100029f012c in linear_probabilistic_count () (gdb) up
(gdb)
(gdb)
this=0x2102a43e3fe0, ra_exe_unit=..., native_code=0x21002321a350, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=..., scan_limit=0,
error_code=0x2102ae48a7dc, num_tables=2, join_hash_tables=...) at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665
665 reinterpret_cast
Interesting -- you can actually put a breakpoint in that function in a debug build if you build with ENABLE_JIT_DEBUG=on. Were you able to get the QueryMemoryDescriptor output from the verbose logs?
This was the last thing in the log:
2021-03-23T16:26:21.020813 1 2093 6 Execute.cpp:2048 Query Memory Descriptor State
  Query Type: Estimator
  Keyless Hash: False
  Effective key width: 8
  Number of group columns: 0
  Col Slot Context State
    Empty
  Allow Multifrag: True
  Interleaved Bins on GPU: False
  Blocks Share Memory: False
  Threads Share Memory: True
  Uses Fast Group Values: False
  Lazy Init Groups (GPU): True
  Entry Count: 1
  Min Val (perfect hash only): 0
  Max Val (perfect hash only): 0
  Bucket Val (perfect hash only): 0
  Sort on GPU: False
  Use Streaming Top N: False
  Output Columnar: False
  Render Output: False
  Use Baseline Sort: False
2021-03-23T16:26:21.021202 1 2093 6 Execute.cpp:2110 Launching 108 kernels for query.
ok, looks like ENABLE_JIT_DEBUG is a build-time thing. will try.
The call sequence changed a bit. There are just a couple of arrays in that function: the key_bytes that is used in MurmurHash1, and the bitmap update.
(gdb) bt
this=0x21013c9ff080, ra_exe_unit=..., native_code=0x2100ac1f5b20, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=..., scan_limit=0,
error_code=0x2100fa07a7dc, num_tables=2, join_hash_tables=...) at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665
extern "C" NEVER_INLINE void linear_probabilistic_count(uint8_t* bitmap,
                                                        const uint32_t bitmap_bytes,
                                                        const uint8_t* key_bytes,
                                                        const uint32_t key_len) {
  const uint32_t bit_pos = MurmurHash1(key_bytes, key_len, 0) % (bitmap_bytes * 8);
  const uint32_t word_idx = bit_pos / 32;
  const uint32_t bit_idx = bit_pos % 32;
  reinterpret_cast<uint32_t*>(bitmap)[word_idx] |= 1 << bit_idx;
}
It seems like what is happening is impossible. Here's a modified version of linear_probabilistic_count:
extern "C" NEVER_INLINE void linear_probabilistic_count(uint8_t* bitmap,
                                                        const uint32_t bitmap_bytes,
                                                        const uint8_t* key_bytes,
                                                        const uint32_t key_len) {
  printf("bitmap: %p bitmap_bytes: %u key_len: %u\n", bitmap, bitmap_bytes, key_len);
  fflush(stdout);
  const uint32_t bit_pos = MurmurHash1(key_bytes, key_len, 0) % (bitmap_bytes * 8);
  const uint32_t word_idx = bit_pos / 32;
  const uint32_t bit_idx = bit_pos % 32;
  printf("bit_pos: %u word_idx: %u bit_idx: %u\n", bit_pos, word_idx, bit_idx);
  fflush(stdout);
  reinterpret_cast<uint32_t*>(bitmap)[word_idx] |= 1 << bit_idx;
}
It fails after apparently succeeding with the exact same parameterization as the previous call:
bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
[note no "bit_pos" line]
Is there a way to use just one thread?
I tried lldb and the addresses are different from above. It is some new bitmap base address. That said, I don't get source level symbols when I build with ENABLE_JIT_DEBUG. Should I?
Well, we used to have PPC builds, but we stopped because no one was asking for them; that's bad, because there are systems with NVLink-attached GPUs.
I've got access to such a system, but I'm not using it because of the network it is on and the sensitivity of my data. Would OmniSci have better performance there (e.g. with V100s with NVLink) than on an Intel-based system? (Just trying to justify the advocacy and work required to move, and whether this is worth debugging.)
There is a benchmark somewhere that shows better performance on the first run of queries because of the CPU-to-GPU NVLink connection (AFAIK we don't actually make use of NVLink between GPUs, but I think someone is working on it). So when data is transferred from system memory to GPU memory you will get a nice boost in performance, but if you already have everything on the GPU, nothing is going to change.
The V100s are likely faster than the GTX 1080 Ti, but not by much (20-25%, because of the higher number of shaders and the higher bandwidth).
Is there a way to use just one thread?
In CPU mode, the system allocates one thread for each fragment, so you can either re-create the table with a fragment size greater than the number of rows, or filter so that the data comes from a single fragment only. The simplest way to do that is to filter on rowid.
Assuming the fragment_size is the default 32M:
select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;
this will run just 1 thread
205ms total duration for executeRelAlgQuery
205ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
205ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
205ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
205ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
205ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
5ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
New thread(26)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
0ms start(0ms) fetchChunks Execute.cpp:2404
0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
198ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
198ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
0ms start(199ms) getRowSet QueryExecutionContext.cpp:154
End thread(26)
0ms start(205ms) collectAllDeviceResults Execute.cpp:1841
0ms start(205ms) reduceMultiDeviceResults Execute.cpp:901
0ms start(205ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-24T10:06:50.043555 I 6079 17 DBHandler.cpp:1215 DEBUG_TIMER thread_id(17)
247ms total duration for sql_execute
2021-03-24T10:06:50.043632 I 6079 17 DBHandler.cpp:1212 stdlog sql_execute 19 247 omnisci admin 549-HwiO {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;","tcp:localhost:59068","","246","247"}
On GPU, you can limit the number of threads using these parameters: --cuda-block-size=1 --cuda-grid-size=1
This would run a kernel using a single thread... this query takes 15000ms on a single GPU thread:
omnisql> select distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;
Regards, Candido
What is the relationship between rowid and fragment count in general? How do I know I tested the hypothesis that the problem is related to parallelism, and not simply avoided the problem by limiting the scope of the query? Btw, I tried taskset -c 0 but that did not help.
I do have an intermediate table where I could introduce a large fragment size.
I do have an intermediate table where I could introduce a large fragment size.
As you prefer.
What is the relationship between rowid and fragment count in general? How do I know I tested the hypothesis that the problem is related to parallelism, and not simply avoided the problem by limiting the scope of the query? Btw, I tried taskset -c 0 but that did not help.
The data in OmniSciDB is stored in fragments... they are something like partitions (or extents), and the min and max values of each field are stored, so if you run a query with a filter, those values are checked before processing the fragment. The rowid is a pseudo column that identifies a row, so rows with rowid between 0 and 31999999 will be in the first fragment if the fragment size is 32000000, rows with rowid between 32000000 and 63999999 will be in the second, and so on.
Both the "taskset -c 0" experiment and using a duplicate table with an enormous fragment size result in the same symptom:
0x00002100035b03b4 in hash_join_idx () (gdb) up
(gdb)
(gdb)
(gdb)
ra_exe_unit=..., native_code=0x2101a0085b30, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=...,
scan_limit=0, error_code=0x2100b081a7dc, num_tables=2, join_hash_tables=...)
at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665
665 reinterpret_cast
It looks like I am indeed only getting one active thread, although many are allocated but unused. Do you happen to know the last time PPC64 was tested? Perhaps I could go back to a corresponding version of LLVM to recreate that? On the other hand, I see that ENABLE_JIT_DEBUG, which I am using, disables LLVM optimization, so an esoteric optimization specific to PPC64 isn't a very plausible explanation.
I built a version of LLVM 9.0.1, which is what I see your binaries use. This fixed the crashes above on PPC64LE. However, I noticed that it has to be a Release build; otherwise this error arises:
omnisci_server: /home/mgd/src/llvm-project/llvm/lib/IR/Instructions.cpp:398: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(Args.size() == FTy->getNumParams() || (FTy->isVarArg() && Args.size() > FTy->getNumParams())) && "Calling a function with bad signature!"' failed.
I suspect this may be related to why LLVM 11 does not work. The issue is that the Args object has zero count.
As for the determinism issue, that had a straightforward explanation: OmniSci has a limit of 100000 records returned from a SELECT. When there is a lot of parallelism, it is possible to get different subsets.
Hi,
I've got a multi-GPU PPC64LE machine running version 5.5.2 (4 K80s) and an x86_64 machine that has two GPUs (2 GTX 1080 Ti). The persistent storage on the former is GPFS, and a dedicated SSD on the latter. I'm seeing varying results from the former, e.g. 229,715 in one run and then 229,587 in another. On the latter machine I see 269,252 on every run. This is with SELECT DISTINCT.
Marcus