heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0

determinism and SELECT DISTINCT #636

Open MarcusGDaniels opened 3 years ago

MarcusGDaniels commented 3 years ago

Hi,

I've got a multi-GPU PPC64LE machine running version 5.5.2 (4 K80s) and an x86_64 machine with two GPUs (2 GTX 1080 Ti). The persistent storage on the former is GPFS, and a dedicated SSD on the latter. I'm seeing varying results from the former, e.g. 229,715 in one run and then 229,587 in another. On the latter machine I see 269,252 on every run. This is SELECT DISTINCT.

Marcus

cdessanti commented 3 years ago

Hi,

I never faced an issue like that; I expect some inconsistencies when using floats, especially on averages, but a SELECT DISTINCT is really unlikely to give you inconsistent values across different hardware. Can you share the logs of the server with the --verbose parameter switched to true while running the queries?

Have you tried to run the query in CPU mode? You just have to add a cpu_mode hint to the query using the usual Oracle-style syntax:

SELECT /*+ cpu_mde */ DISTINCT etc.

Candido.

MarcusGDaniels commented 3 years ago

Is there some way to confirm it is going to the CPU (short of figuring out how to instrument the source code)? I usually see 1 or 2 percent on the GPUs from nvidia-smi, so I'm not sure if that is normal idling or it ignoring the setting. Btw, I assume you mean "/*+ cpu_mode */"? (Missing "o" in your example?)

MarcusGDaniels commented 3 years ago

Yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.

cdessanti commented 3 years ago

There are several ways to determine on which device a query is being (or will be) executed.

The simplest and most immediate is EXPLAIN: prefix the query with explain (e.g. explain SELECT DISTINCT ...) and the output tells you which device the IR was generated for.

GPU

omnisql> explain select distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m
..> ;
Explanation
IR for the GPU:
[cut]

CPU

omnisql> explain select /*+ cpu_mode */  distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;
Explanation
IR for the CPU:
===============
[cut]

Probably the K80 is missing some capabilities, so the query is already falling back to CPU?

To get a better idea of where and how a query is running, I suggest you turn on the --enable-debug-timer switch and look into the INFO logs; you would get something like the following.

GPU (just 1 GPU)

2021-03-23T18:30:56.385935 I 6079 4 RelAlgExecutor.cpp:239 DEBUG_TIMER thread_id(4)
70ms total duration for executeRelAlgQuery
  70ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
    70ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
      70ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
        70ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
          70ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
            3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
              New thread(14)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                65ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  65ms start(0ms) launchGpuCode QueryExecutionContext.cpp:218
                  0ms start(66ms) getRowSet QueryExecutionContext.cpp:154
                    0ms start(66ms) reduceMultiDeviceResults Execute.cpp:901
                      0ms start(66ms) reduceMultiDeviceResultSets Execute.cpp:947
              End thread(14)
            0ms start(70ms) collectAllDeviceResults Execute.cpp:1841
              0ms start(70ms) reduceMultiDeviceResults Execute.cpp:901
                0ms start(70ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-23T18:30:56.386124 I 6079 1 DBHandler.cpp:1215 DEBUG_TIMER thread_id(1)
84ms total duration for sql_execute
2021-03-23T18:30:56.386192 I 6079 1 DBHandler.cpp:1212 stdlog sql_execute 7 84 omnisci admin 914-GWtY {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select  distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;","tcp:localhost:56130","","83","84"}

CPU (4 Cores)

611ms total duration for executeRelAlgQuery
  611ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
    611ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
      611ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
        611ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
          611ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
            2ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
              New thread(9)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                586ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  586ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
                  0ms start(586ms) getRowSet QueryExecutionContext.cpp:154
              End thread(9)
              New thread(10)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                608ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  608ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
                  0ms start(609ms) getRowSet QueryExecutionContext.cpp:154
              End thread(10)
              New thread(11)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                587ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  587ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
                  0ms start(587ms) getRowSet QueryExecutionContext.cpp:154
              End thread(11)
              New thread(12)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                321ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  321ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
                  0ms start(321ms) getRowSet QueryExecutionContext.cpp:154
              End thread(12)
            0ms start(611ms) collectAllDeviceResults Execute.cpp:1841
              0ms start(611ms) reduceMultiDeviceResults Execute.cpp:901
                0ms start(611ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-23T18:29:21.048137 I 6079 1 DBHandler.cpp:1215 DEBUG_TIMER thread_id(1)
627ms total duration for sql_execute
2021-03-23T18:29:21.048207 I 6079 1 DBHandler.cpp:1212 stdlog sql_execute 5 627 omnisci admin 914-GWtY {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select /*+ cpu_mode */  distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m ;","tcp:localhost:56130","","626","627"}

cdessanti commented 3 years ago

Yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.

Well, we used to have builds for PPC, but we stopped because no one was asking for them; that's bad because there are systems with NVLink-attached GPUs.

MarcusGDaniels commented 3 years ago

Here's how things end in the INFO log for cpu_mode on PPC. (It works fine on x86_64.) I guess I'm back to a debug build. I guess there is no way to get source-line context for the kernel in the debugger, since it is JITed?

2021-03-23T13:46:32.684623 I 100855 1034 measure.h:74 Timer start lauchCpuCode launchCpuCode: 578
2021-03-23T13:46:33.578112 1 100855 1 MapDServer.cpp:271 heartbeat thread exiting
2021-03-23T13:46:33.578246 I 100855 1 MapDServer.cpp:276 Interrupt signal (11) received.

cdessanti commented 3 years ago

The debug build is slower than a release one.

It looks like a failure of LLVM-compiled code... hmm... the Intel folks introduced some performance enhancements in the LLVM compile phase, but if I remember correctly they were just switches, so I am not sure whether this could pose problems on the PPC arch.

MarcusGDaniels commented 3 years ago

Yeah, I'm back to segfaults with /*+ cpu_mode */. I had gone to some trouble to get the CUDA build to run on this PPC system, mainly because the non-CUDA build was segfaulting. The master (as opposed to 5.5.2) segfaults with or without CUDA.

Well, we used to have builds for PPC, but we stopped because no one was asking for them; that's bad because there are systems with NVLink-attached GPUs.

I've got access to such a system but I'm not using it because of the network it is on, and the sensitivity of my data. Would Omnisci have better performance there (e.g. with V100s with nvlink) than on an Intel-based system? (Just trying to justify the advocacy and work required to move, and whether this is worth debugging.)

MarcusGDaniels commented 3 years ago

The debug build is slower than a release one.

It looks like a failure of LLVM-compiled code... hmm... the Intel folks introduced some performance enhancements in the LLVM compile phase, but if I remember correctly they were just switches, so I am not sure whether this could pose problems on the PPC arch.

Here's the segfault, although the debugging build didn't really help pin down the line:

#0  0x00002100029f012c in linear_probabilistic_count ()
(gdb) up
#1  0x00002100029f04e0 in query_template ()
(gdb) up
#2  0x00002100029f0230 in multifrag_query_hoisted_literals ()
(gdb) up
#3  0x00000000125b4020 in QueryExecutionContext::launchCpuCode(RelAlgExecutionUnit const&, CpuCompilationContext const*, bool, std::vector<signed char, std::allocator<signed char> > const&, std::vector<std::vector<signed char const*, std::allocator<signed char const*> >, std::allocator<std::vector<signed char const*, std::allocator<signed char const*> > > >, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, int, int*, unsigned int, std::vector<long, std::allocator<long> > const&) (
    this=0x2102a43e3fe0, ra_exe_unit=..., native_code=0x21002321a350, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=..., scan_limit=0,
    error_code=0x2102ae48a7dc, num_tables=2, join_hash_tables=...) at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665
665         reinterpret_cast(native_code->func())(multifrag_cols_ptr,

alexbaden commented 3 years ago

Interesting -- you can actually put a breakpoint in that function in a debug build if you build w/ ENABLE_JIT_DEBUG=on. Were you able to get the QueryMemoryDescriptor output from the verbose logs?

MarcusGDaniels commented 3 years ago

This was the last thing in the log:

2021-03-23T16:26:21.020813 1 2093 6 Execute.cpp:2048 Query Memory Descriptor State
  Query Type: Estimator
  Keyless Hash: False
  Effective key width: 8
  Number of group columns: 0
  Col Slot Context State Empty
  Allow Multifrag: True
  Interleaved Bins on GPU: False
  Blocks Share Memory: False
  Threads Share Memory: True
  Uses Fast Group Values: False
  Lazy Init Groups (GPU): True
  Entry Count: 1
  Min Val (perfect hash only): 0
  Max Val (perfect hash only): 0
  Bucket Val (perfect hash only): 0
  Sort on GPU: False
  Use Streaming Top N: False
  Output Columnar: False
  Render Output: False
  Use Baseline Sort: False
2021-03-23T16:26:21.021202 1 2093 6 Execute.cpp:2110 Launching 108 kernels for query.

MarcusGDaniels commented 3 years ago

ok, looks like ENABLE_JIT_DEBUG is a build-time thing. will try.

MarcusGDaniels commented 3 years ago

The call sequence changed a bit. There are just a couple of arrays in that function: the key_bytes used in the MurmurHash1, and the bitmap that is updated.

(gdb) bt
#0  0x0000210003240a9c in linear_probabilistic_count ()
#1  0x00002100032411ec in filter_func ()
#2  0x0000210003241138 in row_func ()
#3  0x0000210003240fdc in query_template ()
#4  0x0000210003240c30 in multifrag_query_hoisted_literals ()
#5  0x00000000125b2ed0 in QueryExecutionContext::launchCpuCode(RelAlgExecutionUnit const&, CpuCompilationContext const*, bool, std::vector<signed char, std::allocator<signed char> > const&, std::vector<std::vector<signed char const*, std::allocator<signed char const*> >, std::allocator<std::vector<signed char const*, std::allocator<signed char const*> > > >, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, int, int*, unsigned int, std::vector<long, std::allocator<long> > const&) (
    this=0x21013c9ff080, ra_exe_unit=..., native_code=0x2100ac1f5b20, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=..., scan_limit=0,
    error_code=0x2100fa07a7dc, num_tables=2, join_hash_tables=...) at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665

extern "C" NEVER_INLINE void linear_probabilistic_count(uint8_t* bitmap,
                                                        const uint32_t bitmap_bytes,
                                                        const uint8_t* key_bytes,
                                                        const uint32_t key_len) {
  const uint32_t bit_pos = MurmurHash1(key_bytes, key_len, 0) % (bitmap_bytes * 8);
  const uint32_t word_idx = bit_pos / 32;
  const uint32_t bit_idx = bit_pos % 32;
  reinterpret_cast<uint32_t*>(bitmap)[word_idx] |= 1 << bit_idx;
}
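For context on what that function does: linear probabilistic counting hashes each key to a single bit of a large bitmap, and the distinct-value estimate is later derived from the fraction of bits still unset. A self-contained sketch of the same update step (the mixing hash below is a stand-in for MurmurHash1, and the helper names are mine, not HeavyDB's):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for MurmurHash1: any well-mixed 32-bit hash serves for this sketch.
static uint32_t toy_hash(const uint8_t* key, uint32_t len, uint32_t seed) {
  uint32_t h = seed ^ len;
  for (uint32_t i = 0; i < len; ++i) {
    h ^= key[i];
    h *= 0x5bd1e995u;  // Murmur-style multiply
    h ^= h >> 13;      // and mix
  }
  return h;
}

// Same update shape as linear_probabilistic_count: set one bit per key.
void lpc_update(uint8_t* bitmap, uint32_t bitmap_bytes,
                const uint8_t* key_bytes, uint32_t key_len) {
  const uint32_t bit_pos = toy_hash(key_bytes, key_len, 0) % (bitmap_bytes * 8);
  reinterpret_cast<uint32_t*>(bitmap)[bit_pos / 32] |= 1u << (bit_pos % 32);
}

// Count the set bits; the distinct-count estimate is derived from this.
int set_bits(const std::vector<uint8_t>& bm) {
  int n = 0;
  for (uint8_t b : bm) {
    for (int i = 0; i < 8; ++i) n += (b >> i) & 1;
  }
  return n;
}
```

Duplicate keys hash to the same bit, so repeated inserts are idempotent; only new distinct keys can set new bits.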

MarcusGDaniels commented 3 years ago

It seems like what is happening is impossible. Here's a modified version of linear_probabilistic_count:

extern "C" NEVER_INLINE void linear_probabilistic_count(uint8_t* bitmap,
                                                        const uint32_t bitmap_bytes,
                                                        const uint8_t* key_bytes,
                                                        const uint32_t key_len) {
  printf("bitmap: %p bitmap_bytes: %u key_len: %u\n", bitmap, bitmap_bytes, key_len);
  fflush(stdout);
  const uint32_t bit_pos = MurmurHash1(key_bytes, key_len, 0) % (bitmap_bytes * 8);
  const uint32_t word_idx = bit_pos / 32;
  const uint32_t bit_idx = bit_pos % 32;
  printf("bit_pos: %u word_idx: %u bit_idx: %u\n", bit_pos, word_idx, bit_idx);
  fflush(stdout);
  reinterpret_cast<uint32_t*>(bitmap)[word_idx] |= 1 << bit_idx;
}

It fails after apparently succeeding with the exact same parameterization as the previous call:

bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
bit_pos: 672551590 word_idx: 21017237 bit_idx: 6
bitmap: 0x21040af90010 bitmap_bytes: 268435456 key_len: 8
[note no "bit_pos" line]
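A quick arithmetic check on those logged values supports the "impossible" impression: with bitmap_bytes = 268435456 (2^28) the bitmap holds 2^31 bits, i.e. 2^26 uint32 words, and the logged word_idx of 21017237 is well inside that range, so the store itself is in bounds:

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the printf trace above.
constexpr uint32_t kBitmapBytes = 268435456u;     // 2^28 bytes => 2^31 bits
constexpr uint32_t kBitPos = 672551590u;          // logged bit position
constexpr uint32_t kWordIdx = kBitPos / 32;       // uint32 word that is written
constexpr uint32_t kNumWords = kBitmapBytes / 4;  // 67,108,864 uint32 words
```

If the write is in bounds, the crash would have to come from the bitmap pointer itself being invalid in that invocation, not from the index arithmetic.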

MarcusGDaniels commented 3 years ago

Is there a way to use just one thread?

MarcusGDaniels commented 3 years ago

I tried lldb and the addresses are different from above. It is some new bitmap base address. That said, I don't get source level symbols when I build with ENABLE_JIT_DEBUG. Should I?

cdessanti commented 3 years ago

Well, we used to have builds for PPC, but we stopped because no one was asking for them; that's bad because there are systems with NVLink-attached GPUs.

I've got access to such a system, but I'm not using it because of the network it is on and the sensitivity of my data. Would Omnisci have better performance there (e.g. with V100s with nvlink) than on an Intel-based system? (Just trying to justify the advocacy and work required to move and whether this is worth debugging.)

There is a benchmark somewhere that shows better performance on the first run of queries because of the CPU-->GPU nvlink connection (AFAIK, we don't actually make use of NVLinks between GPUs, but I think someone is working on it), so when data is transferred from system memory to GPU memory you will get a nice boost in performance, but if you have everything already on GPU nothing is going to change.

The V100s are likely faster than the GTX 1080 Ti, but not by much (20-25%, because of the higher number of shaders and the higher bandwidth).

Is there a way to use just one thread?

In CPU mode, the system allocates one thread per fragment, so you can either re-create the table with a fragment size greater than the number of rows, or filter the data so that only one fragment is processed. The simplest way to do that is to filter on rowid.

Assuming the fragment_size is the default 32M:

select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;

This will run just one thread:

205ms total duration for executeRelAlgQuery
  205ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:273
    0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:274
    205ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:595
      205ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:674
        205ms start(0ms) executeCompound RelAlgExecutor.cpp:1689
          205ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:2837
            5ms start(0ms) compileWorkUnit NativeCodegen.cpp:2481
              New thread(26)
                0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:93
                0ms start(0ms) fetchChunks Execute.cpp:2404
                0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:717
                198ms start(0ms) executePlanWithGroupBy Execute.cpp:3008
                  198ms start(0ms) launchCpuCode QueryExecutionContext.cpp:596
                  0ms start(199ms) getRowSet QueryExecutionContext.cpp:154
              End thread(26)
            0ms start(205ms) collectAllDeviceResults Execute.cpp:1841
              0ms start(205ms) reduceMultiDeviceResults Execute.cpp:901
                0ms start(205ms) reduceMultiDeviceResultSets Execute.cpp:947
2021-03-24T10:06:50.043555 I 6079 17 DBHandler.cpp:1215 DEBUG_TIMER thread_id(17)
247ms total duration for sql_execute
2021-03-24T10:06:50.043632 I 6079 17 DBHandler.cpp:1212 stdlog sql_execute 19 247 omnisci admin 549-HwiO {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"select /*+ cpu_mode */ distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;","tcp:localhost:59068","","246","247"}

on GPU, you can limit the number of threads using those parameters --cuda-block-size=1 --cuda-grid-size=1

This would run the kernel using a single thread; the query below takes 15000ms on a single GPU thread.

omnisql> select distinct DATE_TRUNC(month, dep_timestamp) from flights_sk_4_120m where rowid between 1 and 10000000 ;

Regards, Candido

MarcusGDaniels commented 3 years ago

What is the relationship between rowid and fragment count in general? How do I know I tested the hypothesis that the problem is related to parallelism, rather than simply avoiding the problem by limiting the scope of the query? Btw, I tried taskset -c 0, but that did not help.

MarcusGDaniels commented 3 years ago

I do have an intermediate table where I could introduce a large fragment size.

cdessanti commented 3 years ago

I do have an intermediate table where I could introduce a large fragment size.

As you prefer.

What is the relationship between rowid and fragment count in general? How do I know I tested the hypothesis that the problem is related to parallelism, rather than simply avoiding the problem by limiting the scope of the query? Btw, I tried taskset -c 0, but that did not help.

The data in omniscidb is stored in fragments. They are something like partitions (or extents), and the min and max values of each field are stored, so if you run a query with a filter, those values are checked before processing the fragment. The rowid is a pseudo-column that identifies a row: rows with rowid between 0 and 31999999 will be in the first fragment if the fragment size is 32000000, rows with rowid between 32000000 and 63999999 will be in the second, and so on.
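The rowid-to-fragment mapping described here is just integer division. A one-line sketch (the helper name is mine; 32000000 is the default fragment size mentioned above):

```cpp
#include <cassert>
#include <cstdint>

// Which fragment a given rowid lands in: fragments are contiguous
// ranges of fragment_size rows, so it is plain integer division.
int64_t fragment_of(int64_t rowid, int64_t fragment_size = 32000000) {
  return rowid / fragment_size;
}
```

So a filter like `rowid between 1 and 10000000` touches only fragment 0, which is why the earlier query ran on a single CPU thread.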

MarcusGDaniels commented 3 years ago

Both the "taskset -c 0" experiment and using a duplicate table with an enormous fragment size result in the same symptom:

#0  0x00002100035b03b4 in hash_join_idx ()
(gdb) up
#1  0x00002100035b1174 in row_func ()
(gdb) up
#2  0x00002100035b104c in query_template ()
(gdb) up
#3  0x00002100035b0ca0 in multifrag_query_hoisted_literals ()
(gdb) up
#4  0x00000000125b2ed0 in QueryExecutionContext::launchCpuCode(RelAlgExecutionUnit const&, CpuCompilationContext const*, bool, std::vector<signed char, std::allocator<signed char> > const&, std::vector<std::vector<signed char const*, std::allocator<signed char const*> >, std::allocator<std::vector<signed char const*, std::allocator<signed char const*> > > >, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > > const&, int, int*, unsigned int, std::vector<long, std::allocator<long> > const&) (this=0x2101fc001670,
    ra_exe_unit=..., native_code=0x2101a0085b30, hoist_literals=true, literal_buff=..., col_buffers=..., num_rows=..., frag_offsets=...,
    scan_limit=0, error_code=0x2100b081a7dc, num_tables=2, join_hash_tables=...)
    at /home/mgd/src/omniscidb-5.5.2/QueryEngine/QueryExecutionContext.cpp:665
665         reinterpret_cast(native_code->func())(multifrag_cols_ptr,

MarcusGDaniels commented 3 years ago

It looks like I am indeed only getting one active thread, although many are allocated but unused. Do you happen to know the last time PPC64 was tested? Perhaps I could go back to a corresponding version of LLVM to recreate that? On the other hand, I see that ENABLE_JIT_DEBUG, which I am using, disables LLVM optimization, so an esoteric optimization specific to PPC64 doesn't seem like a very plausible explanation.

MarcusGDaniels commented 3 years ago

I built a version of LLVM 9.0.1, which is what I see your binaries use. This fixed the crashes above on PPC64LE. However, I noticed that it has to be a Release build; otherwise this error arises:

omnisci_server: /home/mgd/src/llvm-project/llvm/lib/IR/Instructions.cpp:398: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(Args.size() == FTy->getNumParams() || (FTy->isVarArg() && Args.size() > FTy->getNumParams())) && "Calling a function with bad signature!"' failed.

I suspect this may be related to why LLVM 11 does not work. The issue is that the Args object has zero count.

As for the determinism issue, that had a straightforward explanation: OmniSci has a limit of 100000 records returned from a SELECT. When there is a lot of parallelism, it is possible to get different subsets.
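That explanation is easy to reproduce in miniature: when more distinct values exist than the result limit, and the order in which they are produced varies from run to run, truncating at the limit keeps a run-dependent subset. A minimal sketch (the shuffle stands in for nondeterministic parallel completion order; the function name is mine):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <set>
#include <vector>

// Simulate n_distinct distinct values arriving in a run-dependent order,
// then truncate to a fixed result limit, as a capped SELECT would.
std::set<int> truncated_result(int n_distinct, int limit, unsigned seed) {
  std::vector<int> vals(n_distinct);
  std::iota(vals.begin(), vals.end(), 0);                      // distinct values
  std::shuffle(vals.begin(), vals.end(), std::mt19937(seed));  // arrival order
  return std::set<int>(vals.begin(), vals.begin() + limit);    // keep first `limit`
}
```

Two runs with different arrival orders return result sets of the same size, but in general with different membership.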