heavyai / heavydb

HeavyDB (formerly OmniSciDB)
https://heavy.ai
Apache License 2.0

Loading billions of rows into tables with string-dictionary-encoded columns is relatively slow. #713

Open gunlinan opened 2 years ago

gunlinan commented 2 years ago

We then tried the g_enable_stringdict_parallel=true option but found that StringDictionary::getOrAddBulkParallel is not really parallel, despite its name. It has the same bottleneck as StringDictionary::getOrAddBulk (on the rwmutex).

Is the (distributed?) string dictionary in the paid OmniSci Enterprise edition faster?

cdessanti commented 2 years ago

Hi @gunlinan ,

Have you tried turning on the --stringdict-parallelizm parameter? It enables the StringDictionary::getOrAddBulkParallel method when working with dictionary-encoded strings. The speed of the copy is better, assuming you are not CPU limited: the import throughput with a 12-core TR 1920X processor is around 3M records/sec compared to 2.1M without the parameter, loading the lineitem table of the TPC-H benchmark from an uncompressed CSV.

About the enterprise edition, I have never tried it; I will try to get some information.

Regards, Candido

gunlinan commented 2 years ago

We had already tried --stringdict-parallelizm and found no obvious gain, so we then hardcoded g_enable_stringdict_parallel=true.

In gdb, we set a breakpoint on https://github.com/omnisci/omniscidb/blob/master/StringDictionary/StringDictionary.cpp#L444 and disabled gdb's all-stop mode. Our box has an 8-core CPU. Whenever omniscidb hit the breakpoint, it had 7 other threads blocked on https://github.com/omnisci/omniscidb/blob/master/StringDictionary/StringDictionary.cpp#L437. This shows that StringDictionary::getOrAddBulkParallel is not parallel.

cdessanti commented 2 years ago

Hi,

I used my workstation with 12c/24t and an uncompressed file as the source (a gzipped source limited CPU use to about 600% in both cases). Turning on the stringdict-parallelizm parameter, CPU use climbed from 750% to 1200%.

StringDictionary::getOrAddBulkParallel appears to be parallel only in the part where the hashes for the strings are computed. After that, as you noticed, the lock serializes the rest of the function; I guess it's needed to ensure that no more than one id is assigned to the same string. Anyway, setting the parameter does yield an improvement.
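To make the shape of that code path concrete, here is a minimal, hypothetical sketch of the pattern being described (this is not the actual StringDictionary code; SimpleDict and getOrAddBulkSketch are invented names): the hashes are computed by several threads, but id assignment still happens under one exclusive lock, which is why extra cores mostly add contention.

```cpp
// Minimal sketch (not omniscidb code) of the pattern discussed above: hashing
// runs in parallel, but id assignment is serialized behind one exclusive lock.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <shared_mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

class SimpleDict {
 public:
  std::vector<int32_t> getOrAddBulkSketch(const std::vector<std::string>& strings) {
    // Parallel phase: every worker hashes a slice of the input without locking.
    std::vector<size_t> hashes(strings.size());
    const size_t n_workers = std::max<size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (size_t w = 0; w < n_workers; ++w) {
      workers.emplace_back([&, w] {
        for (size_t i = w; i < strings.size(); i += n_workers) {
          hashes[i] = std::hash<std::string>{}(strings[i]);
        }
      });
    }
    for (auto& t : workers) {
      t.join();
    }

    // Serialized phase: a single write lock guarantees exactly one id per
    // distinct string, but it is also the bottleneck described in this issue.
    std::vector<int32_t> ids(strings.size());
    std::unique_lock<std::shared_mutex> write_lock(rw_mutex_);
    for (size_t i = 0; i < strings.size(); ++i) {
      (void)hashes[i];  // a real implementation would reuse the precomputed hash
      auto [it, inserted] = str_to_id_.try_emplace(strings[i], next_id_);
      if (inserted) {
        ++next_id_;
      }
      ids[i] = it->second;
    }
    return ids;
  }

 private:
  std::shared_mutex rw_mutex_;
  std::unordered_map<std::string, int32_t> str_to_id_;
  int32_t next_id_{0};
};

int main() {
  SimpleDict dict;
  for (const int32_t id : dict.getOrAddBulkSketch({"a", "b", "a", "c"})) {
    std::cout << id << ' ';  // prints: 0 1 0 2
  }
  std::cout << '\n';
}
```

Under this structure the parallel phase scales with cores while the serialized phase does not, which matches the breakpoint observation above.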

I also tried setting the import thread count to 8 with the --max-importthreads parameter, and the difference, as expected, is almost non-existent: 2.3 vs 2.1. So I think the effectiveness of this parameter depends on the number of cores and the data you are using (I guess it makes almost no difference on low-cardinality string columns).

Regards, Candido

gunlinan commented 2 years ago

Good tries. Thank you.

Low or high cardinality, the serialized part takes the biggest toll. Because of the exclusive lock, more CPU cores simply mean more threads blocked (and more context switches), regardless of cardinality. The high-cardinality case makes things worse, as it leads to higher rehash cost and longer system underutilization.

Essentially, StringDictionary is just another (and average) implementation of a locked hash table, if we may say so.

It should try a lock-free data structure.
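Purely as an illustration of the direction being suggested (none of this exists in omniscidb; ShardedDict, getOrAdd and kNumShards are made-up names), here is a striped-lock sketch that shards the dictionary by hash so concurrent writers mostly take different locks. It is not truly lock-free, but it shows one way to cut contention on a single rwmutex.

```cpp
// Hypothetical striped-lock dictionary (not omniscidb code): the map is split
// into shards keyed by hash, so threads contend only when they hit the same shard.
#include <array>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_map>

class ShardedDict {
 public:
  int32_t getOrAdd(const std::string& str) {
    const size_t shard = std::hash<std::string>{}(str) % kNumShards;
    std::lock_guard<std::mutex> guard(shards_[shard].mutex);
    auto& map = shards_[shard].str_to_id;
    const auto it = map.find(str);
    if (it != map.end()) {
      return it->second;
    }
    // Ids stay globally unique by interleaving: shard s hands out
    // s, s + kNumShards, s + 2 * kNumShards, ...
    const int32_t id =
        shards_[shard].next_local_id * kNumShards + static_cast<int32_t>(shard);
    ++shards_[shard].next_local_id;
    map.emplace(str, id);
    return id;
  }

 private:
  static constexpr int32_t kNumShards = 16;
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, int32_t> str_to_id;
    int32_t next_local_id{0};
  };
  std::array<Shard, kNumShards> shards_;
};

int main() {
  ShardedDict dict;
  std::cout << dict.getOrAdd("apple") << ' ' << dict.getOrAdd("banana") << ' '
            << dict.getOrAdd("apple") << '\n';  // "apple" gets the same id both times
}
```

One trade-off of this particular scheme: ids are no longer dense in insertion order, which a real dictionary-encoded column likely cares about, so a production design would need a different id-assignment strategy or a genuinely concurrent/lock-free map.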

cdessanti commented 2 years ago

Hi,

If you have high-cardinality fields, the hashing part is the one that limits performance the most, and that's the reason I get higher throughput with a higher number of cores available.

The throughput is 2.9 million records per second with 12 threads and a CPU consumption of 1200% using the parallel path, versus 2.1 million per second with a CPU consumption of 750% using the standard path.

Removing the lock doesn't change much, except for slightly increased CPU use and slower throughput.

With a wider table with tens of dictionary-encoded columns, the standard path performs better than the parallel one: 1.6 million records per second with the parallel path at 1800% CPU usage, versus 1.9 million with the standard path at 1200% CPU usage. Here too, there isn't any benefit in removing the lock.

Anyway, we are improving the performance of data loading: since the 5.6 release, loading tables like lineitem, i.e. with high-cardinality dictionary-encoded text, has been greatly enhanced on both paths.

from 1.7 to 2.8 million records per second using the parallel path, and from 1.5 to 2.1 using the standard one.

I did those tests using the built-in copy command that's inherently parallel; what are you using to do the ingestion of data?

Best regards, Candido

gunlinan commented 2 years ago

We ran copy command too.

In your first scenario, the ratios are:
CPU: 1200% / 750% = 160%
Throughput: 2.9M / 2.1M = 138%

That means increasing CPU usage by 60% improves throughput by only 38%. Over a third of the extra CPU usage produced nothing but heat in the system (likely spent on context switches). The waste looks even worse for the wider table.

We are not sure what the gain would be with the same data on a table with 100 shards, because we are interested in how well this performs on multi-node configurations.

We are not doubting the implementation or the improvement, but let's put it this way: if a plain hash map (plus parallel hash calculation) were enough, why would concurrent hash maps exist?

cdessanti commented 2 years ago

Hi,

The gains I reported used a sharded table with just two shards (I have 2 GPUs in the system and I have to join that table with another), but such a large number of shards is likely to slow things down, especially if you have high-cardinality texts. With that many shards, are you planning to use a big number of servers, or a smaller number of servers with several GPUs each?

Candido

gunlinan commented 2 years ago

We project that a big number of shards can help us better mimic the HA mechanism of Cassandra, so that our planned multi-tenant OLAP cloud service can fail over one or more malfunctioning CPU nodes or GPU cards without requiring extra standby CPU/GPU servers or spinning up new ones, and without us having to worry about memory thrashing between CPUs and GPUs and between CPUs and storage. If the number of shards is small, redistributed shards (shard hand-off) may easily fill up the CPU/GPU memory of the servers that take over the shards (from multiple tenants), causing memory thrashing.

This brings up a question: without spinning up new or standby servers, can omniscidb keep running if a CPU node or GPU card fails? We wish omnisci had open-sourced the HA-related code so that we would not need to bug you here with similar architectural questions 😜