Closed marioroy closed 1 year ago
The following workaround resolves the issue; output completing in 0.001 seconds.
97,98c97,98
< [&word_counts](std::vector<std::string>&& lines) {
< for (auto& line : lines) {
---
> [&word_counts, &lines_array, i] {
> for (auto& line : lines_array[i]) {
121,122c121
< },
< std::move(lines_array[i])
---
> }
Ah, C++ boggles my mind :). Thank you, for the parallel hash map library.
Previously, seeing output in mt_word_counter
take over 1 second, for less than 8,000 key-value pairs, puzzled me. This is now resolved in my gist.
8 threads: Clear Linux (left), Fedora 38 (right)
$ NUM_THREADS=8 ./mt_word_counter | cksum
# memory consumption ~ 1360 MB
populate arrays 0.991 secs 1.037 secs
compute word counts 4.847 secs 5.307 secs
output results 0.001 secs 0.001 secs
total time 5.839 secs 6.346 secs
2419766184 3124
$ NUM_THREADS=8 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts 4.575 secs 5.000 secs
output results 0.001 secs 0.001 secs
total time 4.577 secs 5.002 secs
2419766184 3124
$ NUM_THREADS=8 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts 4.574 secs 5.019 secs
output results 0.001 secs 0.001 secs
total time 4.576 secs 5.021 secs
2419766184 3124
24 threads: Clear Linux (left), Fedora 38 (right)
$ NUM_THREADS=24 ./mt_word_counter | cksum
# memory consumption ~ 1372 MB
populate arrays 1.056 secs 1.119 secs
compute word counts 2.285 secs 2.762 secs
output results 0.001 secs 0.001 secs
total time 3.344 secs 3.884 secs
2419766184 3124
$ NUM_THREADS=24 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts 2.209 secs 2.573 secs
output results 0.001 secs 0.001 secs
total time 2.210 secs 2.575 secs
2419766184 3124
$ NUM_THREADS=24 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts 2.210 secs 2.571 secs
output results 0.001 secs 0.001 secs
total time 2.212 secs 2.573 secs
2419766184 3124
I just realized that the individual threads can update a local map. Subsequently, the threads update the shared map after exiting the loop. This is something I tried in order to scale better.
The gist was updated to showcase both parallel_flat_hash_map_m
(for shared map) and flat_hash_map
(for local map). What I find interesting is not experiencing any extra IO time for the chunking demonstrations when comparing compute.
8 threads: Clear Linux (left), Fedora 38 (right)
$ NUM_THREADS=8 ./mt_word_counter | cksum
# memory consumption ~ 1360 MB
populate arrays 0.983 secs 1.042 secs
compute word counts 1.586 secs 1.924 secs
output results 0.001 secs 0.001 secs
total time 2.571 secs 2.968 secs
2419766184 3124
$ NUM_THREADS=8 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts 1.575 secs 1.799 secs
output results 0.001 secs 0.001 secs
total time 1.577 secs 1.801 secs
2419766184 3124
$ NUM_THREADS=8 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts 1.578 secs 1.779 secs
output results 0.001 secs 0.001 secs
total time 1.580 secs 1.781 secs
2419766184 3124
24 threads: Clear Linux (left), Fedora 38 (right)
$ NUM_THREADS=24 ./mt_word_counter | cksum
# memory consumption ~ 1372 MB
populate arrays 1.056 secs 1.114 secs
compute word counts 0.543 secs 0.623 secs
output results 0.001 secs 0.001 secs
total time 1.601 secs 1.740 secs
2419766184 3124
$ NUM_THREADS=24 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts 0.537 secs 0.606 secs
output results 0.001 secs 0.001 secs
total time 0.538 secs 0.608 secs
2419766184 3124
$ NUM_THREADS=24 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts 0.537 secs 0.597 secs
output results 0.001 secs 0.001 secs
total time 0.539 secs 0.599 secs
2419766184 3124
I'm closing the issue. I was able to work around the anomaly.
I created a new gist C++ parallel chunking demonstrations, inspired by your
mt_word_counter.cc
example.Once parallel processing is completed, the time to output takes
mt_word_counter
longer than expected. I'm not sure if this is a bug or nothing more than an anomaly. This anomaly caused me to create another chunking variant to factor out OpenMP. Bothomp_word_counter
andthr_word_counter
work as expected, time wise.There are less than 8,000 (or 7,652 to be exact) key-value pairs in the
word_counts
map table.I also ran on two different Linux distributions.
8 threads: Clear Linux (left), Fedora 38 (right)
24 threads: Clear Linux (left), Fedora 38 (right)