Possible anomaly running the mt_word_counter example

marioroy commented 1 year ago

I created a new gist C++ parallel chunking demonstrations, inspired by your mt_word_counter.cc example.

Once parallel processing is completed, the time to output takes mt_word_counter longer than expected. I'm not sure if this is a bug or nothing more than an anomaly. This anomaly caused me to create another chunking variant to factor out OpenMP. Both omp_word_counter and thr_word_counter work as expected, time wise.

There are less than 8,000 (or 7,652 to be exact) key-value pairs in the word_counts map table.

I also ran on two different Linux distributions.

8 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=8 ./mt_word_counter | cksum
# memory consumption ~ 1360 MB
populate arrays        0.993 secs    1.031 secs
compute word counts    5.142 secs    5.601 secs
output results         1.082 secs    1.100 secs
total time             7.218 secs    7.733 secs
2419766184 3124

$ NUM_THREADS=8 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    4.575 secs    5.000 secs
output results         0.001 secs    0.001 secs
total time             4.577 secs    5.002 secs
2419766184 3124

$ NUM_THREADS=8 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    4.574 secs    5.019 secs
output results         0.001 secs    0.001 secs
total time             4.576 secs    5.021 secs
2419766184 3124

24 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=24 ./mt_word_counter | cksum
# memory consumption ~ 1372 MB
populate arrays        1.082 secs    1.111 secs
compute word counts    2.574 secs    2.874 secs
output results         1.146 secs    1.177 secs
total time             4.803 secs    5.162 secs
2419766184 3124

$ NUM_THREADS=24 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    2.209 secs    2.573 secs
output results         0.001 secs    0.001 secs
total time             2.210 secs    2.575 secs
2419766184 3124

$ NUM_THREADS=24 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    2.210 secs    2.571 secs
output results         0.001 secs    0.001 secs
total time             2.212 secs    2.573 secs
2419766184 3124

marioroy commented 1 year ago

The following workaround resolves the issue; output completing in 0.001 seconds.

97,98c97,98
<             [&word_counts](std::vector<std::string>&& lines) {
<                 for (auto& line : lines) {
---
>             [&word_counts, &lines_array, i] {
>                 for (auto& line : lines_array[i]) {
121,122c121
<             },
<             std::move(lines_array[i])
---
>             }

Ah, C++ boggles my mind :). Thank you, for the parallel hash map library.

marioroy commented 1 year ago

Previously, seeing output in mt_word_counter take over 1 second, for less than 8,000 key-value pairs, puzzled me. This is now resolved in my gist.

8 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=8 ./mt_word_counter | cksum
# memory consumption ~ 1360 MB
populate arrays        0.991 secs    1.037 secs
compute word counts    4.847 secs    5.307 secs
output results         0.001 secs    0.001 secs
total time             5.839 secs    6.346 secs
2419766184 3124

$ NUM_THREADS=8 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    4.575 secs    5.000 secs
output results         0.001 secs    0.001 secs
total time             4.577 secs    5.002 secs
2419766184 3124

$ NUM_THREADS=8 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    4.574 secs    5.019 secs
output results         0.001 secs    0.001 secs
total time             4.576 secs    5.021 secs
2419766184 3124

24 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=24 ./mt_word_counter | cksum
# memory consumption ~ 1372 MB
populate arrays        1.056 secs    1.119 secs
compute word counts    2.285 secs    2.762 secs
output results         0.001 secs    0.001 secs
total time             3.344 secs    3.884 secs
2419766184 3124

$ NUM_THREADS=24 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    2.209 secs    2.573 secs
output results         0.001 secs    0.001 secs
total time             2.210 secs    2.575 secs
2419766184 3124

$ NUM_THREADS=24 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    2.210 secs    2.571 secs
output results         0.001 secs    0.001 secs
total time             2.212 secs    2.573 secs
2419766184 3124

marioroy commented 1 year ago

I just realized that the individual threads can update a local map. Subsequently, the threads update the shared map after exiting the loop. This is something I tried in order to scale better.

The gist was updated to showcase both parallel_flat_hash_map_m (for shared map) and flat_hash_map (for local map). What I find interesting is not experiencing any extra IO time for the chunking demonstrations when comparing compute.

8 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=8 ./mt_word_counter | cksum
# memory consumption ~ 1360 MB
populate arrays        0.983 secs    1.042 secs
compute word counts    1.586 secs    1.924 secs
output results         0.001 secs    0.001 secs
total time             2.571 secs    2.968 secs
2419766184 3124

$ NUM_THREADS=8 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    1.575 secs    1.799 secs
output results         0.001 secs    0.001 secs
total time             1.577 secs    1.801 secs
2419766184 3124

$ NUM_THREADS=8 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    1.578 secs    1.779 secs
output results         0.001 secs    0.001 secs
total time             1.580 secs    1.781 secs
2419766184 3124

24 threads: Clear Linux (left), Fedora 38 (right)

$ NUM_THREADS=24 ./mt_word_counter | cksum
# memory consumption ~ 1372 MB
populate arrays        1.056 secs    1.114 secs
compute word counts    0.543 secs    0.623 secs
output results         0.001 secs    0.001 secs
total time             1.601 secs    1.740 secs
2419766184 3124

$ NUM_THREADS=24 ./omp_word_counter | cksum
# memory consumption < 1 MB
compute word counts    0.537 secs    0.606 secs
output results         0.001 secs    0.001 secs
total time             0.538 secs    0.608 secs
2419766184 3124

$ NUM_THREADS=24 ./thr_word_counter | cksum
# memory consumption < 1 MB
compute word counts    0.537 secs    0.597 secs
output results         0.001 secs    0.001 secs
total time             0.539 secs    0.599 secs
2419766184 3124

marioroy commented 1 year ago

I'm closing the issue. I was able to work around the anomaly.

greg7mdp / parallel-hashmap

Possible anomaly running the mt_word_counter example #199