greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

Ability to reset the inner sub map #238

Closed marioroy closed 5 months ago

marioroy commented 5 months ago

Hi, @greg7mdp

I tested variable-length keys. It turns out the llil4map.cc demonstration consumes the most memory. The reason is the inability to clear or reset the inner sub map. Without this ability, memory consumption peaks at 18.4 GB; with a reset_inner capability, it peaks at 15.1 GB.

Here, we get ready to insert the key/value pairs into a vector.

         propvec.resize(map.size());
         // Compute each sub map's starting position in propvec:
         // an exclusive prefix sum of the sub map sizes.
         std::vector<vec_str_int_type::iterator> I; I.resize(map.subcnt());
         I[0] = propvec.begin();
         size_t prev_num_keys;

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               if (i == 0) {
                  prev_num_keys = set.size();
               } else {
                  I[i] = I[i-1] + prev_num_keys;
                  prev_num_keys = set.size();
               }
            });
         }  

         #pragma omp parallel for schedule(static, 1) num_threads(5)
         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
            });
            map.reset_inner(i);             // <--- need ability to reset the inner sub map here
         }

I added the reset_inner method to phmap.h, locally on my machine.

diff --git a/parallel_hashmap/phmap.h b/parallel_hashmap/phmap.h
index d9a5b7b..c8a3d62 100644
--- a/parallel_hashmap/phmap.h
+++ b/parallel_hashmap/phmap.h
@@ -3444,6 +3444,10 @@ public:
         return  sets_[idx];
     }

+    void reset_inner(size_t idx) {
+        sets_[idx].set_ = EmbeddedSet();
+    }
+
     // Extension API: support for heterogeneous keys.
     //
     //   std::unordered_set<std::string> s;

Great news!

This change enables llil4map to reach llil4hmap performance when processing variable-length keys. Previously, the llil4map demonstration ran slower. Being able to clear/reset a sub map early is beneficial when the sub map is no longer needed. In the C++ snippet above, parallel threads process each sub map individually (the "map to vector" stage below). I prefer the map's memory consumption to decrease while the vector's memory increases, rather than waiting until the end to clear the map. There are no issues for fixed-length keys.

llil4map.cc

The sub maps are managed by the phmap library.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort       Before 18.4GB     After 15.1GB
get properties         9.032 secs        8.969 secs
map to vector          3.054 secs        2.019 secs    <--- improvement
vector stable sort     3.099 secs        2.944 secs
write stdout           0.587 secs        0.553 secs
total time            15.774 secs       14.486 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

llil4hmap.cc

The sub maps are managed at the application level. Notice the similar times for "get properties" and "map to vector". The key's hash_value is computed only once and stored with the key. I was curious at the time and left it this way.

$ ./llil4hmap in/big* in/big* in/big* | cksum
llil4hmap start
use OpenMP
use boost sort
get properties         8.896 secs
map to vector          2.075 secs
vector stable sort     2.957 secs
write stdout           0.586 secs
total time            14.516 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
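
The "hash computed once and stored with the key" idea of llil4hmap/llil4emh might look roughly like this (a hypothetical sketch; the names and types are illustrative, not the actual code):

#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Key type that carries its hash, computed once at construction.
struct str_with_hash {
   std::string key;
   std::size_t hashv;

   explicit str_with_hash(std::string k)
      : key(std::move(k)), hashv(std::hash<std::string>{}(key)) {}
};

// Hasher that returns the stored value instead of re-hashing the key.
struct hash_cached {
   std::size_t operator()(const str_with_hash& s) const noexcept { return s.hashv; }
};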

Now I understand why the two phmap demonstrations did not perform similarly before. You may like to know how far behind emhash this is: llil4map is 0.9 seconds apart with the reset_inner capability, versus 2.2 seconds apart in total time previously. Please, no worries about phmap vs. emhash. Lacking the ability to clear/reset the sub map was the reason for the greater-than-2-second gap (for variable-length keys).

llil4emh.cc

Here too, the sub maps are managed at the application level, including the key's hash_value stored with the key.

$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh start
use OpenMP
use boost sort
get properties         7.767 secs
map to vector          2.192 secs
vector stable sort     3.075 secs
write stdout           0.543 secs
total time            13.578 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
greg7mdp commented 5 months ago

Thanks for the great explanation @marioroy .

Can you do the reset inside the with_submap callback, as in:

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
               set = map_str_int_type::EmbeddedSet();    // <------------------------------
            });
         }
marioroy commented 5 months ago

Thanks. Per your suggestion, I removed const and inserted the line.

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
               set = map_str_int_type::EmbeddedSet();
            });
         }

Compilation fails with clang++:

In file included from llil4map.cc:51:
./parallel-hashmap/parallel_hashmap/phmap.h:3431:9: error: no matching function for call to object of type '(lambda at llil4map.cc:415:32)'
 3431 |         fCallback(set);
      |         ^~~~~~~~~
llil4map.cc:415:17: note: in instantiation of function template specialization 'phmap::priv::parallel_hash_set<12, phmap::priv::raw_hash_set, spinlock_mutex, phmap::priv::FlatHashMapPolicy<std::basic_string<char>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char>, unsigned int>>>::with_submap<(lambda at llil4map.cc:415:32)>' requested here
  415 |             map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
      |                 ^
llil4map.cc:415:32: note: candidate function not viable: 1st argument ('const EmbeddedSet' (aka 'const raw_hash_set<phmap::priv::FlatHashMapPolicy<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>>>')) would lose const qualifier
  415 |             map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
      |                                ^   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
greg7mdp commented 5 months ago

Sorry, my bad.

After the loop, and instead of map.reset_inner(i);, use the following, which is equivalent to your reset_inner:

map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
   set = map_str_int_type::EmbeddedSet();
});
marioroy commented 5 months ago

Thank you, @greg7mdp. Your suggestion works.

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(x.first), x.second);
            });
            // The sub map is no longer needed. Reset the set to reclaim memory.
            map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
               set = map_str_int_type::EmbeddedSet();
            });
         }

Notice "map to vector" completing in 2 seconds. This was taking 3 seconds, previously.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.946 secs
map to vector          2.018 secs    <-- Yeah! Thank you!
vector stable sort     2.960 secs
write stdout           0.589 secs
total time            14.514 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
greg7mdp commented 5 months ago

Hey @marioroy , I just tried your llil4map.cc myself. Would you mind if I added it to phmap as an example (with your attribution comment at the top of llil4map.cc, of course)?

Which parameters did you use for gen-llil.pl? I used the ones from the comment and I got:

❯ NUM_THREADS=3 ../ex_llil4map big1.txt big2.txt big3.txt >out.txt
llil4map (fixed string length=12) start
use OpenMP
don't use boost sort
get properties         0.695 secs
map to vector          0.093 secs
vector stable sort     1.316 secs
write stdout           0.061 secs
total time             2.166 secs
    count lines     10545600
    count unique    10368458
marioroy commented 5 months ago

Wow! I do not mind. llil4map.cc is final.

Thank you for your help, @greg7mdp; (1) some time ago, you found an off-by-one error in chunking; (2) at the time, you helped make "map to vector" parallel; (3) and now, you helped ensure minimum memory consumption. Blessings and grace.

That's it, just as written in the comments for generating the input files.

perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big2.txt 200 3 1
perl gen-llil.pl big3.txt 200 3 1

I found my script for making 92 input files, shuffled (requires 2.8 GB). On my machine (where llil4map.cc resides), "in" is a symbolic link to /data/input. Change the path accordingly.

#!/bin/bash
# Script for generating input files for llil4map.cc.

mkdir -p /data/input
cp gen-llil.pl shuffle.pl /data/input
cd /data/input

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >1; mv 1 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >2; mv 2 big$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >3; mv 3 big$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
   perl gen-llil.pl big$n 200 3 1
   perl shuffle.pl big$n >4; mv 4 big$n
done &

wait

The default mode in llil4map.cc is fixed string length = 12. Commenting out line 119 switches to variable-length keys (using std::basic_string). Processing the 92 input files requires 16 GB with the line commented out; fixed-length keys require much less.

// #define MAX_STR_LEN_L (size_t) 12
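
A hypothetical sketch of what this switch controls (the actual llil4map.cc key types may differ; it only illustrates the fixed-length vs. variable-length trade-off):

#include <array>
#include <string>

#ifdef MAX_STR_LEN_L
// Fixed-length mode: keys live inline, no per-key heap allocation.
using str_type = std::array<char, MAX_STR_LEN_L>;
#else
// Long strings mode: std::basic_string may heap-allocate per key.
using str_type = std::basic_string<char>;
#endif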

My PerlMonks friend eyepopslikeamosquito helped me learn C++.

I enjoy parallel processing. So naturally, I tried: (1) chunked IO in get properties using spinlock_mutex; (2) map to vector, also parallel; (3) parallel sort; (4) finally, parallel but orderly output.

greg7mdp commented 5 months ago

Wow! I do not mind. llil4map.cc is final.

Thanks, I'll add it as an example then, it is a great example!

greg7mdp commented 5 months ago

Added! Looks like you did an amazing job learning C++!

I saw a small perf. issue in llil4map.cc, where using the non-mutable with_submap does not allow std::move-ing the string out of the map. This could be an issue if long strings are used (and the fixed string is not).

Here is my updated version.

marioroy commented 5 months ago

Thanks. I also learned by reading phmap_fwd_decl.h and your examples.

I tried your updated version and added a line to reset the set, which is especially beneficial for long strings. Without it, "map to vector" takes 3 seconds for long strings and memory peaks near 18.4 GB.

         for (size_t i = 0; i < map.subcnt(); ++i) {
            map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
               auto it = I[i];
               for (auto& x : set)
                  *it++ = std::make_pair(std::move(const_cast<str_type&>(x.first)), x.second);
               // reset the set (no longer needed) to reclaim memory early
               set = map_str_int_type::EmbeddedSet();
            });
         }

I respect your decision if you prefer leaving the line commented out, i.e. long strings mode (line 119).

// #define MAX_STR_LEN_L (size_t) 12

Or default to fixed-length mode by uncommenting the line.

#define MAX_STR_LEN_L (size_t) 12

Here too, I respect your decision, disabled or enabled (line 46). I enabled it locally to validate the parallel sort. Disabled, "vector stable sort" takes 58.0 seconds (long strings mode) or 21.5 seconds (fixed-length mode).

#define USE_BOOST_PARALLEL_SORT 1

Testing was done on a 32-core (64 logical threads) AMD Ryzen Threadripper 3970X machine with DDR4-3600 memory.

$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort       without reset      with reset
get properties         9.035 secs        9.004 secs
map to vector          3.068 secs        2.010 secs  <--- reset, minimum peak memory consumption
vector stable sort     2.972 secs        2.988 secs
write stdout           0.639 secs        0.563 secs
total time            15.716 secs       14.567 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
greg7mdp commented 5 months ago

Thanks @marioroy , I reverted the defaults to the way you had them, and also added the reset of the set. On my system, an AMD Ryzen 9 7950X (16 cores, 32 threads) with DDR5 memory, I get the following, which is not the same as yours:

❯ ./run_llil4map
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties        12.296 secs
map to vector          1.300 secs
vector stable sort     2.133 secs
write stdout           3.247 secs
total time            18.978 secs
    count lines     970195200
    count unique    200485780
3341100189 1811165988
❯ cat ./run_llil4map
#!/bin/bash

cd tmp
../../ex_llil4map big* big* big* | cksum
marioroy commented 5 months ago

That is great performance for the AMD 7950X CPU. The "write stdout" taking 3.2 seconds is likely due to the L3 cache size difference between the two CPUs. I am providing the following output because the 7950X CPU is amazing; in other words, my system cannot reach a 2x faster total time despite having 64 logical threads. Building with clang++ runs faster for me.

It is impressive that "get properties" for the 7950x CPU completes in 12.3 seconds. Wow!

g++

$ g++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.496 secs
map to vector          0.965 secs
vector stable sort     1.740 secs
write stdout           0.657 secs
total time            10.859 secs
    count lines     970195200
    count unique    200483043

clang++

$ clang++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.496 secs
map to vector          0.791 secs
vector stable sort     1.137 secs
write stdout           0.591 secs
total time            10.017 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

Last year, I tried custom memory allocation. I was trying to reduce peak memory consumption to no avail, because it greatly impacted "write stdout". Thanks for the reset suggestion. Memory consumption no longer peaks high.

greg7mdp commented 5 months ago

It is weird that I don't have the same result for count unique as you do. Maybe a difference in gen-llil.pl?

marioroy commented 5 months ago

Not weird at all; it is because of the random generator script. It is normal for the unique count to differ per machine or whenever the input files are refreshed. If you run gen_files again, the unique count will likely differ.

Do you want to add the tmp dir to .gitignore?

/examples/llil_utils/tmp
greg7mdp commented 5 months ago

Sure, will do, even though in my case I generate the files in my build directory.

I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.

llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties        12.873 secs
map to vector          1.370 secs
vector stable sort     1.862 secs
write stdout           1.177 secs
total time            17.284 secs
    count lines     970195200
    count unique    200478308
marioroy commented 5 months ago

Sure, will do, even though in my case I generate the files in my build directory.

Oh, I wrongly assumed the location.

I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.

That is amazing.

I checked and you have the same defaults. Why is defaulting to fixed-length mode important? Mindfulness for users: the application can run with less memory consumption, with no surprises on machines with 8 to 16 GB. The reset was the missing piece.

#define USE_BOOST_PARALLEL_SORT 1
#define MAX_STR_LEN_L (size_t) 12

Folks may run with a partial list, not all 92 files. For example, just the "a" files, i.e. biga* biga* biga*, or the "a" and "b" files, i.e. big[ab]* big[ab]* big[ab]*.

greg7mdp commented 5 months ago

I have a question about the data files. They contain words followed by a number. Am I right to assume this number is always positive? I'll see if I can make get_properties a little bit faster, this is fun :-)

Also I added a couple minor simplifications in the code, see this

marioroy commented 5 months ago

Am I right to assume this number is always positive?

Yes, always positive is certain.

this is fun :-)

It's mind-boggling. We saw that clang++ is faster than g++. Also, -std=c++20 is faster than -std=c++17. Why did I stop at 92 input files? Running big* big* big* handles approximately 1 billion lines and 200 million unique keys. Testing big* once is mostly insertions into the map. Testing big* big* big* is a combination of insertions and updates. Testing smaller input is possible simply with the suffix biga* biga* biga*. Then, there are fixed-length mode and long strings mode. I'm grateful for the reset.

The application runs parallel at every stage: (1) input, (2) map to vector, (3) sorting, and (4) output. Crazy fast. The coolness factor is the phmap library managing the many sub maps internally (1 << 12 = 4096). Thank you for allowing a custom mutex class. The spinlock_mutex class improves parallel insertions/updates.

Randomness is beautiful. With the many sub maps, it is less likely for threads to be using the same sub map/mutex in a given time window or fraction of a second. If they do, no problem; the spinlock_mutex class has low overhead.
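
A minimal sketch of both pieces, assuming an std::atomic_flag-based spinlock and phmap's documented parallel_flat_hash_map template parameters (the actual llil4map.cc definitions may differ in detail):

#include <parallel_hashmap/phmap.h>
#include <atomic>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Spinlock satisfying the Lockable requirements phmap expects of its
// Mutex template parameter.
struct spinlock_mutex {
   void lock()     { while (flag.test_and_set(std::memory_order_acquire)) {} }
   bool try_lock() { return !flag.test_and_set(std::memory_order_acquire); }
   void unlock()   { flag.clear(std::memory_order_release); }
private:
   std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

// Parallel map with 2^12 = 4096 internal sub maps, guarded by the custom mutex.
using map_str_int_type = phmap::parallel_flat_hash_map<
   std::string, uint32_t,
   phmap::Hash<std::string>, phmap::EqualTo<std::string>,
   std::allocator<std::pair<const std::string, uint32_t>>,
   12,                // log2 of the sub map count
   spinlock_mutex>;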

Also I added a couple minor simplifications in the code

I had no idea about set.size(). I appreciate learning from a master. Ah, the clear/swap lines are not needed, because the map construction is inside a block and the reset works so well. Oh, the sorting simplifications. That's amazing.

Thank you, @greg7mdp.

greg7mdp commented 5 months ago

I'll play with it some more later. It is an interesting piece of code for sure!

greg7mdp commented 5 months ago

Hey @marioroy , just wondering, is your solution the fastest one that exists for this problem?

marioroy commented 5 months ago

is your solution the fastest one that exists for this problem?

Hi, @greg7mdp I am not aware of anything faster.

The llil4emh emhash demonstration is the fastest for fixed-length keys. For long strings or variable-length keys, llil4map and llil4emh perform similarly. The map demonstrations scale to many CPU cores with ease. These are the fastest beyond 12 CPU threads.

There is also the llil4vec implementation. This appends to a vector, with no maps, so it may consume a lot of memory, particularly with repeating keys. It may run faster up to some number of cores, but it does not scale well beyond that. If daring to run llil4vec, the most you can do in 32 GB is big* big* big*, and only with fixed-length=12. It peaks near 28 GB, so be careful; it may cause swapping.

llil4map peaks less than 8 GB for 200 million keys, fixed-length=12.

I created a backup plan in case the number of unique keys were to exceed memory capacity :-) Run with NUM_MAPS=128 maximum (default 32).

  1. llil4tkh tkrzw::ShardDBM demonstration. The hash dbms are managed by the C++ library.
  2. llil4tkh2 tkrzw::HashDBM demonstration. The hash dbms are managed by the application.
greg7mdp commented 5 months ago

Wow, thanks for all the information! You seem to have spent a lot of time on this. Is this something you are doing for work, or just for fun?

marioroy commented 5 months ago

Is this something you are doing for work, or just for fun?

For fun, actually. I wondered if it was possible to solve the challenge. A Perl Monk, "eyepopslikeamosquito", tried the vector approach, stating memory is cheap. My solutions took some time (I was new to C++). I chose the path that preferred low memory consumption.

Your phmap C++ library made this possible (emhash as well). OpenMP made parallel processing possible. Chunked IO and the spinlock mutex enable scaling to many CPU cores. The reset capability ensures low memory consumption during map to vector; I had no idea there was a way to do so, thank you for your tip. Boost's block_indirect_sort allows many threads and is fast. OpenMP's ordered directive works, allowing parallel yet orderly output.
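
A minimal sketch of the ordered-directive pattern (illustrative; not the actual llil4map.cc output loop). Threads format their chunks concurrently, and the ordered block serializes only the writes, in loop order:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

void write_ordered(const std::vector<std::string>& chunks) {
   #pragma omp parallel for ordered schedule(static, 1)
   for (std::size_t i = 0; i < chunks.size(); ++i) {
      // ... per-chunk formatting would happen here, in parallel ...
      #pragma omp ordered
      std::fwrite(chunks[i].data(), 1, chunks[i].size(), stdout);
   }
}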

marioroy commented 5 months ago

There is one more problem to resolve. That is for long strings or variable length keys.

The std::string (std::basic_string) container is not memory efficient, unfortunately. In long strings mode, llil4map peaks at 17.8 GB for the same input files (previously 18.4 GB without the reset). That is horrendous and more than double fixed-length mode. It is much worse for the llil4vec vector demonstration; that one peaks at 58 GB, requiring a VM or machine with 64 GB.

For comparison to llil4map, the llil4emh emhash implementation peaks at 20.0 GB in long strings mode. It computes the hash_value only once and stores it with the key.
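
Part of the overhead is the std::string object itself, before any heap allocation (a quick probe; on 64-bit libstdc++ this typically prints 32):

#include <iostream>
#include <string>

int main() {
   std::cout << sizeof(std::string) << '\n';   // typically 32 bytes per object
}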

marioroy commented 5 months ago

I made a memory-efficient version, llil4map2, using a vector of pointers to the phmap key-value pairs. That meant having to construct the phmap outside the block, and no reset. This results in sorting taking longer, possibly due to CPU cache misses. However, memory consumption is significantly less.

200mil unique keys   llil4map   llil4map2
------------------   --------   --------
fixed-length=12      ~ 8.0 GB     6.0 GB
long-strings          17.8 GB    12.3 GB

I removed the limit on the number of threads for better performance in sorting and output. This is helpful if the RAM data is scattered, i.e. not cache friendly.

greg7mdp commented 5 months ago

There is one more problem to resolve. That is for long strings or variable length keys.

Give me a couple days, I'll provide a solution for that (fixed size string which also supports long strings and is as memory efficient as possible)

greg7mdp commented 5 months ago

Hi @marioroy ,

I added a new type, string_cnt, which stores a short string inline if its length is at most 11 bytes and otherwise uses a pointer, so the program works for any string length but uses little memory for short strings; even for long strings it uses less space than std::string.

I also used a set instead of a map, storing the string_cnt directly, so that we know that it will use 16 bytes per entry for short strings.

Some other small changes and cleanup, I hope you don't mind.

see this

marioroy commented 5 months ago

Hi @greg7mdp,

Wow! I gave it a spin on this machine. This peaks at 6.9 GB, the lowest memory consumption thus far while supporting long strings.

$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.152 secs
map to vector          0.885 secs
vector stable sort     2.764 secs
write stdout           0.696 secs
total time            12.499 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

At your convenience, please make one tiny change to allow the parallel sort to run on all CPU threads, i.e. no limit: replace std::min(nthds, 32) with nthds.

$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties         8.104 secs
map to vector          0.894 secs
vector stable sort     1.859 secs   <--- improvement, okay to utilize all CPU threads
write stdout           0.723 secs
total time            11.581 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689

I will study string_cnt. The awesome part is not having to specify fixed-length or long strings. It's automatic.

greg7mdp commented 5 months ago

Hi @marioroy , I just made that update to remove the 32 thread limit.

marioroy commented 5 months ago

I reached out to eyepopslikeamosquito to let him know. He triggered my curiosity, in another thread, to make an attempt at the original Long List is Long challenge. This became a great opportunity for learning OpenMP using C++.

Thank you @greg7mdp, for the string_cnt solution.

marioroy commented 5 months ago

From string_cnt, "For larger strings, uses 16 bytes + strlen(string) + 1"

Recently, I came across a Stack Overflow thread. The top answer contains a string class involving a union.

class string {
public:
    // all 83 member functions
private:
    size_type m_size;
    union {
        class {
            // This is probably better designed as an array-like class
            std::unique_ptr<char[]> m_data;
            size_type m_capacity;
        } m_large;
        std::array<char, sizeof(m_large)> m_small;
    };
};

I will try a union in string_cnt. Ah, so that means cnt must come first in the structure. Is this not worth the effort? Curiosity wonders how much lower memory consumption can go :-)

struct string_cnt {
   uint32_t  cnt;
   union {
      char *    str;
      char      extra[4];
   };
   ...
};
marioroy commented 5 months ago

Never mind. A union will not work, because extra[3] is a boolean flag indicating whether there is a valid string pointer. For long strings, only 3 chars are wasted space. Have I misunderstood the clever string_cnt structure?

greg7mdp commented 5 months ago

Well, the current string_cnt already has its first 12 bytes functioning as a union equivalent to:

union {
   char *str;
   char immediate[12];
};

so the first 12 bytes contain either a pointer to a long string, or the immediate string.

There would be no saving if we made the immediate[12] an immediate[8] to be the same size as the pointer: the whole struct including the count would be 12 bytes, but when stored in a vector or a set the next struct would be aligned to an 8-byte boundary, so the 4 bytes apparently saved would be left unused.

And yes, we need a way to know whether the struct stores a pointer or an immediate string. I use the last byte extra[3], and when it is 0 (meaning immediate string), the zero can also serve as the null terminator of the string (since we don't store its size), so we can use the full 11 other bytes for useful characters.

greg7mdp commented 5 months ago

Pointers have to be aligned on a 8 byte boundary. So if you were to put the 4 byte cnt first, the next 4 bytes would be wasted because the union would have to start on an 8 byte boundary.
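
A small illustration of that alignment point (illustrative only):

#include <cstdint>

struct cnt_first {
   uint32_t cnt;          // bytes 0-3, followed by 4 bytes of padding
   union {                // the pointer forces 8-byte alignment,
      char* str;          // so the union starts at offset 8
      char  immediate[8];
   };
};
static_assert(sizeof(cnt_first) == 16, "nothing saved: padding eats the 4 bytes");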

greg7mdp commented 5 months ago

I don't think you can save more space in the struct itself. But in the case of long strings, if they are all the same size, you can use a custom allocator like boost::pool_allocator, but that would not work in the general case where the strings are of different sizes.

marioroy commented 5 months ago

Blessings and grace, @greg7mdp. Thank you for the extra clarity. The string_cnt struct is clever.

Pointers have to be aligned on a 8 byte boundary.

Just earlier, I tried (for fun) putting cnt first and wondered about the assertions complaining. Now I understand.

What if cnt contained both a flag (1 bit, indicating a pointer is used) and the count (31 bits), using masking? Would that help remove the unused gap for long strings? Doing so may improve sorting performance.

Note: For long strings, I really mean variable-length strings. Thanks, boost::pool_allocator will not work.

greg7mdp commented 5 months ago

If you want the same struct to work for both long and short strings, as we do I think, it has to contain at least a pointer (8 bytes) + a count.

If the count was never greater than 8, we could store it in the 3 lower bits of the pointer (which are always 0), so the pointer + count would fit in 8 bytes, so we could have either a 7 byte immediate string + count, or a pointer + count.
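
Sketched, that low-bit tagging trick would look something like this (illustrative only; string_cnt does not actually do this):

#include <cstdint>

// new[]'d buffers are at least 8-byte aligned, so bits 0-2 of the
// pointer are zero and can carry a small count in the range 0-7.
inline std::uintptr_t pack(const char* p, unsigned cnt) {
   return reinterpret_cast<std::uintptr_t>(p) | (cnt & 0x7u);
}
inline const char* ptr_of(std::uintptr_t v) {
   return reinterpret_cast<const char*>(v & ~std::uintptr_t{7});
}
inline unsigned cnt_of(std::uintptr_t v) { return unsigned(v & 7u); }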

If the count can be greater than 8, then we need more than 8 bytes, so 16 bytes is the minimum our struct will occupy (for alignment), unless we are even more tricky by moving the pointers to an aligned address before dereferencing them.

But honestly, I don't see how you can save more than a few bytes, and then also you reduce the size of immediate strings that are supported.

marioroy commented 5 months ago

Please disregard. So... "Instead of extra[3], store the flag (1 bit, indicating whether a pointer is used) in the cnt variable; 31 bits is plenty big for the count" will result in no savings?

greg7mdp commented 5 months ago

No problem, it is not obvious to understand if C++ is not your main language for sure.

marioroy commented 5 months ago

string_cnt is clever. My apologies for my ramblings.

I see it clearly, right here :)

   string_cnt(string_cnt&& o) noexcept {
      if (o.extra[3]) {
         str = o.str;
         o.str = nullptr;
         extra[3] = 1;
      } else {
         std::strcpy((char *)(&str), o.get());
         extra[3] = 0;
      }
      cnt = o.cnt;
   }
greg7mdp commented 5 months ago

This is a move constructor, so this is stealing the data from o. If o has a pointer to a long string, we can just copy the pointer, note that we have a pointer in extra[3], and set o.str to null so o's destructor will not free the memory.

If o contains a small string, we just copy it and mark that we have a small string as well.

I could add some comments.

greg7mdp commented 5 months ago

I think I have an idea to make get_properties faster. I'll try to check something tomorrow. This is a fun problem!

marioroy commented 5 months ago

I understand your string_cnt fully now. Thank you. The magic lies in the set method: the strcpy may span the two variables str and extra, using up to 3 chars of extra (the 4th char is a flag).

         auto len = std::strlen(s);
         if (len >= buffsz) {
            str = new char[len+1];
            std::strcpy(str, s);
            extra[3] = 1;
         } else {
            std::strcpy((char *)(&str), s);
            extra[3] = 0;
         }

My favorite line is std::strcpy((char *)(&str), s): union-like behavior across str and extra (3 chars max), thus 11 chars max. There's more cleverness :-) The extra[3] byte is dual-purpose: a flag (indicating a pointer) or the null terminator for an 11-char small string.

And the importance of o.str = nullptr in the move constructor, after stealing the data.
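
To summarize the 16-byte layout described above (descriptive comments only):

// string_cnt byte layout (16 bytes total):
//
//   bytes  0-7   str         heap pointer, or the first 8 inline chars
//   bytes  8-10  extra[0-2]  up to 3 more inline chars
//   byte   11    extra[3]    1 -> str is a heap pointer
//                            0 -> inline string; this zero byte doubles
//                                 as its null terminator (11 chars max)
//   bytes 12-15  cnt         the count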

marioroy commented 5 months ago

Your beautification of the include directives put me to work :-) For consistency reasons, I beautified the include directives as well. I created a thread at Perl Monks to let folks know: https://perlmonks.org/?node_id=11158711

I brought back llil4umap (unordered_map variant). Your tip for releasing memory immediately resolved the long cleanup delay in that one. So, the umap version is nice to have for comparing the standard container vs the powerful phmap library.

marioroy commented 5 months ago

@greg7mdp Your solution is superb. Thanks many times over, I learned a lot. The string_cnt struct boggles my mind with its clever design. You helped bring back sanity to my life :-) Now I no longer think about the Long List is Long challenge.

Blessings and grace,

marioroy commented 5 months ago

@greg7mdp There is extra global cleanup time when exiting the llil4map + string_cnt program.

It dawned on me to do a test involving mixed lengths. I created long?? input files containing 6-char or 15-char keys, basically alternating the arguments 200 12 1 and 200 3 1 to gen-llil.pl.

# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
   perl gen-llil.pl long$n 200 12 1
   perl shuffle.pl long$n >1; mv 1 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
   perl gen-llil.pl long$n 200 3 1
   perl shuffle.pl long$n >2; mv 2 long$n
done &

# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
   perl gen-llil.pl long$n 200 12 1
   perl shuffle.pl long$n >3; mv 3 long$n
done &

# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
   perl gen-llil.pl long$n 200 3 1
   perl shuffle.pl long$n >4; mv 4 long$n
done &

wait

For this round of testing, the UNIX time is captured to detect any extra global/GC cleanup time. That is the case for string_cnt.

1. llil4map_greg   - Greg's llil4map version with string_cnt
2. llil4map_long   - llil4map with the MAX_STR_LEN_L line commented out
3. llil4map_fix16  - llil4map with MAX_STR_LEN_L 16

llil4map_greg

$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.370 secs
map to vector          1.670 secs
vector stable sort     9.336 secs
write stdout           3.929 secs
total time            27.307 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m43.433s     <--- 16.126 seconds extra
user    23m22.291s
sys     0m42.527s

llil4map_long

WARNING: Do not run llil4map_long on the full set (it peaks at 32 GB); process in/longa* in/longa* in/longa* (the "a" suffix) instead.

$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        10.232 secs
map to vector          3.249 secs
vector stable sort     4.737 secs
write stdout           1.340 secs
total time            19.560 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m20.269s
user    13m21.889s
sys     0m24.138s

llil4map_fix16

$ time ./llil4map_fix16 in/long* in/long* in/long* | cksum
llil4map (fixed string length=16) start
use OpenMP
use boost sort
get properties         8.672 secs
map to vector          1.553 secs
vector stable sort     1.931 secs
write stdout           1.476 secs
total time            13.633 secs
    count lines     970195200
    count unique    295748977
3333632426 4307239210

real    0m13.785s
user    10m23.834s
sys     0m13.176s
greg7mdp commented 5 months ago

Interesting!

What happens is that when you use std::string, the GNU standard library uses the same trick I did and does not allocate memory for strings up to 15 characters long (it is called the short string optimization or sso).

So in your test, with the mix of 6-character and 15-character strings, you didn't allocate memory for strings whether you commented out MAX_STR_LEN_L or not.

However, using string_cnt, that same optimization is limited to 11 bytes.

Probably if you tried with a mix of 6/11 characters, or 6/17 characters, the difference would be smaller.

The aim with string_cnt was to reduce memory usage. Indeed, for string sizes from 12 to 15 characters it will be slower.
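
A quick way to probe an implementation's sso capacity (on 64-bit libstdc++ this typically prints 15, i.e. strings up to 15 chars avoid heap allocation):

#include <iostream>
#include <string>

int main() {
   std::cout << std::string().capacity() << '\n';   // typically 15 on libstdc++
}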

marioroy commented 5 months ago

@greg7mdp Thank you for the clarity about std::string sso. I wanted the testing to be an apples-to-apples comparison.

What happens is that when you use std::string ... (it is called the short string optimization or sso).

I will re-create the input with 19-char keys (to exceed the sso limit) by passing alternating 200 16 1 and 200 3 1 arguments to gen-llil.pl. Well, to see if std::string does the same thing, i.e. a long GC or global cleanup when the program exits. That was the idea of the testing: check overall behavior for dynamic memory allocation.

std::string does the same thing; long exit time due to global cleanup.

llil4map_greg

$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        12.440 secs
map to vector          1.656 secs
vector stable sort     9.348 secs
write stdout           4.056 secs
total time            27.502 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

real    0m43.777s   <--- 16.275s of this time is from GC/global cleanup
user    23m27.969s
sys     0m43.282s

llil4map_long

$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties        13.111 secs
map to vector          3.153 secs
vector stable sort    13.529 secs
write stdout           3.055 secs
total time            32.849 secs
    count lines     970195200
    count unique    295755152
29612263 5038456270

real    0m49.815s   <--- 16.966s of this time is from GC/global cleanup
user    20m59.538s
sys     0m59.482s

string_cnt shines, consuming less memory.

llil4map_greg   peak 17.2 GB
llil4map_long   peak > 36 GB
marioroy commented 5 months ago

std::string and string_cnt, once past the sso limit, result in significantly longer exit times due to global cleanup. If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability. A key size of 11 chars is quite small. Chuma, the OP in the LLiL PerlMonks thread, did not provide us the max key length.

@greg7mdp Akin to the { sha1sum, sha224sum, sha256sum, sha384sum, sha512sum } naming, one can build llil4map supporting fixed-length modes { llil4map_12, llil4map_16, llil4map_20, llil4map_24, llil4map_30 }, plus llil4map_dyn (using string_cnt).

The Chuma LLiL challenge is solved. :-)

greg7mdp commented 5 months ago

If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability.

It is indeed possible. I previously modified string_cnt to use std::string_view, and here is a version which is templatized to support various sso sizes:

#include <cassert>
#include <compare>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <new>
#include <string_view>

// ---------------------------------------------------------------------------------------------
// Stores a string + a count
// For strings up to N-1 bytes, total space used is N + 4 bytes
// For larger strings, uses N+4 bytes + strlen(string) + 1
//
// invariants
//    if  extra[mark_idx], str is a valid string pointer
//    if !extra[mark_idx], the N bytes starting at (const char *)(&str) store a null_terminated string
// ---------------------------------------------------------------------------------------------
template<size_t N>
struct string_cnt_t {
   using uint_t = uint32_t;

   static_assert(N >= 12);
   static constexpr size_t extra_sz = N - sizeof(char*);
   static constexpr size_t mark_idx = extra_sz - 1;

   char *   str;
   char     extra[extra_sz];
   uint_t   cnt;

   static constexpr size_t buffsz = sizeof(str) + sizeof(extra);

   string_cnt_t() : str{nullptr}, extra{0}, cnt{0} {}

   string_cnt_t(std::string_view s, uint_t c = 0) : str(nullptr), cnt(c) { set(s); }

   ~string_cnt_t() { free(); }

   string_cnt_t(const string_cnt_t& o) : str{nullptr}, extra{0}, cnt{o.cnt} {
      set(o.get());
   }

   string_cnt_t(string_cnt_t&& o) noexcept {
      if (o.extra[mark_idx]) {
         str = o.str;
         o.str = nullptr;
         extra[mark_idx] = 1;
      } else {
         std::strcpy((char *)(&str), o.get());
         extra[mark_idx] = 0;
      }
      cnt = o.cnt;
   }

   string_cnt_t& operator=(const string_cnt_t& o) {
      free();
      set(o.get());
      cnt = o.cnt;
      return *this;
   }

   string_cnt_t& operator=(string_cnt_t&& o) noexcept {
      free();
      new (this) string_cnt_t(std::move(o));
      return *this;
   }

   std::strong_ordering operator<=>(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) <=> 0; }
   bool operator==(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) == 0; }

   std::size_t hash() const {
      auto s = get();
      std::string_view sv {s};
      return std::hash<std::string_view>()(sv);
   }

   const char *get() const { return extra[mark_idx] ? str : (const char *)(&str); }

private:
   void free() { if (extra[mark_idx]) { delete [] str; str = nullptr; } }

   void set(std::string_view s) {
      static_assert(buffsz == N);
      static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
      static_assert(sizeof(string_cnt_t) >= N + sizeof(uint_t));

      assert(!extra[mark_idx] || !str);
      if (s.empty())
         std::memset(&str, 0, buffsz);
      else {
         auto len = s.size();
         if (len >= buffsz) {
            str = new char[len+1];
            std::memcpy(str, s.data(), len);
            str[len] = 0;
            extra[mark_idx] = 1;
         } else {
            std::memcpy((char *)(&str), s.data(), len);
            ((char *)&str)[len] = 0;
            extra[mark_idx] = 0;
         }
      }
   }

   void set(const char *s) {
      set(std::string_view{s});
   }
};

using string_cnt = string_cnt_t<12>;

namespace std {
   template<> struct hash<string_cnt> {
      std::size_t operator()(const string_cnt& v) const noexcept { return v.hash(); };
   };
}
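
With the template parameter (and the generalized static_asserts above), the mixed 6/15-char test could presumably use a larger inline buffer, for example (hypothetical instantiation; sizeof grows to 24 after alignment):

// 16 inline bytes: strings up to 15 chars stay inline, like libstdc++'s sso.
using string_cnt16 = string_cnt_t<16>;

namespace std {
   template<> struct hash<string_cnt16> {
      std::size_t operator()(const string_cnt16& v) const noexcept { return v.hash(); };
   };
}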
greg7mdp commented 5 months ago

PS: I got get_properties to be faster... it takes 8.8s on my machine with the 6-character strings.