[Closed] marioroy closed this issue 5 months ago
Thanks for the great explanation @marioroy .
Can you do the reset inside the with_submap, as in:
for (size_t i = 0; i < map.subcnt(); ++i) {
map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
auto it = I[i];
for (auto& x : set)
*it++ = std::make_pair(std::move(x.first), x.second);
set = map_str_int_type::EmbeddedSet(); // <------------------------------
});
}
Thanks. Per your suggestion, I removed const and inserted the reset line.
for (size_t i = 0; i < map.subcnt(); ++i) {
map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
auto it = I[i];
for (auto& x : set)
*it++ = std::make_pair(std::move(x.first), x.second);
set = map_str_int_type::EmbeddedSet();
});
}
The clang++ compiler fails.
In file included from llil4map.cc:51:
./parallel-hashmap/parallel_hashmap/phmap.h:3431:9: error: no matching function for call to object of type '(lambda at llil4map.cc:415:32)'
3431 | fCallback(set);
| ^~~~~~~~~
llil4map.cc:415:17: note: in instantiation of function template specialization 'phmap::priv::parallel_hash_set<12, phmap::priv::raw_hash_set, spinlock_mutex, phmap::priv::FlatHashMapPolicy<std::basic_string<char>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char>, unsigned int>>>::with_submap<(lambda at llil4map.cc:415:32)>' requested here
415 | map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
| ^
llil4map.cc:415:32: note: candidate function not viable: 1st argument ('const EmbeddedSet' (aka 'const raw_hash_set<phmap::priv::FlatHashMapPolicy<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>, phmap::priv::StringHashEqT<char>::Hash, phmap::priv::StringHashEqT<char>::Eq, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, unsigned int>>>')) would lose const qualifier
415 | map.with_submap(i, [&](map_str_int_type::EmbeddedSet& set) {
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Sorry, my bad.
After the loop, and instead of map.reset_inner(i), use the following, which is equivalent to your reset_inner:
map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
set = map_str_int_type::EmbeddedSet();
});
Thank you, @greg7mdp. Your suggestion works.
for (size_t i = 0; i < map.subcnt(); ++i) {
map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
auto it = I[i];
for (auto& x : set)
*it++ = std::make_pair(std::move(x.first), x.second);
});
// The sub map is no longer needed. Reset the set to reclaim memory.
map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
set = map_str_int_type::EmbeddedSet();
});
}
Notice "map to vector" completing in 2 seconds. Previously, this took 3 seconds.
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.946 secs
map to vector 2.018 secs <-- Yeah! Thank you!
vector stable sort 2.960 secs
write stdout 0.589 secs
total time 14.514 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Hey @marioroy , I just tried your llil4map.cc myself. Would you mind if I added it to phmap as an example (with your attribution comment at the top of llil4map.cc, of course)?
Which parameters did you use for gen-llil.pl? I used the ones from the comment and I got:
❯ NUM_THREADS=3 ../ex_lill4map big1.txt big2.txt big3.txt >out.txt
llil4map (fixed string length=12) start
use OpenMP
don't use boost sort
get properties 0.695 secs
map to vector 0.093 secs
vector stable sort 1.316 secs
write stdout 0.061 secs
total time 2.166 secs
count lines 10545600
count unique 10368458
Wow! I do not mind. llil4map.cc is final.
Thank you for your help, @greg7mdp: (1) some time ago, you found an off-by-one error in chunking; (2) at the time, you helped make "map to vector" run in parallel; (3) and now, you helped ensure minimum memory consumption. Blessings and grace.
That's it, just as written in the comments for generating the input files.
perl gen-llil.pl big1.txt 200 3 1
perl gen-llil.pl big2.txt 200 3 1
perl gen-llil.pl big3.txt 200 3 1
I found my script for making 92 input files + shuffled (requires 2.8 GB). On my machine (where llil4map.cc resides), in is a symbolic link to /data/input. Change the path accordingly.
#!/bin/bash
# Script for generating input files for llil4map.cc.
mkdir -p /data/input
cp gen-llil.pl shuffle.pl /data/input
cd /data/input
# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >1; mv 1 big$n
done &
# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >2; mv 2 big$n
done &
# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >3; mv 3 big$n
done &
# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
perl gen-llil.pl big$n 200 3 1
perl shuffle.pl big$n >4; mv 4 big$n
done &
wait
The default mode in llil4map.cc is fixed string length = 12. Commenting out line 119 switches to variable-length keys (using std::basic_string). With the line commented out, processing the 92 input files requires 16 GB; fixed-length keys require much less.
// #define MAX_STR_LEN_L (size_t) 12
My PerlMonks friend eyepopslikeamosquito helped me learn C++.
I enjoy parallel processing. So naturally, I tried: (1) chunk IO in get properties using spinlock_mutex; (2) map to vector, also in parallel; (3) parallel sort; (4) finally, parallel but orderly output.
Wow! I do not mind. llil4map.cc is final.
Thanks, I'll add it as an example then, it is a great example!
Added! Looks like you did an amazing job learning C++!
I saw a small perf. issue in llil4map.cc, where using the non-mutable with_submap does not allow std::move-ing the string out of the map. This could be an issue if long strings are used (and the fixed string is not).
Here is my updated version.
Thanks. I also learned from reading phmap_fwd_decl.h and your examples.
I tried your updated version and added a line to reset the set, which is especially beneficial for long strings. Without it, "map to vector" takes 3 seconds for long strings and memory peaks near 18.4 GB.
for (size_t i = 0; i < map.subcnt(); ++i) {
map.with_submap_m(i, [&](map_str_int_type::EmbeddedSet& set) {
auto it = I[i];
for (auto& x : set)
*it++ = std::make_pair(std::move(const_cast<str_type&>(x.first)), x.second);
// reset the set (no longer needed) to reclaim memory early
set = map_str_int_type::EmbeddedSet();
});
}
I respect your decision if you prefer leaving the line commented out, i.e. long strings mode (line 119).
// #define MAX_STR_LEN_L (size_t) 12
Or default to fixed-length by uncommenting the line.
#define MAX_STR_LEN_L (size_t) 12
Here too, I respect your decision, disabled or enabled (line 46). I enabled it locally to validate the parallel sort. Disabled, "vector stable sort" takes 58.0 seconds (long strings mode) or 21.5 seconds (fixed-length mode).
#define USE_BOOST_PARALLEL_SORT 1
Testing was done on a 32 core (64 logical threads), AMD Ryzen Threadripper 3970X machine, DDR4 memory 3600 MHz.
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map start
use OpenMP
use boost sort without reset with reset
get properties 9.035 secs 9.004 secs
map to vector 3.068 secs 2.010 secs <--- reset, minimum peak memory consumption
vector stable sort 2.972 secs 2.988 secs
write stdout 0.639 secs 0.563 secs
total time 15.716 secs 14.567 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Thanks @marioroy , I reverted the defaults to the way you had them, and also added the "reset the set" change.
On my system (AMD Ryzen 9 7950X 16-core processor × 32 threads, with DDR5 memory), I get the following, which is not the same as yours:
~/gi/g/parallel-hashmap/build_gcc12_rel/llil_utils master *1 +1 !2 ?4 ❯ ./run_llil4map 19s
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 12.296 secs
map to vector 1.300 secs
vector stable sort 2.133 secs
write stdout 3.247 secs
total time 18.978 secs
count lines 970195200
count unique 200485780
3341100189 1811165988
~/gi/g/parallel-hashmap/build_gcc12_rel/llil_utils master *1 +1 !2 ?4 ❯ cat ./run_llil4map 19s
#!/bin/bash
cd tmp
../../ex_llil4map big* big* big* | cksum
That is great performance for the AMD 7950X CPU. The "write stdout" taking 3.2 seconds is likely due to the L3 cache size difference between the two CPUs. I am providing the following output because the 7950X CPU is amazing; in other words, my system is unable to reach a 2x faster total time despite having 64 logical threads. Building with clang++ runs faster for me.
It is impressive that "get properties" for the 7950X CPU completes in 12.3 seconds. Wow!
g++
$ g++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 7.496 secs
map to vector 0.965 secs
vector stable sort 1.740 secs
write stdout 0.657 secs
total time 10.859 secs
count lines 970195200
count unique 200483043
clang++
$ clang++ -o llil4map -std=c++20 -fopenmp -Wall -O3 llil4map.cc -I./parallel-hashmap
$ ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 7.496 secs
map to vector 0.791 secs
vector stable sort 1.137 secs
write stdout 0.591 secs
total time 10.017 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
Last year, I tried custom memory allocation. I was trying to reduce peak memory consumption to no avail, because it greatly impacted "write stdout". Thanks for the reset suggestion. Memory consumption no longer peaks high.
It is weird that I don't get the same result for count unique as you do. Maybe a difference in gen-llil.pl?
No, not weird at all: it is because of the random generator script. It is normal for the unique count to differ per machine or whenever the input files are refreshed. If you run gen_files again, the unique count will likely differ.
Do you want to add to .gitignore, the tmp dir?
/examples/llil_utils/tmp
Sure, will do, even though in my case I generate the files in my build directory.
I tried with clang-16
and indeed it is also faster on my system, esp the write stdout
.
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties 12.873 secs
map to vector 1.370 secs
vector stable sort 1.862 secs
write stdout 1.177 secs
total time 17.284 secs
count lines 970195200
count unique 200478308
Sure, will do, even though in my case I generate the files in my build directory.
Oh, I wrongly assumed the location.
I tried with clang-16 and indeed it is also faster on my system, esp the write stdout.
That is amazing.
I checked and you have the same defaults. Why is defaulting to fixed-length important, out of mindfulness for users? So the application can run with less memory consumption: no surprises on machines with 8 ~ 16 GB. The reset was the missing piece.
#define USE_BOOST_PARALLEL_SORT 1
#define MAX_STR_LEN_L (size_t) 12
Folks may run with a partial list, not all 92 files: for example, just a (i.e. biga* biga* biga*) or [ab] (i.e. big[ab]* big[ab]* big[ab]*).
I have a question about the data files. They contain words followed by a number. Am I right to assume this number is always positive?
I'll see if I can make get_properties a little bit faster, this is fun :-)
Also, I added a couple of minor simplifications in the code; see this
Am I right to assume this number is always positive?
Yes, always positive is certain.
this is fun :-)
It's mind boggling. We saw that clang++ is faster than g++. Also, -std=c++20 is faster than -std=c++17. Why did I stop at 92 input files? Running big* big* big* handles approximately 1 billion lines and 200 million unique keys. Testing big* once is mostly insertions into the map. Testing big* big* big* is a combination of insertions and updates. Testing smaller input is possible simply with the suffix biga* biga* biga*. Then, there is fixed-length mode and long strings. I'm grateful for the reset.
The application runs in parallel at each level: (1) input, (2) map to vector, (3) sorting, and (4) output. Crazy fast. The coolness factor is the phmap library managing the many sub maps internally (1 << 12 = 4096). Thank you for allowing a custom mutex class. The spinlock_mutex class improves parallel insertions/updates.
Randomness is beautiful. With the many sub maps, it's less likely for threads to be using the same sub map/mutex in a given time window or fraction of a second. If they do, no problem. The spinlock_mutex class has low overhead.
Also I added a couple minor simplifications in the code
I had no idea about set.size(). I appreciate learning from a master. Ah, the clear/swap lines are not needed, because the map construction is inside a block and the reset works so well. Oh, the sorting simplifications. That's amazing.
Thank you, @greg7mdp.
I'll play with it some more later. It is an interesting piece of code for sure!
Hey @marioroy , just wondering: is your solution the fastest one that exists for this problem?
is your solution the fastest one that exists for this problem?
Hi, @greg7mdp I am not aware of anything faster.
The llil4emh emhash demonstration is the fastest for fixed-length keys. For long strings or variable-length keys, llil4map and llil4emh perform similarly. The map demonstrations scale to many CPU cores with ease. These are the fastest beyond 12 CPU threads.
There is also the llil4vec implementation. This appends to a vector, no maps, so it may consume a lot of memory, particularly with repeating keys. It may run faster up to some number of cores, but does not scale well beyond that. If daring to run llil4vec, the max you can do with 32 GB is big* big* big*, and only in fixed-length=12 mode. It peaks near 28 GB, so be careful; it may cause swapping.
llil4map peaks less than 8 GB for 200 million keys, fixed-length=12.
I created a backup plan in case the number of unique keys were to exceed memory capacity :-) Run with NUM_MAPS=128 maximum, default 32.
Wow, thanks for all the information! You seem to have spent a lot of time on this. Is this something you are doing for work, or just for fun?
Is this something you are doing for work, or just for fun?
For fun, actually. I wondered if it was possible to solve the challenge. A Perl Monk, "eyepopslikeamosquito", tried the vector approach, stating that memory is cheap. My solutions took some time (I am new to C++). I chose the path that preferred low memory consumption.
Your phmap C++ library made this possible (emhash as well). OpenMP made parallel possible. Chunk IO and the spinlock mutex enable scaling to many CPU cores. The reset capability ensures low memory consumption during map to vector; I had no idea there was a way to do so. Thank you for your tip. Boost's block_indirect_sort allows many threads and is fast. OpenMP's ordered directive works, allowing parallel, yet orderly, output.
There is one more problem to resolve. That is for long strings or variable length keys.
The std::string (std::basic_string) container is not memory efficient, unfortunately. In long strings mode, llil4map peaks at 17.8 GB for the same input files (previously, 18.4 GB without the reset). That's horrendous and more than double fixed-length mode. It is much worse for the llil4vec vector demonstration. That one peaks at 58 GB, requiring a VM or machine with 64 GB.
For comparison to llil4map, the llil4emh emhash implementation peaks at 20.0 GB in long strings mode. It computes the hash_value once only and stores it with the key.
I made a memory efficient version, llil4map2, using a vector of pointers to the phmap key-value pairs. That meant having to construct the phmap outside the block, and no reset. This results in sorting taking longer, possibly due to CPU cache misses. However, memory consumption is significantly less.
200mil unique keys llil4map llil4map2
------------------ -------- --------
fixed-length=12 ~ 8.0 GB 6.0 GB
long-strings 17.8 GB 12.3 GB
I removed the limit on the number of threads for better sorting and output performance. This is helpful if RAM data is scattered, not cache friendly.
There is one more problem to resolve. That is for long strings or variable length keys.
Give me a couple of days, and I'll provide a solution for that (a fixed size string which also supports long strings and is as memory efficient as possible).
Hi @marioroy ,
I added a new type string_cnt which stores a short string inline if it fits in 11 bytes, otherwise uses a pointer, so the program works for any string length but uses little memory for short strings; even for long strings it uses less space than std::string.
I also used a set instead of a map, storing the string_cnt directly, so that we know it will use 16 bytes per entry for short strings.
Some other small changes and cleanup; I hope you don't mind. See this
Hi @greg7mdp,
Wow! I gave it a spin on this machine. This peaks at 6.9 GB. The lowest memory consumption thus far for supporting long strings.
$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.152 secs
map to vector 0.885 secs
vector stable sort 2.764 secs
write stdout 0.696 secs
total time 12.499 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
At your convenience, please make one tiny change to allow the parallel sort to run on all CPU threads, i.e. no limit. Replace std::min(nthds, 32) with nthds.
$ ./llil4mapg big* big* big* | cksum
llil4map start
use OpenMP
use boost sort
get properties 8.104 secs
map to vector 0.894 secs
vector stable sort 1.859 secs <--- improvement, okay to utilize all CPU threads
write stdout 0.723 secs
total time 11.581 secs
count lines 970195200
count unique 200483043
2057246516 1811140689
I will study string_cnt. The awesome part is not having to specify fixed-length or long strings. It's automatic.
Hi @marioroy , I just made that update to remove the 32 thread limit.
I reached out to eyepopslikeamosquito to let him know. He triggered my curiosity in another thread to make an attempt at the original Long List is Long challenge. This became a great opportunity for learning OpenMP using C++.
Thank you @greg7mdp, for the string_cnt solution.
From string_cnt: "For larger strings, uses 16 bytes + strlen(string) + 1"
Recently, I came across a Stack Overflow thread. The top answer contains a string class involving a union.
class string {
public:
// all 83 member functions
private:
size_type m_size;
union {
class {
// This is probably better designed as an array-like class
std::unique_ptr<char[]> m_data;
size_type m_capacity;
} m_large;
std::array<char, sizeof(m_large)> m_small;
};
};
I will try a union in string_cnt. Ah, so that means cnt must come first in the structure. Is this not worth the effort? Curiosity wonders how much memory consumption would be reduced :-)
struct string_cnt {
uint32_t cnt;
union {
char * str;
char extra[4];
};
...
};
Never mind. A union will not work, because extra[3] is a boolean flag indicating whether there is a valid string pointer. For long strings, only 3 chars are wasted space. Have I misunderstood the clever string_cnt structure?
Well, the current string_cnt already has its first 12 bytes functioning as a union equivalent to:
union {
char *str;
char immediate[12];
};
so the first 12 bytes contain either a pointer to a long string, or the immediate string.
There would be no saving if we made the immediate[12] an immediate[8] to be the same size as the pointer: the whole struct including the count would be 12 bytes, but when stored in a vector or a set the next struct would be aligned to an 8 byte boundary, so the 4 bytes apparently saved would be left unused.
And yes, we need a way to know whether the struct stores a pointer or an immediate string. I use the last byte, extra[3], and when it is 0 (meaning immediate string), the zero can also serve as the null terminator of the string (since we don't store its size), so we can use the full 11 other bytes for useful characters.
Pointers have to be aligned on an 8 byte boundary. So if you were to put the 4 byte cnt first, the next 4 bytes would be wasted, because the union would have to start on an 8 byte boundary.
I don't think you can save more space in the struct itself. But in the case of long strings, if they are all the same size, you could use a custom allocator like boost::pool_allocator; that would not work in the general case where the strings are of different sizes.
Blessings and grace, @greg7mdp. Thank you for the extra clarity. The string_cnt struct is clever.
Pointers have to be aligned on an 8 byte boundary.
Just earlier, I tried (for fun) putting cnt first and wondered about the assertions complaining. Now I understand.
What if cnt contained both a flag (1 bit, indicating a pointer is used) and the count (31 bits), using masking? Would that help remove the unused gap for long strings? Doing so may improve sorting performance.
Note: by long strings, I really mean variable length strings. Thanks; boost::pool_allocator will not work.
If you want the same struct to work for both long and short strings, as we do I think, it has to contain at least a pointer (8 bytes) + a count.
If the count was never greater than 8, we could store it in the 3 lower bits of the pointer (which are always 0), so the pointer + count would fit in 8 bytes, so we could have either a 7 byte immediate string + count, or a pointer + count.
If the count can be greater than 8, then we need more than 8 bytes, so 16 bytes is the minimum our struct will occupy (for alignment), unless we are even more tricky by moving the pointers to an aligned address before dereferencing them.
But honestly, I don't see how you can save more than a few bytes, and then also you reduce the size of immediate strings that are supported.
Please, disregard. So... "Instead of extra[3], store the flag (1 bit) indicating whether a pointer is used in the cnt variable; 31 bits is plenty big for the count" would result in no savings?
No problem, it is not obvious to understand if C++ is not your main language for sure.
string_cnt is clever. My apologies for my ramblings.
I see it clearly, right here :)
string_cnt(string_cnt&& o) noexcept {
if (o.extra[3]) {
str = o.str;
o.str = nullptr;
extra[3] = 1;
} else {
std::strcpy((char *)(&str), o.get());
extra[3] = 0;
}
cnt = o.cnt;
}
This is a move constructor, so this is stealing the data from o.
If o has a pointer to a long string, we can just copy the pointer, note that we have a pointer in extra[3], and set o.str to null so o's destructor will not free the memory.
If o contains a small string, we just copy it and mark that we have a small string as well.
I could add some comments.
I think I have an idea to make get_properties faster. I'll try to check something tomorrow. This is a fun problem!
I understand your string_cnt fully. Thank you. The magic lies in the set method. The strcpy may span the two variables str and extra, up to 3 chars into extra (the 4th char is a flag).
auto len = std::strlen(s);
if (len >= buffsz) {
str = new char[len+1];
std::strcpy(str, s);
extra[3] = 1;
} else {
std::strcpy((char *)(&str), s);
extra[3] = 0;
}
My favorite line is std::strcpy((char *)(&str), s): union-like behavior across str and extra (3 chars max), thus 11 chars max. There's more cleverness :-) The extra[3] byte is dual-purpose: either a flag (indicating a pointer) or the null terminator of an 11-char small string.
And the importance of o.str = nullptr in the move constructor, after stealing the data.
Your beautification of the include directives put me to work :-) For consistency reasons, I thought to beautify the include directives as well. I created a thread at Perl Monks to let folks know: https://perlmonks.org/?node_id=11158711
I brought back llil4umap (the unordered_map variant). Your tip for releasing memory immediately resolved the long cleanup delay in that one. So, the umap version is nice to have for comparing the standard container vs. the powerful phmap library.
@greg7mdp Your solution is superb. Many thanks, I learned a lot. The string_cnt struct boggles my mind with its clever design. You helped bring back sanity to my life. :-) Now, I no longer think about the Long List is Long challenge.
Blessings and grace,
@greg7mdp There is extra global cleanup time when exiting the llil4map + string_cnt program.
It dawned on me to do testing involving mixed lengths. I created long?? input files containing 6-char or 15-char keys, basically alternating 200 12 1 or 200 3 1 arguments to gen-llil.pl.
# create 26 random files
for n in $(perl -le "print for 'aa'..'az'"); do
perl gen-llil.pl long$n 200 12 1
perl shuffle.pl long$n >1; mv 1 long$n
done &
# create 26 random files
for n in $(perl -le "print for 'ba'..'bz'"); do
perl gen-llil.pl long$n 200 3 1
perl shuffle.pl long$n >2; mv 2 long$n
done &
# create 26 random files
for n in $(perl -le "print for 'ca'..'cz'"); do
perl gen-llil.pl long$n 200 12 1
perl shuffle.pl long$n >3; mv 3 long$n
done &
# create 14 random files (total 92 files)
for n in $(perl -le "print for 'da'..'dn'"); do
perl gen-llil.pl long$n 200 3 1
perl shuffle.pl long$n >4; mv 4 long$n
done &
wait
For this round of testing, the UNIX time is captured to detect any extra global/gc cleanup time. That is the case for string_cnt.
1. llil4map_greg - Greg's llil4map version with string_cnt
2. llil4map_long - llil4map with the MAX_STR_LEN_L line commented out
3. llil4map_fix16 - llil4map with MAX_STR_LEN_L 16
llil4map_greg
$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.370 secs
map to vector 1.670 secs
vector stable sort 9.336 secs
write stdout 3.929 secs
total time 27.307 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m43.433s <--- 16.126 seconds extra
user 23m22.291s
sys 0m42.527s
llil4map_long
WARNING: Do not run llil4map_long (peaks 32 GB); process longa* longa* longa* (a suffix) instead.
$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 10.232 secs
map to vector 3.249 secs
vector stable sort 4.737 secs
write stdout 1.340 secs
total time 19.560 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m20.269s
user 13m21.889s
sys 0m24.138s
llil4map_fix16
$ time ./llil4map_fix16 in/long* in/long* in/long* | cksum
llil4map (fixed string length=16) start
use OpenMP
use boost sort
get properties 8.672 secs
map to vector 1.553 secs
vector stable sort 1.931 secs
write stdout 1.476 secs
total time 13.633 secs
count lines 970195200
count unique 295748977
3333632426 4307239210
real 0m13.785s
user 10m23.834s
sys 0m13.176s
Interesting!
What happens is, when you use std::string, the GNU standard library uses the same trick I did and does not allocate memory for strings up to 15 characters long (this is called the short string optimization, or sso).
So in your test, with the mix of 6 character strings and 15 character strings, you didn't allocate memory for strings whether you commented MAX_STR_LEN_L or not.
However, using string_cnt, that same optimization is limited to 11 bytes.
Probably if you tried a mix of 6/11 characters, or 6/17 characters, the difference would be smaller.
The aim with string_cnt was to reduce memory usage; indeed, for string sizes from 12 to 15 characters it will be slower.
@greg7mdp Thank you for the clarity about std::string sso. I wanted the testing to be an apples-to-apples comparison.
What happens is when you use std::string ... (it is called the short string optimization or sso).
I will re-create the input files with length-19 keys (to exceed the sso limit), by passing alternating 200 16 1 or 200 3 1 arguments to gen-llil.pl. Well, to see if std::string does the same thing, i.e. a long GC or global cleanup when the program exits. That was the idea of the testing: checking the overall behavior of dynamic memory allocation.
std::string does the same thing: a long exit time due to global cleanup.
llil4map_greg
$ time ./llil4map_greg in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 12.440 secs
map to vector 1.656 secs
vector stable sort 9.348 secs
write stdout 4.056 secs
total time 27.502 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m43.777s <--- 16.275s of this time is from GC/global cleanup
user 23m27.969s
sys 0m43.282s
llil4map_long
$ time ./llil4map_long in/long* in/long* in/long* | cksum
llil4map start
use OpenMP
use boost sort
get properties 13.111 secs
map to vector 3.153 secs
vector stable sort 13.529 secs
write stdout 3.055 secs
total time 32.849 secs
count lines 970195200
count unique 295755152
29612263 5038456270
real 0m49.815s <--- 16.966s of this time is from GC/global cleanup
user 20m59.538s
sys 0m59.482s
string_cnt shines, consuming less memory.
llil4map_greg peak 17.2 GB
llil4map_long peak > 36 GB
std::string and string_cnt, once past the sso limit, result in significantly longer exit times due to global cleanup. If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability. A key size of 11 chars is quite small. Chuma, the OP in the LLiL PerlMonks thread, did not provide us the max key length.
@greg7mdp Akin to the { sha1sum, sha224sum, sha256sum, sha384sum, sha512sum } naming, one can build llil4map supporting fixed-length modes { llil4map_12, llil4map_16, llil4map_20, llil4map_24, llil4map_30 }, plus llil4map_dyn (using string_cnt).
The Chuma LLiL challenge is solved. :-)
If only string_cnt allowed one to specify the fixed-length limit, for MAX_STR_LEN_L-like capability.
It is indeed possible. I previously modified string_cnt to use std::string_view, and here is a version which is templatized to support various sso sizes:
// Includes needed to make this snippet self-contained (it is a fragment of a larger file):
#include <cassert>
#include <compare>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string_view>

// ---------------------------------------------------------------------------------------------
// Stores a string + a count
// For strings up to N-1 bytes, total space used is N + 4 bytes
// For larger strings, uses N+4 bytes + strlen(string) + 1
//
// invariants
// if extra[mark_idx], str is a valid string pointer
// if !extra[mark_idx], the N bytes starting at (const char *)(&str) store a null_terminated string
// ---------------------------------------------------------------------------------------------
template<size_t N>
struct string_cnt_t {
using uint_t = uint32_t;
static_assert(N >= 12);
static constexpr size_t extra_sz = N - sizeof(char*);
static constexpr size_t mark_idx = extra_sz - 1;
char * str;
char extra[extra_sz];
uint_t cnt;
static constexpr size_t buffsz = sizeof(str) + sizeof(extra);
string_cnt_t() : str{nullptr}, extra{0}, cnt{0} {}
string_cnt_t(std::string_view s, uint_t c = 0) : str(nullptr), extra{0}, cnt(c) { set(s); }
~string_cnt_t() { free(); }
string_cnt_t(const string_cnt_t& o) : str(nullptr), extra{0}, cnt(o.cnt) {
set(o.get());
}
string_cnt_t(string_cnt_t&& o) noexcept {
if (o.extra[mark_idx]) {
str = o.str;
o.str = nullptr;
extra[mark_idx] = 1;
} else {
std::strcpy((char *)(&str), o.get());
extra[mark_idx] = 0;
}
cnt = o.cnt;
}
string_cnt_t& operator=(const string_cnt_t& o) {
free();
set(o.get());
cnt = o.cnt;
return *this;
}
string_cnt_t& operator=(string_cnt_t&& o) noexcept {
free();
new (this) string_cnt_t(std::move(o));
return *this;
}
std::strong_ordering operator<=>(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) <=> 0; }
bool operator==(const string_cnt_t& o) const { return std::strcmp(get(), o.get()) == 0; }
std::size_t hash() const {
auto s = get();
std::string_view sv {s};
return std::hash<std::string_view>()(sv);
}
const char *get() const { return extra[mark_idx] ? str : (const char *)(&str); }
private:
void free() { if (extra[mark_idx]) { delete [] str; str = nullptr; } }
void set(std::string_view s) {
static_assert(buffsz == N); // generalized: the hard-coded 12 only held for N == 12
static_assert(offsetof(string_cnt_t, cnt) == (intptr_t)buffsz);
assert(!extra[mark_idx] || !str);
if (s.empty())
std::memset(&str, 0, buffsz);
else {
auto len = s.size();
if (len >= buffsz) {
str = new char[len+1];
std::memcpy(str, s.data(), len);
str[len] = 0;
extra[mark_idx] = 1;
} else {
std::memcpy((char *)(&str), s.data(), len);
((char *)&str)[len] = 0;
extra[mark_idx] = 0;
}
}
}
void set(const char *s) {
set(std::string_view{s});
}
};
using string_cnt = string_cnt_t<12>;
namespace std {
template<> struct hash<string_cnt> {
std::size_t operator()(const string_cnt& v) const noexcept { return v.hash(); };
};
}
PS: I got get_properties to be faster... takes 8.8s on my machine with the 6 character strings.
Hi, @greg7mdp

I tested variable length keys. It turns out the llil4map.cc demonstration consumes the most memory. The reason is being unable to clear or reset the inner sub map. Without this ability, memory consumption peaks at 18.4 GB; otherwise, 15.1 GB with the reset_inner capability. Here, we get ready to insert the key,value pairs into a vector.

I added the reset_inner method to phmap.h, locally on my machine. Great news! This change enables llil4map to reach llil4hmap performance when processing variable length keys. Previously, the llil4map demonstration ran slower. Being able to clear/reset a sub map early is beneficial when the sub map is no longer needed. In the C++ snippet above, parallel threads process each sub map individually, "map to vector" below. I prefer for the map memory consumption to decrease while the vector memory increases, and not to wait until the end to clear the map. No issues for fixed-length keys.

llil4map.cc
The sub maps are managed by the phmap library.

llil4hmap.cc
The sub maps are managed at the application level. Notice the similar times for "get properties" and "map to vector". The key hash_value is computed once only and stored with the key. I was curious at the time and left it this way.

Now, I understand why the two phmap demonstrations did not perform similarly before. You may like to know how far behind emhash is. The llil4map is 0.9 seconds apart with the reset_inner capability. Previously, 2.2 seconds apart in total time. Please, no worries about phmap vs. emhash. Lacking the ability to clear/reset the sub map was the reason for being greater than 2 seconds apart (for variable length keys).

llil4emh.cc
Here too, the sub maps are managed at the application level. Including the key hash_value stored with the key.