Closed ByteHamster closed 1 year ago
I just tried a few configurations (see below). By the way, there is a slight inconsistency in space usage reported between the STATS
output and my new getBits()
method: The text output of RecSplit uses builder.getBits()
. The actual space consumption is more close to descriptors.getBits()
, which is usually about 100 bits larger.
Configuration | new getBits() |
STATS output |
File size |
---|---|---|---|
l=4, b=100 | 2.22 | 2.20 | 2.22 |
l=4, b=2000 | 2.15 | 2.13 | 2.14 |
l=8, b=100 | 1.81 | 1.78 | 1.80 |
l=8, b=2000 | 1.73 | 1.71 | 1.72 |
Configuration | new getBits() |
STATS output |
File size |
---|---|---|---|
l=4, b=100 | 2.216 | 2.214 | 2.216 |
l=4, b=2000 | 2.133 | 2.131 | 2.133 |
l=8, b=100 | 1.791 | 1.788 | 1.791 |
l=8, b=2000 | 1.711 | 1.709 | 1.711 |
I'm trying to understand the discrepancy. Do you think it's because of the RecSplit data structure size (which is not included in builder.getBits())?
One potential problem I see with using builder.getBits() is that you don't get the rounding to the next word. On the other hand, if you are measuring the number of bits of the data structure in an algorithmic sense that rounding shouldn't be there.
One potential problem I see with using builder.getBits() is that you don't get the rounding to the next word.
In my experiments, builder.getBits() - descriptors.getBits()
is usually about 100 bits, so it is a little more than just rounding to the next word. Not sure what exactly is happening there. For large N, this becomes mostly irrelevant.
On the other hand, if you are measuring the number of bits of the data structure in an algorithmic sense that rounding shouldn't be there.
From a practical perspective, reporting the non-rounded version (builder.getBits()
) would require storing the number explicitly somewhere, because builder
is not available after construction is complete. That could look like cheating. Contrarily, the value of descriptors.getBits()
it is available in getBits()
.
Do you think it's because of the RecSplit data structure size (which is not included in builder.getBits())?
sizeof(RecSplit)
is 264 bytes / 2112 bits on my machine. Is it really a fair comparison with other data structures if one does not count the data structure size? Many other data structures in sux
, such as EliasFano, RankSelect and Vector do include their data structure size.
In my experiments,
builder.getBits() - descriptors.getBits()
is usually about 100 bits, so it is a little more than just rounding to the next word. Not sure what exactly is happening there. For large N, this becomes mostly irrelevant.
Neither am I...
From a practical perspective, reporting the non-rounded version (
builder.getBits()
) would require storing the number explicitly somewhere, becausebuilder
is not available after construction is complete. That could look like cheating. Contrarily, the value ofdescriptors.getBits()
it is available ingetBits()
.
From a practical viewpoint, all this discussion is irrelevant 🤷🏻♂️.
sizeof(RecSplit)
is 264 bytes / 2112 bits on my machine. Is it really a fair comparison with other data structures if one does not count the data structure size? Many other data structures insux
, such as EliasFano, RankSelect and Vector do include their data structure size.
Yes, but it includes data that is shared by all instances.
Yes, but it includes data that is shared by all instances.
Most shared data (like Golomb-Rice parameters or seeds) seems to be stored as static members or outside the class. Breaking down the space usage of the RecSplit data structure, it does not seem to include many shared things.
At least for large N
, the getBits
method that includes sizeof(RecSplit)
seems to be closer to the file size than the STATS
version that doesn't include sizeof(RecSplit)
.
How do you want to continue here, @vigna? Which variant do you prefer?
I need some time to look at this in detail, and I have a lot of classes now...
I'm sorry for the very late answer. This somehow slipped through. Yes, the RecSplit size should be counted in for coherence with the other data structures. The bitCount()
you propose, however, is a bit larger than necessary because you count the descriptor structure (not the data) twice. So,
size_t bitCount() { return ef.bitCountCumKeys() + ef.bitCountPosition() + descriptors.getBits() + 8 * sizeof(*this) - 8 * sizeof(descriptors); }
This is still a bit larger than serialized size, because one field is not serialized but deduced, but it is certainly in line with the results of bitCount()
in all other structures, and it gives correct information about the in-size memory. Thank you for making me notice this.
The STATS-displayed values are in fact counts of the number of bits that are necessary to represent a value, broken down by category. It is the only space that is dependent on the size of the data set. So after a million key or so the missing bits of the data structure differs from STATS by less than 0.1%. The paper was aiming at billions of keys, and our smallest dataset is 1M keys, so even considering the data structures the experimental data are correct. In any case, now there's an explicit item:
at 16:21:19 ➜ ./bin/recsplit_dump_8 enwiki-2022.ids 100 enwiki-2022.rs
Building...
Elias-Fano cumul sizes: 7.251121 bits/bucket
Elias-Fano cumul bits: 6.251121 bits/bucket
Elias-Fano cumul sizes: 0.072511 bits/key
Elias-Fano cumul bits: 0.062511 bits/key
Rice-Golomb descriptors: 1.656699 bits/key
Data structure: 0.000325 bits/key
Total bits: 1.792047 bits/key
Construction time: 6.565 s, 1011 ns/key
Size: 11634649
BTW, great job with https://github.com/bytehamster/gpurecsplit !
Good idea, and I don't even know why it wasn't there, but, just to check, did you compare it with the actual on-disk size?