vigna / sux

Succinct data structures in C/C++
Other
82 stars 17 forks source link

Add estimate for number of bits used by RecSplit #13

Closed ByteHamster closed 1 year ago

vigna commented 2 years ago

Good idea, and I don't even know why it wasn't there, but, just to check, did you compare it with the actual on-disk size?

ByteHamster commented 2 years ago

I just tried a few configurations (see below). By the way, there is a slight inconsistency in space usage reported between the STATS output and my new getBits() method: The text output of RecSplit uses builder.getBits(). The actual space consumption is more close to descriptors.getBits(), which is usually about 100 bits larger.

Comparison (N=100k)

Configuration new getBits() STATS output File size
l=4, b=100 2.22 2.20 2.22
l=4, b=2000 2.15 2.13 2.14
l=8, b=100 1.81 1.78 1.80
l=8, b=2000 1.73 1.71 1.72

Comparison (N=1 Million)

Configuration new getBits() STATS output File size
l=4, b=100 2.216 2.214 2.216
l=4, b=2000 2.133 2.131 2.133
l=8, b=100 1.791 1.788 1.791
l=8, b=2000 1.711 1.709 1.711
vigna commented 2 years ago

I'm trying to understand the discrepancy. Do you think it's because of the RecSplit data structure size (which is not included in builder.getBits())?

vigna commented 2 years ago

One potential problem I see with using builder.getBits() is that you don't get the rounding to the next word. On the other hand, if you are measuring the number of bits of the data structure in an algorithmic sense that rounding shouldn't be there.

ByteHamster commented 2 years ago

One potential problem I see with using builder.getBits() is that you don't get the rounding to the next word.

In my experiments, builder.getBits() - descriptors.getBits() is usually about 100 bits, so it is a little more than just rounding to the next word. Not sure what exactly is happening there. For large N, this becomes mostly irrelevant.

On the other hand, if you are measuring the number of bits of the data structure in an algorithmic sense that rounding shouldn't be there.

From a practical perspective, reporting the non-rounded version (builder.getBits()) would require storing the number explicitly somewhere, because builder is not available after construction is complete. That could look like cheating. Contrarily, the value of descriptors.getBits() it is available in getBits().

Do you think it's because of the RecSplit data structure size (which is not included in builder.getBits())?

sizeof(RecSplit) is 264 bytes / 2112 bits on my machine. Is it really a fair comparison with other data structures if one does not count the data structure size? Many other data structures in sux, such as EliasFano, RankSelect and Vector do include their data structure size.

vigna commented 2 years ago

In my experiments, builder.getBits() - descriptors.getBits() is usually about 100 bits, so it is a little more than just rounding to the next word. Not sure what exactly is happening there. For large N, this becomes mostly irrelevant.

Neither am I...

From a practical perspective, reporting the non-rounded version (builder.getBits()) would require storing the number explicitly somewhere, because builder is not available after construction is complete. That could look like cheating. Contrarily, the value of descriptors.getBits() it is available in getBits().

From a practical viewpoint, all this discussion is irrelevant 🤷🏻‍♂️.

sizeof(RecSplit) is 264 bytes / 2112 bits on my machine. Is it really a fair comparison with other data structures if one does not count the data structure size? Many other data structures in sux, such as EliasFano, RankSelect and Vector do include their data structure size.

Yes, but it includes data that is shared by all instances.

ByteHamster commented 2 years ago

Yes, but it includes data that is shared by all instances.

Most shared data (like Golomb-Rice parameters or seeds) seems to be stored as static members or outside the class. Breaking down the space usage of the RecSplit data structure, it does not seem to include many shared things.

At least for large N, the getBits method that includes sizeof(RecSplit) seems to be closer to the file size than the STATS version that doesn't include sizeof(RecSplit).

ByteHamster commented 2 years ago

How do you want to continue here, @vigna? Which variant do you prefer?

vigna commented 2 years ago

I need some time to look at this in detail, and I have a lot of classes now...

vigna commented 1 year ago

I'm sorry for the very late answer. This somehow slipped through. Yes, the RecSplit size should be counted in for coherence with the other data structures. The bitCount() you propose, however, is a bit larger than necessary because you count the descriptor structure (not the data) twice. So,

    size_t bitCount() { return ef.bitCountCumKeys() + ef.bitCountPosition() + descriptors.getBits() + 8 * sizeof(*this) - 8 * sizeof(descriptors); }

This is still a bit larger than serialized size, because one field is not serialized but deduced, but it is certainly in line with the results of bitCount() in all other structures, and it gives correct information about the in-size memory. Thank you for making me notice this.

The STATS-displayed values are in fact counts of the number of bits that are necessary to represent a value, broken down by category. It is the only space that is dependent on the size of the data set. So after a million key or so the missing bits of the data structure differs from STATS by less than 0.1%. The paper was aiming at billions of keys, and our smallest dataset is 1M keys, so even considering the data structures the experimental data are correct. In any case, now there's an explicit item:

at 16:21:19 ➜ ./bin/recsplit_dump_8 enwiki-2022.ids 100 enwiki-2022.rs
Building...
Elias-Fano cumul sizes:  7.251121 bits/bucket
Elias-Fano cumul bits:   6.251121 bits/bucket
Elias-Fano cumul sizes:  0.072511 bits/key
Elias-Fano cumul bits:   0.062511 bits/key
Rice-Golomb descriptors: 1.656699 bits/key
Data structure:          0.000325 bits/key
Total bits:              1.792047 bits/key
Construction time: 6.565 s, 1011 ns/key
Size: 11634649

BTW, great job with https://github.com/bytehamster/gpurecsplit !