Question about Figure 9: ALEX supports redundant key values, but STX B+ Tree does not has such feature

kaiwang19 commented 3 years ago

Dear Jialing,

ALEX supports inserting <key, payload> pairs with the same key. For example, if you insert the following 10 pairs with the same key 42:

<42, 1980013608>
<42, 678433864>
<42, 1161402928>
<42, 1467459259>
<42, 1692683787>
<42, 998901092>
<42, 475827625>
<42, 60631511>
<42, 281718131>
<42, 359319102>

The inserting results will keep the 10 pairs in the ALEX structure.

However, for STX B+ Tree[5], it supports inserting only 1 <key, payload> pair with a specific key. Therefore, if you insert the 10 pairs with key 42 in the previous example, STX B+ Tree will keep only 1 pair of them. Actually, only the first pair can be inserted while the rest 9 pairs will fail.

I thus wonder how to conduct the experiments in Figure 9 to compare the throughput between ALEX and B+ Tree. Do you change the STX B+ Tree(v0.9) so that B+ Tree could support inserting pairs with the same key?

jialinding commented 3 years ago

In our paper, our datasets did not contain any duplicate keys, so the situation you describe never arose.

kaiwang19 commented 3 years ago

Ah, I see. Thank you.

kaiwang19 commented 3 years ago

Dear Jialing,

Thanks to your help, I can successfully replicate experiments in Figure 9:(a)-(d) now. I am not so sure about details to compute the index size in Figure 9:(f)-(j).

I have 4 questions about how to compute the index size. Would you help me clarify them?

Q1: To compute the size of a data node, can I put it this way:

size(data node) = size(model) + size(gapped array for keys) + size(gapped array for payloads)

To compute the size of a model node, can I put it this way:

size(model node) = size(model) + size(array for pointers)

I am not sure whether to include the size of the model if we compute the size for a node. Do I also need to include some statistics, like:

isleaf:bool
duplicationfactor:uint8_t
level_:short
cost_:double

Q2: To count the size of a gapped array(for keys) in a data node, do you use all the slots(_data_capacity), or just the slots containing keys(_num_key)?

Q3: If we need to compute the size of the model, how to count it? Do we just count the size of two double variables: a(slope) and b(intercept). Or do we also need to count the size of other statistics:

count_
xsum
ysum
xxsum
xysum
xmin
xmax
ymin
ymax

Q4: I would appreciate you a lot if you could show me the computation of the following 2 examples.

We fix the key type to double, which is 8 bytes(64 bits) in C++. Then we fix the payload size to 8 bytes.

Could you show me how to calculate the size in the following 2 examples:

If there is a data node with 1 slot and 0 keys, what is the size of this data node?
If there is a data node with 14 slots and 10 keys, what is the size of this data node?

Thanks a lot~

jialinding commented 3 years ago

You can refer to https://github.com/microsoft/ALEX/blob/master/src/core/alex.h#L2297 and https://github.com/microsoft/ALEX/blob/master/src/core/alex.h#L2311 for the methods to compute data size and model size. Index size in the plots is equivalent to model size.

kaiwang19 commented 3 years ago

Ah, gotcha! It is great to have the two methods~

kaiwang19 commented 3 years ago

Dear Jialing,

I have figured out how to compute the node sizes now. May I ask what is the data type you use for the payload (80 bytes) of YCSB? I have tried char[80], but it causes many bugs.

Thank you so much for your time.

jialinding commented 3 years ago

You may need to wrap the char array in a struct or use std::array<char, 80>.

We used an 80-byte struct where the first 8 bytes represented a randomly-generated uint64_t:

// 80-byte payload
struct big_payload {
    uint64_t first_value;
    char remaining_values[72];

    bool operator< (const big_payload &other) const {
        return first_value < other.first_value;
    }
};

The only reason we did this was so that we could easily sum payload values in our benchmarks.

If you're seeing bugs in the provided benchmark executable, that's because you'll need to write your own way of randomly generating and summing payloads (or just comment them out).

kaiwang19 commented 3 years ago

Great! Thank you so much!

microsoft / ALEX

Question about Figure 9: ALEX supports redundant key values, but STX B+ Tree does not has such feature #12