microsoft / ALEX

A library for building an in-memory, Adaptive Learned indEX
MIT License
662 stars 114 forks source link

Question about Figure 9: ALEX supports redundant key values, but STX B+ Tree does not has such feature #12

Closed kaiwang19 closed 3 years ago

kaiwang19 commented 3 years ago

Dear Jialing,

ALEX supports inserting <key, payload> pairs with the same key. For example, if you insert the following 10 pairs with the same key 42:

The inserting results will keep the 10 pairs in the ALEX structure.

However, for STX B+ Tree[5], it supports inserting only 1 <key, payload> pair with a specific key. Therefore, if you insert the 10 pairs with key 42 in the previous example, STX B+ Tree will keep only 1 pair of them. Actually, only the first pair can be inserted while the rest 9 pairs will fail.

image

image

I thus wonder how to conduct the experiments in Figure 9 to compare the throughput between ALEX and B+ Tree. Do you change the STX B+ Tree(v0.9) so that B+ Tree could support inserting pairs with the same key?

jialinding commented 3 years ago

In our paper, our datasets did not contain any duplicate keys, so the situation you describe never arose.

kaiwang19 commented 3 years ago

Ah, I see. Thank you.

kaiwang19 commented 3 years ago

Dear Jialing,

Thanks to your help, I can successfully replicate experiments in Figure 9:(a)-(d) now. I am not so sure about details to compute the index size in Figure 9:(f)-(j). image

I have 4 questions about how to compute the index size. Would you help me clarify them?

Q1: To compute the size of a data node, can I put it this way:

To compute the size of a model node, can I put it this way:

I am not sure whether to include the size of the model if we compute the size for a node. Do I also need to include some statistics, like:

Q2: To count the size of a gapped array(for keys) in a data node, do you use all the slots(_data_capacity), or just the slots containing keys(_num_key)?

Q3: If we need to compute the size of the model, how to count it? Do we just count the size of two double variables: a(slope) and b(intercept). Or do we also need to count the size of other statistics:

Q4: I would appreciate you a lot if you could show me the computation of the following 2 examples.

We fix the key type to double, which is 8 bytes(64 bits) in C++. Then we fix the payload size to 8 bytes.

Could you show me how to calculate the size in the following 2 examples:

Thanks a lot~

jialinding commented 3 years ago

You can refer to https://github.com/microsoft/ALEX/blob/master/src/core/alex.h#L2297 and https://github.com/microsoft/ALEX/blob/master/src/core/alex.h#L2311 for the methods to compute data size and model size. Index size in the plots is equivalent to model size.

kaiwang19 commented 3 years ago

Ah, gotcha! It is great to have the two methods~

kaiwang19 commented 3 years ago

Dear Jialing,

I have figured out how to compute the node sizes now. May I ask what is the data type you use for the payload (80 bytes) of YCSB? I have tried char[80], but it causes many bugs. image

Thank you so much for your time.

jialinding commented 3 years ago

You may need to wrap the char array in a struct or use std::array<char, 80>.

We used an 80-byte struct where the first 8 bytes represented a randomly-generated uint64_t:

// 80-byte payload
struct big_payload {
    uint64_t first_value;
    char remaining_values[72];

    bool operator< (const big_payload &other) const {
        return first_value < other.first_value;
    }
};

The only reason we did this was so that we could easily sum payload values in our benchmarks.

If you're seeing bugs in the provided benchmark executable, that's because you'll need to write your own way of randomly generating and summing payloads (or just comment them out).

kaiwang19 commented 3 years ago

Great! Thank you so much!