improbable-eng / phtree-cpp

PH-Tree C++ implementation by Improbable.
Apache License 2.0
22 stars 15 forks source link

Number of times leaf nodes are visited when running range- and knn-queries #12

Closed mpriorust closed 2 years ago

mpriorust commented 3 years ago

Dear All,

I initialised a tree: PhTreeD<DIM, int> PhTree;

I Inserted points of type PhPointD, and run PhTree.begin_query and PhTree.begin_knn_query on it.

Both queries return the expected results.

I wanted to ask if there is a way to trace the number of times leaf nodes are visited when running the queries.

Kind regards,

Max Prior

improbable-til commented 3 years ago

Hi, you could use a FIlter (see README):

template <dimension_t DIM, typename T>
struct FilterForCountingNodes {
    [[nodiscard]] constexpr bool IsEntryValid(const PhPoint<DIM>& key, const T& value) const {
        return true;
    }
    [[nodiscard]] constexpr bool IsNodeValid(const PhPoint<DIM>& prefix, int bits_to_ignore) const {
        ++count;
        return true;
    }
    size_t count{0};
};

for (auto it = tree.begin_query({1, 1, 1}, {3, 3, 3}, FilterForCountingNodes<3, T>())); it != tree.end(); ++it) {
    ...
}
for (auto it = tree.begin_query(YourDistanceFuntionHere, FilterForCountingNodes<3, T>())); it != tree.end(); ++it) {
    ...
}

Of course you could also adapt the iterator_knn.h and iterator_full.h/iterator_hc.h to count visited nodes.

Some things to note:

May I ask why you are interested in the number of node visits?

Let me know if you find any problems.

mpriorust commented 3 years ago

Thank you for your response, and the hint to use filters.

I want to measure the I/O costs by counting the number of times leaf nodes are visited when running range or knn queries. Is this possible and appropriate? If so, is there a differentiation between leaf and non-leaf nodes?

improbable-til commented 3 years ago

No there is no difference between leaf/non-leaf nodes.

I think this can be used to estimate I/O costs. I am not sure how that would work in detail because the actual "cost" depends on whether node is flushed from a cache level? There are ways to measure this, I think vtune can measure I/O wait, and I think there are other options as well. I think the CPU can report these metric and you can read them out with command line tools (don't ask me how though).

You should probably also count+measure the access to individual entries (=points). The PH-Tree keeps them in an array that belongs to the node, while other spatial indexes often store them as separate entries. Another advantage of the PH-Tree is that it first checks the HC-position so it does not necessarily need to check (or even load) all subnodes or points in a nodes, i.e. it can exclude a subnode without having to access its coordinate.

In case you are doing research: note that the C++ implementations is a bit different from the approach described in the PH-Tree research paper. For example, the C++ version does not serialize a node's data into a bitstring (which would save some space and reduce I/O). The Java version does this bit-string-serialization, but only until 8 dimensions or so. The bitstream approach does improve speed for low dimensions but becomes too expensive for remove/insert operations for higher dimensionality.