abonander opened this issue 5 years ago
@abonander
This message got too long, so the short version is that I would like to wait for const generics to be merged before doing this, but I think we can still get approximately the same performance by appeasing the branch predictor in the dynamic bit-count case. It is an absolute must that the dynamic size passed in is a power-of-two number of bytes to get the performance required; simply zero-pad your features before passing them in and it will work the same. I think I can support this use case for you right now by making another tree called something like `HwtDyn`. I may perform some more optimizations on `u128` before moving to the dynamic tree, though.
As it stands, the only thing integral to the current HWT algorithm is that the feature size be a power of 2. This is simply because sub-bitstrings are exactly halves of the parent strings they come from. That is not actually necessary in principle, so long as the size of the bitstrings is consistent at each layer of the tree. There don't even need to be 2 substrings per parent string, but having exactly 2 maximizes the speed of operation since it minimizes the number of leaf-node searches in the tree. I also wouldn't be able to use SIMD efficiently if the sizes were not powers of two: I can't pack all the bits to the right in each sub-bitstring in parallel if they are different sizes, nor can I perform several hamming weight checks in parallel.
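To make the halving concrete, here is a minimal sketch (just an illustration of the idea, not the crate's internals) of splitting a `u128` feature into its two `u64` halves and taking the hamming weight of each:

```rust
/// Split a 128-bit feature into its two 64-bit halves and return the
/// hamming weight (popcount) of each half. This is the basic operation
/// the halving scheme relies on; the real tree does this with SIMD
/// across several features at once.
fn half_weights(feature: u128) -> (u32, u32) {
    let high = (feature >> 64) as u64;
    let low = feature as u64;
    (high.count_ones(), low.count_ones())
}

fn main() {
    let feature: u128 = 0x00FF_0000_0000_0000_0000_0000_0000_000F;
    let (hi, lo) = half_weights(feature);
    // The halves' weights always sum to the parent's weight.
    assert_eq!(hi + lo, feature.count_ones());
    println!("high half weight = {hi}, low half weight = {lo}");
}
```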
We could allow dynamic feature sizes for the leaves stored in the tree, but make it so the features are zero-padded to the next largest power of two. I think this is probably the option we want to go with. However, I don't want to store the features in a vector or other heap-allocated object. We can abstract over an array of `u8` (like `Hwt<[u8; 72]>`, for instance), but I would then like to be able to figure out how many power-of-two chunks there are in the underlying array, because right now I keep four features adjacent in memory so I can load them into a 128x4 SIMD register and do four feature comparisons in parallel. To get the performance required, it is necessary to know the next largest power of two for the feature size, store the data zero-padded to that size, and use the appropriate SIMD instructions. There is also a weird situation on Intel processors where 256x2 is less efficient, simply because `packed_simd` (the crate I am using for SIMD) doesn't have a way to get a 256-wide SIMD lane to perform the popcount for each half of the number (whereas I can do it for 128 in parallel and for 512 by summing the lanes), so I would need some extra operations to sum the 256-bit case compared to the 512-bit case.
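As a rough illustration of the padding step (not the crate's code; it uses a `Vec` only for brevity, whereas the tree would keep the padded features contiguous rather than heap-allocating each one), the next power-of-two length and the number of SIMD chunks fall out of `next_power_of_two`:

```rust
/// Zero-pad a feature to the next power-of-two length in bytes so it can
/// be loaded into fixed-width SIMD lanes. Purely illustrative.
fn pad_to_power_of_two(feature: &[u8]) -> Vec<u8> {
    let padded_len = feature.len().next_power_of_two();
    let mut padded = vec![0u8; padded_len];
    padded[..feature.len()].copy_from_slice(feature);
    padded
}

fn main() {
    // A 72-byte (576-bit) feature pads up to 128 bytes (1024 bits).
    let feature = [0xABu8; 72];
    let padded = pad_to_power_of_two(&feature);
    assert_eq!(padded.len(), 128);
    // Number of 128-bit SIMD chunks in the padded feature.
    let chunks_128 = padded.len() / 16;
    assert_eq!(chunks_128, 8);
}
```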
Without overwhelming you with more details, the point I want to make here is that I need precise information about the exact operations to perform based on the next power-of-two width of the feature, and to run benchmarks to ensure each case is optimized appropriately. To this end, I believe I could wait for const generics in Rust so that I can define, for each power-of-two width, an implementation with optimal performance.
Alternatively, since this is an unstable API, I think we could make a compromise that appeases the branch predictor. We can just store the power of two dynamically in the tree, and every time I need to do something based on the size, I can use a switch statement. The switch statement should make the branch predictor go the right way every time, so it shouldn't have much overhead. I won't let the features be non-powers-of-two, though, because that would impact my ability to load them into SIMD registers for processing.
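A minimal sketch of that compromise, with hypothetical names that are not the crate's API: the padded width is stored once at construction, and every size-dependent operation dispatches through a single `match`, which the branch predictor should resolve the same way on every call.

```rust
/// Hypothetical dynamic-width tree: the padded feature width in bits is
/// chosen at construction time and stored, and size-dependent code
/// dispatches on it with a `match`. Because the width never changes for
/// a given tree, the branch predictor takes the same arm every time.
struct HwtDyn {
    /// Power-of-two feature width in bits (e.g. 128, 256, 512).
    width_bits: u32,
    features: Vec<Vec<u8>>,
}

impl HwtDyn {
    fn new(width_bits: u32) -> Self {
        assert!(width_bits.is_power_of_two(), "feature width must be a power of two");
        HwtDyn { width_bits, features: Vec::new() }
    }

    /// Hamming distance between a query and a stored feature, dispatching
    /// on the stored width. Every arm here calls the same scalar fallback;
    /// in the real tree each arm would stand in for a width-specific SIMD path.
    fn distance(&self, query: &[u8], index: usize) -> u32 {
        let feature = &self.features[index];
        match self.width_bits {
            128 => hamming_scalar(query, feature), // 128-bit SIMD popcount
            256 => hamming_scalar(query, feature), // the awkward 256-bit case
            512 => hamming_scalar(query, feature), // sum popcounts across 512-bit lanes
            _ => hamming_scalar(query, feature),
        }
    }
}

/// Plain scalar fallback used by every arm in this sketch.
fn hamming_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```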
@abonander Also, I would keep in mind that if you have binary features, you don't want to use a VP tree. See *Thick Boundaries in Binary Space and Their Influence on Nearest-Neighbor Search* for details.
@abonander I am considering focusing less on the design of this tree until const generics comes out because I would like to focus on implementing an ANN algorithm instead of an exact NN algorithm. I will leave this issue open in case others want to implement this. For the time being, you can pad your 64-bit features with 0s, though it will take twice as much memory.
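For that workaround, padding a 64-bit hash into a `u128` is just a zero-extension; the low 64 bits carry the hash and the high 64 bits stay zero, which is what doubles the memory per feature:

```rust
/// Zero-extend a 64-bit hash into the 128-bit feature the tree expects.
/// The upper 64 bits are all zero, so hamming distances between padded
/// features equal the distances between the original 64-bit hashes.
fn pad_u64(hash: u64) -> u128 {
    hash as u128
}

fn main() {
    let a: u64 = 0xDEAD_BEEF_0123_4567;
    let b: u64 = 0xDEAD_BEEF_0123_0000;
    let dist_64 = (a ^ b).count_ones();
    let dist_128 = (pad_u64(a) ^ pad_u64(b)).count_ones();
    assert_eq!(dist_64, dist_128);
}
```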
This graph demonstrates the high performance of ANN algorithms, hence my shift in interest.
@abonander If you are interested, I have now published a crate implementing the state-of-the-art ANN algorithm for binary feature/image hash searches, HNSW, in Rust, and it performs significantly better than this crate. It also supports 256-bit features. Please visit here:
I also haven't forgotten about this crate, but I am still much more interested in the ANN problem than the exact nearest neighbors problem. Let me know if that helps :+1:.
That's neat, and my use-case doesn't technically need exact nearest neighbors. However, I still want to support dynamically sized features (they're constant over the lifetime of the data structure but dynamic in origin; I want to be able to vary them between hashing runs).
I'm not familiar with HNSW, so maybe you can say if it needs power-of-2-length bitstrings like HWT does. If it doesn't, then perhaps it would be even easier to support dynamic bitstrings (in fixed-size storage as well as `Vec`).
@abonander HNSW actually does not require power-of-two-sized bitstrings. It has a trait in the crate you can implement for any number of bits. Please see the implementations in the crate for details, since it's a bit weird. Due to some limitations with `typenum`, you also have to tell it the maximum distance + 1 as the `Distances` associated type. You can also adjust the parameters to get high recall. Even if it doesn't give you an exact nearest neighbor, it is likely to give you a very close neighbor at least.
I would also add that the discrete version doesn't work well with bitstrings whose size is only known at runtime. However, you can use the regular HNSW in the crate instead of the discrete version. That version will even let you compare bitstrings of entirely different lengths, if you have a method to compute their distance.
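If you go that route, the distance method itself is straightforward. Here is a minimal sketch, independent of the crate's trait definitions, of a hamming distance over byte slices that treats the shorter string as zero-padded (one possible convention for mismatched lengths):

```rust
/// Hamming distance between two bitstrings stored as byte slices.
/// If the slices have different lengths, the shorter one is treated as
/// zero-padded, so the extra bits of the longer string all count as
/// differences. This is only one possible convention.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    let (short, long) = if a.len() <= b.len() { (a, b) } else { (b, a) };
    let shared: u32 = short
        .iter()
        .zip(long)
        .map(|(x, y)| (x ^ y).count_ones())
        .sum();
    let extra: u32 = long[short.len()..].iter().map(|x| x.count_ones()).sum();
    shared + extra
}

fn main() {
    assert_eq!(hamming(&[0b1010_1010], &[0b0101_0101]), 8);
    // Different lengths: the trailing byte of the longer string counts fully.
    assert_eq!(hamming(&[0xFF], &[0xFF, 0x0F]), 4);
}
```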
Neat, although due to the orphan rule I can't create an impl of `*Distance` for `Hamming<Vec<u8>>` in my own code, so I would have to recreate it entirely.
I'm curious why `HNSW` works better than `DiscreteHNSW` though, and what meaning could be derived from comparing bitstrings of different lengths. Maybe a fuzzy comparison that ignores bits?
@abonander It might seem a bit weird, but you can implement the `Distance` traits on things that aren't wrapped by `Hamming` and it will work fine. The algorithm only needs distance to work; I hadn't thought of a good way to handle that cleanly yet. `DiscreteHNSW` has a faster priority queue than the regular `HNSW`. I actually just realized, though, that I can probably make it resize dynamically up-front without any overhead instead of using an array internally, so that will be my next step.
@abonander I have significantly modified HNSW after some benchmarks, and I can safely say that you can use features that are dynamically sized at runtime without any significant performance degradation. See the docs here for `Distance`. Although the implementation for a byte slice won't be as fast as the others, you can use it to get relatively high-performance matching on runtime-sized bytes. You can also implement the `Distance` trait on your own type. I actually think the orphan rules will allow you to implement `Distance` on `Hamming<T>` where `T` is a type defined in your crate, so that should get around the orphan rule problem. Give it a try!
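As a sketch of the local-type approach (the `Distance` trait below is a stand-in with an assumed signature, not copied from the hnsw crate, so check its docs for the real one), the idea is to wrap your runtime-sized hash in your own newtype and implement the trait for that:

```rust
// Stand-in for the trait exposed by the hnsw crate; the real associated
// items and method signature may differ, so treat this as illustrative.
trait Distance {
    fn distance(&self, other: &Self) -> u32;
}

/// A newtype defined in *your* crate wrapping a runtime-sized hash.
/// Because the type is local, implementing a foreign trait for it is
/// always permitted by the orphan rules.
struct MyHash(Vec<u8>);

impl Distance for MyHash {
    fn distance(&self, other: &Self) -> u32 {
        self.0
            .iter()
            .zip(&other.0)
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}

fn main() {
    let a = MyHash(vec![0xFF, 0x00]);
    let b = MyHash(vec![0x0F, 0xF0]);
    assert_eq!(a.distance(&b), 8);
}
```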
The graphs in the 0.2.0 release are not updated to reflect the new benchmarks, but it is slightly faster now. I will work on that later.
I'm interested in using this crate to collate hashes produced by my `img_hash` library. I started to work on an implementation of a multiple vantage-point tree, but I haven't had the time to fill out the tests or even finish the documentation, and I have a feeling it's going to perform pretty abysmally by default. I really prefer your approach in this crate, as I feel like we're working in the same problem space.
However, one of the features of `img_hash` is that it can produce hashes of varying sizes depending on a user's preference between performance and accuracy. By default, I have it producing 64-bit hashes, so representing them as `u128` would be pretty wasteful, and I also want to experiment with hash sizes of 256 bits and maybe greater, in which case `u128` would require discarding information from the hash.

In the 3.0.0-alpha I have yet to release, I have a generic representation of hash bytes which gives the option of either allocating them on the heap (for dynamic sizes) or inline (to save an indirection and possibly improve codegen). I think this kind of generic approach would benefit `hwt` as well, but I'm wondering what your opinion is of it.

I don't entirely understand the HWT algorithm, so I don't know if 128-bit features are integral to its function, but I have a feeling it wouldn't be too difficult to generalize over any size of feature as long as they're all the same size in the tree.
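A hedged sketch of what that kind of storage abstraction could look like; the names here are made up for illustration and are not `img_hash`'s actual API. The idea is a trait implemented both for inline arrays and for heap-allocated vectors, so the consumer only ever sees a byte slice:

```rust
/// Illustrative abstraction over hash storage: the container can be an
/// inline fixed-size array (no allocation, size known at compile time)
/// or a heap-allocated Vec (size chosen at runtime). Hypothetical names,
/// not img_hash's real API.
trait HashStorage {
    fn bytes(&self) -> &[u8];
}

impl<const N: usize> HashStorage for [u8; N] {
    fn bytes(&self) -> &[u8] {
        self
    }
}

impl HashStorage for Vec<u8> {
    fn bytes(&self) -> &[u8] {
        self
    }
}

/// A consumer like the tree can stay generic over the storage and only
/// requires that all stored hashes have the same length.
fn hamming<H: HashStorage>(a: &H, b: &H) -> u32 {
    a.bytes()
        .iter()
        .zip(b.bytes())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}

fn main() {
    let inline_a = [0xFFu8; 8];
    let inline_b = [0x0Fu8; 8];
    assert_eq!(hamming(&inline_a, &inline_b), 32);

    let heap_a = vec![0xFFu8; 32];
    let heap_b = vec![0x00u8; 32];
    assert_eq!(hamming(&heap_a, &heap_b), 256);
}
```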