Open vbekiaris opened 2 months ago
Thanks for the report and the test!
IMO this is working as designed: the implicit contract of RAVV is that it should give a valid vector for ordinals from 0..size(). In other words, JVector is designed to support "holes" in a graph, but not in a RAVV.
Apparently there are 31 usages of size() so reviewing those to introduce getIdUpperBound the way we did for GraphIndex to accommodate holes isn't unreasonable, but I'm a bit skeptical that it's necessary. Can you share more about your use case?
(I'm also happy to eliminate the source of confusion by deleting MapRAVV. I thought it was going to be useful for Cassandra but we ended up not using it after all.)
Hi, I'd like to continue this discussion.
In our scenario, we have a collection where vectors can be added and removed, which makes it easy to create gaps in the RAVV.
Could you clarify how the library is intended to be used in this context? Should deletion be avoided, or should deleted node identifiers be reused?
All of the on-heap data structures are designed around the principle that node ids are mostly contiguous (see DenseIntMap in particular), and the on-disk structure assumes they are entirely contiguous. Additionally, removeDeletedNodes cannot be called safely while other modifications are in flight. (See https://github.com/jbellis/jvector/issues/272.)
I think this would work for you:
I would support modifying removeDeletedNodes to return a BitSet of removed IDs so user code doesn't have to track deletions a second time.
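For the "should deleted node identifiers be reused?" question, here is a rough sketch of what user-side bookkeeping could look like if the answer is yes. ReusableIdAllocator and its methods are hypothetical names, not part of JVector, and the sketch assumes the application only releases an ID once the index no longer references it (for example after a cleanup pass that reports the removed IDs as a BitSet).

```java
import java.util.BitSet;

// Hypothetical user-side helper; not part of JVector.
final class ReusableIdAllocator {
    private final BitSet free = new BitSet(); // IDs released and available for reuse
    private int next = 0;                     // first ID that has never been handed out

    // Returns the lowest released ID, or a fresh one if none is available.
    int allocate() {
        int id = free.nextSetBit(0);
        if (id >= 0) {
            free.clear(id);
            return id;
        }
        return next++;
    }

    // Marks every ID set in `removed` as reusable.
    void release(BitSet removed) {
        for (int id = removed.nextSetBit(0); id >= 0; id = removed.nextSetBit(id + 1)) {
            if (id >= next) {
                throw new IllegalArgumentException("never allocated: " + id);
            }
            free.set(id);
        }
    }

    // One past the highest ID ever handed out.
    int upperBound() {
        return next;
    }
}
```

Handing out the lowest free ID first keeps the ID range as dense as possible, which matches the "mostly contiguous" assumption mentioned above. Holes still exist between a release and the next allocation, so this narrows the problem rather than removing it.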
In 97e523c306ae42c3e963484e320fa1c7432b5250, the approximateCentroid() implementation for the BuildScoreProvider returned from BuildScoreProvider.randomAccessScoreProvider() was updated to allow for non-sequential node IDs. However, the iteration only takes into account nodes with ID < ravv.size(). This means that if there are actually "holes" in the ID sequence (e.g. add 100 nodes in a MapRandomAccessVectorValues, then remove 10 starting from 0), then some nodes (those with ID >= 90 in the example) will not be taken into account while calculating the centroid.
A fix would probably require changing the RandomAccessVectorValues API to expose an iterator or the highest nodeId that is set (or something similar).
Test that demonstrates the issue:
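Separately from the reporter's test, the effect can be reproduced without JVector at all. The sketch below is a stand-alone illustration that uses a plain HashMap in place of a map-backed RAVV; SparseCentroidDemo, centroid, upperBound and sample are hypothetical names, and the one-dimensional vectors (each vector's single component equals its ID) are chosen only to make the arithmetic easy to check.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone illustration; does not use JVector's classes.
final class SparseCentroidDemo {
    // Averages the vectors whose ID is below `bound`, skipping missing IDs.
    static float[] centroid(Map<Integer, float[]> vectors, int dimension, int bound) {
        float[] sum = new float[dimension];
        int count = 0;
        for (int id = 0; id < bound; id++) {
            float[] v = vectors.get(id);
            if (v == null) {
                continue; // hole in the ID sequence
            }
            for (int d = 0; d < dimension; d++) {
                sum[d] += v[d];
            }
            count++;
        }
        if (count == 0) {
            return sum;
        }
        for (int d = 0; d < dimension; d++) {
            sum[d] /= count;
        }
        return sum;
    }

    // One past the highest ID present, or 0 for an empty map.
    static int upperBound(Map<Integer, float[]> vectors) {
        int max = -1;
        for (int id : vectors.keySet()) {
            max = Math.max(max, id);
        }
        return max + 1;
    }

    // 100 one-dimensional vectors with IDs 0..99, then IDs 0..9 removed.
    static Map<Integer, float[]> sample() {
        Map<Integer, float[]> m = new HashMap<>();
        for (int id = 0; id < 100; id++) {
            m.put(id, new float[] { id });
        }
        for (int id = 0; id < 10; id++) {
            m.remove(id);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<Integer, float[]> vectors = sample();
        float bySize = centroid(vectors, 1, vectors.size())[0];         // scans IDs 0..89
        float byBound = centroid(vectors, 1, upperBound(vectors))[0];   // scans IDs 0..99
        System.out.println(bySize + " vs " + byBound); // prints "49.5 vs 54.5"
    }
}
```

With size() as the loop bound, only IDs 10..89 contribute, so the result is 49.5. Using the highest ID plus one as the bound includes IDs 90..99 and gives 54.5, which is the true mean of the remaining vectors.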