Open miretchin opened 11 months ago
I may also have a use-case for sparse vectors in general. It's a slightly different use-case than what @miretchin mentions, but for some data deduplication tasks it may also be useful to have sparse data going in.
I understand if this is out of scope for the project, just figured I'd mention it.
I think sparse vectors, non-float vector types (binary and integer), and additional distance metrics like Jaccard are features we'd be interested in supporting in the Lance format!
We might not have time to get these implemented in the near future, so for now I've added the "help wanted" label.
Hello, your friendly neighborhood computational chemist here!
I'd like to be able to perform vector searches over binary / boolean fields (in this case, molecular fingerprints; see e.g. https://chemicbook.com/2021/03/25/a-beginners-guide-for-understanding-extended-connectivity-fingerprints.html), but it won't let me:
ValueError: LanceError(IO): Column fingerprint is not a vector column (type: FixedSizeList(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2048)), /home/runner/work/lance/lance/rust/lance/src/dataset/scanner.rs:384:35
It doesn't work with any flavor of integer either. It would be amazing to get this working, because booleans take up much less room on disk than floats. These fingerprints are often very sparse, so it would be even better if we could use a sparse format like SparseTensor in Arrow. As for distance metrics, Jaccard is the preferred one for fingerprints; failing that, cosine or inner product would be preferable to Euclidean.
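For anyone unfamiliar with the metric, here is a rough sketch of what Jaccard distance over sparse binary fingerprints looks like. It is plain Python and does not use Lance or Arrow at all; the helper names (`to_sparse`, `jaccard_distance`, `nearest`) are made up for illustration, and treating two all-zero fingerprints as distance 0 is my own assumption rather than anything the library defines.

```python
from typing import List, Sequence, Set, Tuple


def to_sparse(bits: Sequence[bool]) -> Set[int]:
    """Store a dense boolean fingerprint as the set of its on-bit indices."""
    return {i for i, bit in enumerate(bits) if bit}


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |a & b| / |a | b| for two sparse fingerprints."""
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / union


def nearest(
    query: Set[int], fingerprints: Sequence[Set[int]], k: int = 5
) -> List[Tuple[int, float]]:
    """Brute-force search: (row index, distance) for the k closest rows."""
    scored = [(i, jaccard_distance(query, fp)) for i, fp in enumerate(fingerprints)]
    scored.sort(key=lambda pair: pair[1])  # stable sort keeps ties in row order
    return scored[:k]


if __name__ == "__main__":
    query = to_sparse([True, True, False, False])
    rows = [to_sparse([True, False, True, False]), query, {5}]
    print(nearest(query, rows, k=2))
```

This is a linear scan, so it only illustrates the metric; an index that supports Jaccard natively would be the real goal of this request.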