lancedb / lancedb

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!
https://lancedb.github.io/lancedb/
Apache License 2.0

Using booleans in LanceDB #678

Open miretchin opened 6 months ago

miretchin commented 6 months ago

Hello, your friendly neighborhood computational chemist here!

I'd like to be able to perform vector searches over binary / boolean fields (in this case, molecular fingerprints; see e.g. https://chemicbook.com/2021/03/25/a-beginners-guide-for-understanding-extended-connectivity-fingerprints.html).

But it won't let me. I get:

ValueError: LanceError(IO): Column fingerprint is not a vector column (type: FixedSizeList(Field { name: "item", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2048)), /home/runner/work/lance/lance/rust/lance/src/dataset/scanner.rs:384:35

import pyarrow as pa
import lancedb
from lancedb.pydantic import LanceModel, Vector


class Molecule(LanceModel):
    id: str
    smiles: str
    # 2048-bit molecular fingerprint stored as a fixed-size list of booleans
    fingerprint: Vector(2048, pa.bool_())


uri = "./out/strict_fragments.lancedb"
db = lancedb.connect(uri)

# Create (or replace) an empty table from the Pydantic schema
tbl = db.create_table(
    "molecules",
    schema=Molecule,
    mode="overwrite",
)

It doesn't work with any flavor of integer either. It would be amazing to get this working, because boolean fingerprints take up much less room on disk. These fingerprints are often very sparse, so it would be even better if we could use a sparse format like SparseTensor in Arrow. In addition, Jaccard is the preferred distance metric for these fingerprints; failing that, cosine or inner product would be preferable to Euclidean.
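
For anyone following along, here is a minimal sketch of the Jaccard (Tanimoto) distance on binary fingerprints, computed once from a sparse "set of on-bit indices" form and once from 0/1 float vectors using only dot products. This is plain Python and not part of the LanceDB or Lance API; the function names (`on_bits`, `jaccard_distance`, `jaccard_from_dot`) and the toy 8-bit fingerprints are made up for illustration.

```python
from typing import Sequence, Set


def on_bits(fingerprint: Sequence[bool]) -> Set[int]:
    """Sparse form (hypothetical helper): indices of the bits that are set."""
    return {i for i, bit in enumerate(fingerprint) if bit}


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 - |A & B| / |A | B|; two empty fingerprints are treated as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def jaccard_from_dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Same distance for 0/1 vectors, written with dot products only.

    For binary vectors, x.y is the intersection size and
    x.x + y.y - x.y is the union size.
    """
    dot = sum(p * q for p, q in zip(x, y))
    union = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return 0.0 if union == 0 else 1.0 - dot / union


# Toy 8-bit fingerprints; real ECFP-style fingerprints would be e.g. 2048 bits.
fp_a = [True, False, True, True, False, False, False, True]
fp_b = [True, False, False, True, False, True, False, True]

d_sparse = jaccard_distance(on_bits(fp_a), on_bits(fp_b))
d_dense = jaccard_from_dot([float(b) for b in fp_a], [float(b) for b in fp_b])
print(d_sparse, d_dense)
```

The second function is why inner product or cosine on float-cast fingerprints is a tolerable stand-in: both are built from the same intersection count, they just normalize it differently from Jaccard.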

koaning commented 3 months ago

I may also have a use-case for sparse vectors in general. It's a slightly different use-case than what @miretchin mentions, but for some data deduplication tasks it may also be useful to have sparse data going in.

I understand if this is out of scope for the project, just figured I'd mention it.

albertlockett commented 3 months ago

I think sparse vectors, non-float vector types (binary & integer), and additional distance metrics like Jaccard are features we'd be interested in supporting with the Lance format!

We might not have time to get these implemented in the near future, so for now I've added the "help wanted" label.