voxel51 / fiftyone

Refine high-quality datasets and visual AI models
https://fiftyone.ai
Apache License 2.0

[FR] Chroma integration #3360

Open jeffchuber opened 1 year ago

jeffchuber commented 1 year ago

Hi there,

Any interest in a ChromaDB integration, similar to Pinecone?

Willingness to contribute

The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?

brimoor commented 8 months ago

Hi @jeffchuber, apologies for the delay here! 😅

We've used Chroma internally for a few projects and found it very useful, so we'd be happy to support a FiftyOne <> Chroma integration!

A vector search integration in FiftyOne is defined by three classes:

from fiftyone.brain.similarity import SimilarityConfig, Similarity, SimilarityIndex

class ChromaSimilarityConfig(SimilarityConfig):
    """Defines the available config parameters for a Chroma similarity index."""

class ChromaSimilarity(Similarity):
    """Creates a ChromaSimilarityIndex from a ChromaSimilarityConfig."""

class ChromaSimilarityIndex(SimilarityIndex):
    """Defines how to interact with a Chroma index."""

Suppose the above classes are available in a chroma_fiftyone package (we'd be happy to include these classes by default once they exist and work); then you could add Chroma as a similarity backend to FiftyOne by adding this to your ~/.fiftyone/brain_config.json:

{
    "similarity_backends": {
        "chroma": {
            "config_cls": "chroma_fiftyone.ChromaSimilarityIndex",
            # other optional parameters like API key, URI, etc can go here too
        },
    }
}

Then Chroma can be used exactly like you see here by substituting backend="chroma" instead of backend="pinecone" everywhere 😄
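
For instance, a hypothetical session once such a backend is registered (these are the same calls used with the other backends):

import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# build a Chroma-backed similarity index; backend="chroma" assumes the
# brain_config.json registration above
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    backend="chroma",
    brain_key="chroma_index",
)

# query it like any other similarity index
view = dataset.sort_by_similarity("kites high in the air", k=10, brain_key="chroma_index")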

brimoor commented 8 months ago

You can see the definition of FiftyOne's similarity interface by inspecting the fiftyone.brain.similarity module locally after you pip install fiftyone.
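
For example, to locate the module on disk:

import fiftyone.brain.similarity as fbs; print(fbs.__file__)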

You can also see how some of the other backends are implemented by checking out fiftyone.brain.internal.core.{pinecone|qdrant|...}.

jeffchuber commented 8 months ago

@brimoor awesome! We're totally constrained on bandwidth for the next few months, though.

brimoor commented 8 months ago

Calling all Chroma users who are reading this issue: our team would be happy to support you in building a FiftyOne <> Chroma integration per the guidance I gave above! 🤗

timmermansjoy commented 5 months ago

Hey @brimoor, how would someone go about testing/integrating this, since the brain package is not part of the open source package?

brimoor commented 5 months ago

Hi @timmermansjoy 👋

There's no requirement for vector search integrations to be under the fiftyone.brain namespace; for testing/development purposes, the instructions above will work regardless of where the new classes live.

If you were to build a Chroma backend, feel free to add it to fiftyone.utils.chroma in this repository! 🤗

FYI you can also use other vector search backends like Pinecone/Qdrant as implementation references like so:

import fiftyone.brain.internal.core.pinecone as fbp; print(fbp.__file__)
import fiftyone.brain.internal.core.qdrant as fbq; print(fbq.__file__)

BigCoop commented 5 months ago

Hello @brimoor, do you still need this implemented? I'm a (new) fan of the platform and have used Chroma pretty extensively in projects, both personally and professionally. I'd be happy to give it a shot next week or the week after. @timmermansjoy, are you working on this right now?

brimoor commented 5 months ago

@BigCoop welcome to FiftyOne!

To my knowledge @timmermansjoy hasn't had a chance to work on this yet, so if you want to give this integration a shot, that would be fantastic! If you run into any issues, feel free to reach out here and I'll loop in some engineers who can help you 🤗

BigCoop commented 5 months ago

@brimoor seems pretty straightforward after looking at the reference implementations for Pinecone and Qdrant and laying some code down. Is there any automated testing I can run to see if it works, or should I just run it locally and fiddle with it before doing a PR and getting some feedback? Thanks!

brimoor commented 5 months ago

Ah, the automated tests for the other backends are in a different repository that's private, so I've just copied over a couple tests here:

import random
import numpy as np

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

def test_image_similarity_backend(backend):
    dataset = foz.load_zoo_dataset("quickstart")

    prompt = "kites high in the air"
    brain_key = "clip_" + backend

    index = fob.compute_similarity(
        dataset,
        model="clip-vit-base32-torch",
        metric="euclidean",
        embeddings=False,
        backend=backend,
        brain_key=brain_key,
    )

    embeddings, sample_ids, _ = index.compute_embeddings(dataset)

    index.add_to_index(embeddings, sample_ids)
    assert index.total_index_size == 200
    assert index.index_size == 200
    assert index.missing_size is None

    sim_view = dataset.sort_by_similarity(prompt, k=10, brain_key=brain_key)
    assert len(sim_view) == 10

    del index
    dataset.clear_cache()

    print(dataset.get_brain_info(brain_key))

    index = dataset.load_brain_results(brain_key)
    assert index.total_index_size == 200

    embeddings2, sample_ids2, _ = index.get_embeddings()
    assert embeddings2.shape == (200, 512)
    assert sample_ids2.shape == (200,)

    # select a random subset of sample IDs to test partial retrieval
    ids = random.sample(list(sample_ids), 100)

    embeddings2, sample_ids2, _ = index.get_embeddings(sample_ids=ids)
    assert embeddings2.shape == (100, 512)
    assert sample_ids2.shape == (100,)

    index.remove_from_index(sample_ids=ids)

    assert index.total_index_size == 100

    index.cleanup()
    dataset.delete_brain_run(brain_key)

    dataset.delete()

def test_patch_similarity_backend(backend):
    dataset = foz.load_zoo_dataset("quickstart")
    view = dataset.to_patches("ground_truth")

    prompt = "cute puppies"
    brain_key = "gt_clip_" + backend

    index = fob.compute_similarity(
        dataset,
        patches_field="ground_truth",
        model="clip-vit-base32-torch",
        metric="euclidean",
        embeddings=False,
        backend=backend,
        brain_key=brain_key,
    )

    embeddings, sample_ids, label_ids = index.compute_embeddings(dataset)

    index.add_to_index(embeddings, sample_ids, label_ids=label_ids)
    assert index.total_index_size == 1232
    assert index.index_size == 1232
    assert index.missing_size is None

    sim_view = view.sort_by_similarity(prompt, k=10, brain_key=brain_key)
    assert len(sim_view) == 10

    del index
    dataset.clear_cache()

    print(dataset.get_brain_info(brain_key))

    index = dataset.load_brain_results(brain_key)
    assert index.total_index_size == 1232

    embeddings2, sample_ids2, label_ids2 = index.get_embeddings()
    assert embeddings2.shape == (1232, 512)
    assert sample_ids2.shape == (1232,)
    assert label_ids2.shape == (1232,)

    # select a random subset of label IDs to test partial retrieval
    ids = random.sample(list(label_ids), 100)

    embeddings2, sample_ids2, label_ids2 = index.get_embeddings(label_ids=ids)
    assert embeddings2.shape == (100, 512)
    assert sample_ids2.shape == (100,)
    assert label_ids2.shape == (100,)

    index.remove_from_index(label_ids=ids)

    assert index.total_index_size == 1132

    index.cleanup()
    dataset.delete_brain_run(brain_key)

    dataset.delete()
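
A new backend can then be exercised by passing its name to these functions, assuming it's been registered in your brain config:

test_image_similarity_backend("chroma")
test_patch_similarity_backend("chroma")
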
BigCoop commented 5 months ago

Question

What's the label_ids parameter here referencing? It seems like it's internal and relevant, so I just added the four lines in for the Chroma section. Not a huge fan of writing code without knowing what it's doing, though haha. Is it some internal ID used to keep track of labels that have been applied to certain samples/images/videos?

Qdrant function reference

def remove_from_index(
    self,
    sample_ids=None,
    label_ids=None,
    allow_missing=True,
    warn_missing=False,
    reload=True,
):
    if label_ids is not None:
        ids = label_ids
    else:
        ids = sample_ids

    qids = self._to_qdrant_ids(ids)

    if warn_missing or not allow_missing:
        response = self._retrieve_points(qids, with_vectors=False)

        existing_ids = self._to_fiftyone_ids([r.id for r in response])
        missing_ids = list(set(ids) - set(existing_ids))
        num_missing_ids = len(missing_ids)

        if num_missing_ids > 0:
            if not allow_missing:
                raise ValueError(
                    "Found %d IDs (eg %s) that do not exist in the index"
                    % (num_missing_ids, missing_ids[0])
                )
            if warn_missing:
                logger.warning(
                    "Skipping %d IDs that do not exist in the index",
                    num_missing_ids,
                )

    self._client.delete(
        collection_name=self.config.collection_name,
        points_selector=qmodels.PointIdsList(points=qids),
    )

    if reload:
        self.reload()

Pinecone function

def remove_from_index(
    self,
    sample_ids=None,
    label_ids=None,
    allow_missing=True,
    warn_missing=False,
    reload=True,
):
    if label_ids is not None:
        ids = label_ids
    else:
        ids = sample_ids

    if not allow_missing or warn_missing:
        existing_ids = self._index.fetch(ids).vectors.keys()
        missing_ids = set(ids) - set(existing_ids)
        num_missing = len(missing_ids)

        if num_missing > 0:
            if not allow_missing:
                raise ValueError(
                    "Found %d IDs (eg %s) that are not present in the "
                    "index" % (num_missing, missing_ids[0])
                )

            if warn_missing:
                logger.warning(
                    "Ignoring %d IDs that are not present in the index",
                    num_missing,
                )

    self._index.delete(ids=ids)

    if reload:
        self.reload()
brimoor commented 4 months ago

label_ids are used when generating search indexes for object patches rather than entire images; for example, they're used in the test_patch_similarity_backend() test case above.

When working with entire images, there's just one sample_id for each sample, which needs to be stored as metadata on the vector index so that k neighbors queries can return the sample IDs of the matching vectors.

When working with object patches, we need to store two IDs for each object: the object's label_id and also the sample_id of the sample that contains it. k neighbors queries always return the label IDs of matching vectors in this case, but the sample IDs are also stored so that methods like remove_from_index() can optionally be passed sample IDs rather than label IDs.
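
To make this concrete, here's a hypothetical sketch of how a Chroma backend might store both IDs for a patch index (the collection name is illustrative, and embeddings, sample_ids, and label_ids are as returned by compute_embeddings() in the tests above):

import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("fiftyone-patches")  # hypothetical name

# vector IDs are the label IDs; each vector also carries its sample ID as
# metadata so query results can be mapped back to FiftyOne samples
collection.add(
    ids=list(label_ids),
    embeddings=[list(e) for e in embeddings],
    metadatas=[{"sample_id": sid} for sid in sample_ids],
)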

Hopefully this context + test cases above + inspecting the existing backends will help elucidate!

BigCoop commented 4 months ago

@brimoor Thanks so much for the breakdown! I am moving / traveling across the country today and tomorrow but I should be able to finish or almost finish this on Thursday.

BigCoop commented 4 months ago

@brimoor ok, I have an implementation that's working in unit tests without a config. I'm having a pretty nondescript issue where I simply don't have a config file at the path you mentioned (~/.fiftyone/brain_config.json) after doing a source/developer or pip install, so I'm having a bit of trouble defining anything or testing that portion of the interface. Would it be worth doing a PR or posting the code somewhere to get some feedback? I'm completely new to open source, so this could be a ridiculous question.

brimoor commented 4 months ago

A ~/.fiftyone/brain_config.json file doesn't exist by default; it's just created on an as-needed basis when you need to configure parameters for new or existing backends. In this case, you can just create one and populate it as I mentioned in https://github.com/voxel51/fiftyone/issues/3360#issuecomment-1976881064

BigCoop commented 4 months ago

Ah okay, great, thank you for the clarification @brimoor

BigCoop commented 4 months ago

Made a PR and asked for some help on file placement since it seems to be preventing testing. @brimoor thanks for the help!