jeffchuber opened this issue 1 year ago
Hi @jeffchuber, apologies for the delay here! 😅
We've used Chroma internally for a few projects and found it very useful, so we'd be happy to support a FiftyOne <> Chroma integration!
A vector search integration in FiftyOne is defined by three classes:
```python
from fiftyone.brain.similarity import SimilarityConfig, Similarity, SimilarityIndex


class ChromaSimilarityConfig(SimilarityConfig):
    """Defines the available config parameters for a Chroma similarity index."""


class ChromaSimilarity(Similarity):
    """Creates a ChromaSimilarityIndex from a ChromaSimilarityConfig."""


class ChromaSimilarityIndex(SimilarityIndex):
    """Defines how to interact with a Chroma index."""
```
Suppose the above classes are available in a `chroma_fiftyone` package (we'd be happy to include these classes by default once they exist and work); then you could add Chroma as a similarity backend to FiftyOne by adding this to your `~/.fiftyone/brain_config.json`:
```json
{
    "similarity_backends": {
        "chroma": {
            "config_cls": "chroma_fiftyone.ChromaSimilarityConfig"
            // other optional parameters like API key, URI, etc. can go here too
        }
    }
}
```
Then Chroma can be used exactly like you see here by substituting `backend="chroma"` instead of `backend="pinecone"` everywhere 😄
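For example, assuming the backend has been registered as above, the usual similarity workflow would look something like this (the `brain_key` name here is arbitrary):

```python
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Same workflow as the Pinecone example, with the backend swapped out
index = fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    backend="chroma",
    brain_key="chroma_index",
)

view = dataset.sort_by_similarity(
    "kites high in the air", k=10, brain_key="chroma_index"
)
```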
You can see the definition of FiftyOne's similarity interface by inspecting the `fiftyone.brain.similarity` module locally after you `pip install fiftyone`.
You can also see how some of the other backends are implemented by checking out `fiftyone.brain.internal.core.{pinecone|qdrant|...}`.
@brimoor awesome! We're totally constrained on bandwidth for the next few months
Calling all Chroma users who are reading this issue: our team would be happy to support you in building a FiftyOne <> Chroma integration per the guidance I gave above! 🤗
Hey @brimoor, how would someone go about testing/integrating this, since the brain package is not part of the open source package?
Hi @timmermansjoy 👋
There's no requirement for vector search integrations to be under the `fiftyone.brain` namespace; for testing/development purposes, the instructions above will work regardless of where the new classes live.
If you were to build a Chroma backend, feel free to add it to `fiftyone.utils.chroma` in this repository! 🤗
FYI you can also use other vector search backends like Pinecone/Qdrant as implementation references like so:
```python
import fiftyone.brain.internal.core.pinecone as fbp; print(fbp.__file__)
import fiftyone.brain.internal.core.qdrant as fbq; print(fbq.__file__)
```
Hello @brimoor, do you still need this implemented? I'm a (new) fan of the platform and have used Chroma pretty extensively in projects, both personally and professionally. I'd be happy to give it a shot next week or the week after. @timmermansjoy, are you working on this right now?
@BigCoop welcome to FiftyOne!
To my knowledge @timmermansjoy hasn't had a chance to work on this yet, so if you want to give this integration a shot, that would be fantastic! If you run into any issues, feel free to reach out here and I'll loop in some engineers who can help you 🤗
@brimoor seems pretty straightforward after looking at the reference implementations for Pinecone and Qdrant and laying some code down. Is there any automated testing I can run to see if it works, or should I just run it locally and fiddle with it before doing a PR and getting some feedback? Thanks!
Ah, the automated tests for the other backends are in a different repository that's private, so I've just copied over a couple tests here:
```python
import random

import numpy as np

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz


def test_image_similarity_backend(backend):
    dataset = foz.load_zoo_dataset("quickstart")

    prompt = "kites high in the air"
    brain_key = "clip_" + backend

    index = fob.compute_similarity(
        dataset,
        model="clip-vit-base32-torch",
        metric="euclidean",
        embeddings=False,
        backend=backend,
        brain_key=brain_key,
    )

    embeddings, sample_ids, _ = index.compute_embeddings(dataset)

    index.add_to_index(embeddings, sample_ids)
    assert index.total_index_size == 200
    assert index.index_size == 200
    assert index.missing_size is None

    sim_view = dataset.sort_by_similarity(prompt, k=10, brain_key=brain_key)
    assert len(sim_view) == 10

    del index
    dataset.clear_cache()

    print(dataset.get_brain_info(brain_key))

    index = dataset.load_brain_results(brain_key)
    assert index.total_index_size == 200

    embeddings2, sample_ids2, _ = index.get_embeddings()
    assert embeddings2.shape == (200, 512)
    assert sample_ids2.shape == (200,)

    # `ids` was undefined in the original snippet; pick 100 random sample IDs
    ids = random.sample(list(sample_ids2), 100)

    embeddings2, sample_ids2, _ = index.get_embeddings(sample_ids=ids)
    assert embeddings2.shape == (100, 512)
    assert sample_ids2.shape == (100,)

    index.remove_from_index(sample_ids=ids)
    assert index.total_index_size == 100

    index.cleanup()

    dataset.delete_brain_run(brain_key)
    dataset.delete()


def test_patch_similarity_backend(backend):
    dataset = foz.load_zoo_dataset("quickstart")
    view = dataset.to_patches("ground_truth")

    prompt = "cute puppies"
    brain_key = "gt_clip_" + backend

    index = fob.compute_similarity(
        dataset,
        patches_field="ground_truth",
        model="clip-vit-base32-torch",
        metric="euclidean",
        embeddings=False,
        backend=backend,
        brain_key=brain_key,
    )

    embeddings, sample_ids, label_ids = index.compute_embeddings(dataset)

    index.add_to_index(embeddings, sample_ids, label_ids=label_ids)
    assert index.total_index_size == 1232
    assert index.index_size == 1232
    assert index.missing_size is None

    sim_view = view.sort_by_similarity(prompt, k=10, brain_key=brain_key)
    assert len(sim_view) == 10

    del index
    dataset.clear_cache()

    print(dataset.get_brain_info(brain_key))

    index = dataset.load_brain_results(brain_key)
    assert index.total_index_size == 1232

    embeddings2, sample_ids2, label_ids2 = index.get_embeddings()
    assert embeddings2.shape == (1232, 512)
    assert sample_ids2.shape == (1232,)
    assert label_ids2.shape == (1232,)

    # `ids` was undefined in the original snippet; pick 100 random label IDs
    ids = random.sample(list(label_ids2), 100)

    embeddings2, sample_ids2, label_ids2 = index.get_embeddings(label_ids=ids)
    assert embeddings2.shape == (100, 512)
    assert sample_ids2.shape == (100,)
    assert label_ids2.shape == (100,)

    index.remove_from_index(label_ids=ids)
    assert index.total_index_size == 1132

    index.cleanup()

    dataset.delete_brain_run(brain_key)
    dataset.delete()
```
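These aren't wired into a test runner here, so a new backend could be smoke-tested by calling them directly, e.g.:

```python
test_image_similarity_backend("chroma")
test_patch_similarity_backend("chroma")
```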
What's the `label_ids` parameter here referencing? Seems like it's internal and relevant, so I just added the four lines in for the Chroma section. Not a huge fan of writing code and not knowing what it's doing, though haha. Is it some internal ID to keep track of labels that have been applied to certain samples/images/videos?
```python
# From fiftyone.brain.internal.core.qdrant (QdrantSimilarityIndex)
def remove_from_index(
    self,
    sample_ids=None,
    label_ids=None,
    allow_missing=True,
    warn_missing=False,
    reload=True,
):
    if label_ids is not None:
        ids = label_ids
    else:
        ids = sample_ids

    qids = self._to_qdrant_ids(ids)

    if warn_missing or not allow_missing:
        response = self._retrieve_points(qids, with_vectors=False)
        existing_ids = self._to_fiftyone_ids([r.id for r in response])
        missing_ids = list(set(ids) - set(existing_ids))
        num_missing_ids = len(missing_ids)

        if num_missing_ids > 0:
            if not allow_missing:
                raise ValueError(
                    "Found %d IDs (eg %s) that do not exist in the index"
                    % (num_missing_ids, missing_ids[0])
                )

            if warn_missing:
                logger.warning(
                    "Skipping %d IDs that do not exist in the index",
                    num_missing_ids,
                )

    self._client.delete(
        collection_name=self.config.collection_name,
        points_selector=qmodels.PointIdsList(points=qids),
    )

    if reload:
        self.reload()
```
```python
# From fiftyone.brain.internal.core.pinecone (PineconeSimilarityIndex)
def remove_from_index(
    self,
    sample_ids=None,
    label_ids=None,
    allow_missing=True,
    warn_missing=False,
    reload=True,
):
    if label_ids is not None:
        ids = label_ids
    else:
        ids = sample_ids

    if not allow_missing or warn_missing:
        existing_ids = self._index.fetch(ids).vectors.keys()
        missing_ids = list(set(ids) - set(existing_ids))
        num_missing = len(missing_ids)

        if num_missing > 0:
            if not allow_missing:
                raise ValueError(
                    "Found %d IDs (eg %s) that are not present in the "
                    "index" % (num_missing, missing_ids[0])
                )

            if warn_missing:
                logger.warning(
                    "Ignoring %d IDs that are not present in the index",
                    num_missing,
                )

    self._index.delete(ids=ids)

    if reload:
        self.reload()
```
`label_ids` are used when generating search indexes for object patches rather than entire images; for example, they are used in the `test_patch_similarity_backend()` test case above.
When working with entire images, there's just one `sample_id` for each sample, which needs to be stored as metadata on the vector index so that k neighbors queries can return the sample IDs of the matching vectors.
When working with object patches, we need to store two IDs for each object: the object's `label_id` and also the `sample_id` of the sample that contains it. k neighbors queries always return the label IDs of matching vectors in this case, but the sample IDs are also stored so that methods like `remove_from_index()` can optionally be passed sample IDs rather than label IDs.
Hopefully this context + test cases above + inspecting the existing backends will help elucidate!
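To make the two-ID scheme concrete for Chroma, here's a hedged sketch of the add/remove logic using `chromadb`'s client API; everything outside that API (the function names, the `fiftyone-demo` collection, the `sample_id` metadata key) is an assumption for illustration, not part of FiftyOne or Chroma:

```python
import chromadb

# In-memory client for illustration; a real backend would read connection
# parameters from its SimilarityConfig
client = chromadb.Client()
collection = client.get_or_create_collection("fiftyone-demo")


def add_to_index(collection, embeddings, sample_ids, label_ids=None):
    # For patch indexes, label IDs key the vectors and each vector's parent
    # sample ID is stored as metadata so that both can be recovered later
    ids = list(label_ids) if label_ids is not None else list(sample_ids)
    collection.add(
        ids=ids,
        embeddings=[list(e) for e in embeddings],
        metadatas=[{"sample_id": sid} for sid in sample_ids],
    )


def remove_from_index(collection, ids, allow_missing=True):
    # Mirror the reference implementations: detect IDs missing from the
    # index before deleting
    existing_ids = set(collection.get(ids=list(ids), include=[])["ids"])
    missing_ids = set(ids) - existing_ids
    if missing_ids and not allow_missing:
        raise ValueError(
            "Found %d IDs (eg %s) that are not present in the index"
            % (len(missing_ids), next(iter(missing_ids)))
        )

    collection.delete(ids=list(ids))
```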
@brimoor Thanks so much for the breakdown! I am moving / traveling across the country today and tomorrow but I should be able to finish or almost finish this on Thursday.
@brimoor ok, I have an implementation that's passing unit tests without a config. I'm having a pretty nondescript issue where I simply don't have a config file at the path you mentioned (`~/.fiftyone/brain_config.json`) after doing a source/developer or pip install, so I'm having a bit of trouble defining anything or testing that portion of the interface. Would it be worth doing a PR or posting the code somewhere to get some feedback? I'm completely new to open source, so this could be a ridiculous question.
A `~/.fiftyone/brain_config.json` file doesn't exist by default; it's just created on an as-needed basis when you need to configure parameters for new or existing backends. In this case, you can just create one and populate it as I mentioned in https://github.com/voxel51/fiftyone/issues/3360#issuecomment-1976881064
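For example, something like this would create it, using the same contents as above:

```python
import json
import os

# Write the brain config described earlier in this thread
config_path = os.path.expanduser("~/.fiftyone/brain_config.json")
os.makedirs(os.path.dirname(config_path), exist_ok=True)

config = {
    "similarity_backends": {
        "chroma": {"config_cls": "chroma_fiftyone.ChromaSimilarityConfig"}
    }
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```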
ah okay great, thank you for the clarification @brimoor
Hi there,
Any interest in a ChromaDB integration, similar to Pinecone?
**Willingness to contribute**
The FiftyOne Community welcomes contributions! Would you or another member of your organization be willing to contribute an implementation of this feature?