voxel51 / fiftyone-brain

Open source AI/ML capabilities for the FiftyOne ecosystem
https://fiftyone.ai/brain.html
Apache License 2.0

Add support for missing data points when calling use_view() #113

Closed: brimoor closed this issue 1 year ago

brimoor commented 2 years ago

Adds an optional allow_missing=True flag to:

results.use_view(view, allow_missing=True)

that allows the method to gracefully continue in cases where view contains data points that the results index does not have data for. The examples below illustrate why this is useful.

In practice, allow_missing=True will often be needed because compute_similarity() and compute_visualization() results are not automatically updated. So, if new samples are added to a dataset and a visualization is then loaded for it, the missing embedding data for those new points must be handled gracefully.

Any in-App visualization should likely have an alert when missing_size > 0:

This visualization does not include ${missing_size} samples/patches in your dataset
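The bookkeeping behind such an alert reduces to set arithmetic on IDs. Here is a minimal, self-contained sketch (a hypothetical helper, not the actual brain results or App implementation): given the IDs stored in the results index and the IDs in the current view, the three sizes reported in the examples below fall out directly.

```python
# Hypothetical sketch of the ID bookkeeping behind the missing-data alert;
# the real logic lives inside the brain results class, not in user code
def describe_view(index_ids, view_ids):
    """Returns (total_index_size, index_size, missing_size) for a view."""
    index_ids = set(index_ids)
    view_ids = set(view_ids)

    total_index_size = len(index_ids)  # all embeddings in the index
    index_size = len(index_ids & view_ids)  # embeddings usable for this view
    missing_size = len(view_ids - index_ids)  # view IDs with no embedding

    return total_index_size, index_size, missing_size


total, available, missing = describe_view(
    index_ids=["s1", "s2", "s3"],
    view_ids=["s2", "s3", "s4", "s5"],
)
print(total, available, missing)  # 3 2 2

if missing > 0:
    print(
        "This visualization does not include %d samples/patches "
        "in your dataset" % missing
    )
```

Note that `total_index_size >= index_size` always holds, but `missing_size` is computed against the view, so it can exceed the index size entirely, as in the patches example below.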

Images example

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.brain.internal.models as fbm  # a faster model than the default :)
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Generate visualization for images with more than 10 ground truth objects
crowded_view = dataset.match(F("ground_truth.detections").length() > 10)
model = fbm.load_model("simple-resnet-cifar10")
results = fob.compute_visualization(crowded_view, model=model)

# Grab a random set of data
other_view = dataset.take(100, seed=51)

# Even though our results only contain points for `crowded_view`, let's go
# ahead and ask it to use a view that contains other stuff
results.use_view(other_view, allow_missing=True)

# This is the total number of embeddings in the index
print(results.total_index_size)  # 39

# This is the number of available embeddings that exist in `other_view`. Any
# operations that we do with `results` can only work with this data
print(results.index_size)  # 25

# This is the number of samples in `other_view` that we don't have data for
print(results.missing_size)  # 75

# We asked `results` to use `other_view`, so let's pull some labels from
# `other_view`, ignoring the fact that missing_size > 0
labels = other_view.values(F("ground_truth.detections").length())
print(len(labels))  # 100

#
# No problem! `visualize()` automatically detects that it should filter the labels for us
#
# The result here is that the plot only contains images with more than 10
# objects, which is expected because we only generated a visualization of
# those samples in the first place
#
plot = results.visualize(labels=labels)
plot.show()

# Now let's be more idiomatic and ask `results` to give us a view that
# contains only the patches that we can actually visualize
available_view = results.view
available_labels = available_view.values(F("ground_truth.detections").length())
print(len(available_labels))  # 25

# No problem here either! `visualize()` detects that you already filtered the labels
plot = results.visualize(labels=available_labels)
plot.show()

Object patches example

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.brain.internal.models as fbm  # a faster model than the default :)
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Generate visualization for `person` patches
person_view = dataset.filter_labels("ground_truth", F("label") == "person")
model = fbm.load_model("simple-resnet-cifar10")
results = fob.compute_visualization(
    person_view, patches_field="ground_truth", model=model
)

# Grab a random set of samples
other_view = dataset.take(100, seed=51)

# Even though our results only contain points for `person` patches, let's go
# ahead and ask it to use a view that contains other stuff
results.use_view(other_view, allow_missing=True)

# This is the total number of embeddings in the index
print(results.total_index_size)  # 378

# This is the number of available embeddings that exist in `other_view`. Any
# operations that we do with `results` can only work with this data
print(results.index_size)  # 190

# This is the number of patches in `other_view` that we don't have data for
print(results.missing_size)  # 510

# We asked `results` to use `other_view`, so let's pull some labels from
# `other_view`, ignoring the fact that missing_size > 0
labels = other_view.values("ground_truth.detections.label")
print(sum(len(l) for l in labels))  # 700

#
# No problem! `visualize()` automatically detects that it should filter the labels for us
#
# The result here is that the plot only contains `person` labels, which is
# expected because we only generated a visualization of `person` patches in
# the first place
#
plot = results.visualize(labels=labels)
plot.show()

# Now let's be more idiomatic and ask `results` to give us a view that
# contains only the patches that we can actually visualize
available_view = results.view
available_labels = available_view.values("ground_truth.detections.label")
print(sum(len(l) for l in available_labels))  # 190

# No problem here either! `visualize()` detects that you already filtered the labels
plot = results.visualize(labels=available_labels)
plot.show()

brimoor commented 1 year ago

@ehofesmann yeah that would be nice too. Unfortunately it's not possible with the current implementation because the sample/label IDs at the time the embeddings/points are computed are not currently saved in the brain results. We should definitely change that so that, going forward, all new brain results can gracefully handle deleted data.

brimoor commented 1 year ago

Well, I guess the specific thing you mentioned is possible (deleting data between when the results are loaded and when you call a method on the results object), but the more general problem of data being modified between the computation and loading of the results is the thing I'd like to solve 🤗
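The fix described above could be sketched as follows: persist the sample/label IDs alongside the computed points at computation time, then intersect against the IDs that still exist when the results are loaded. This is a hypothetical illustration (class and method names are illustrative, not the actual brain internals):

```python
# Hypothetical sketch: store IDs with the computed points so that results
# loaded later can detect, and drop, points whose samples were deleted
class StoredResults:
    def __init__(self, ids, points):
        # IDs captured at computation time, parallel to `points`
        self.ids = list(ids)
        self.points = list(points)

    def reconcile(self, current_ids):
        """Drops points whose IDs no longer exist; returns number dropped."""
        current = set(current_ids)
        keep = [i for i, _id in enumerate(self.ids) if _id in current]
        dropped = len(self.ids) - len(keep)
        self.ids = [self.ids[i] for i in keep]
        self.points = [self.points[i] for i in keep]
        return dropped


results = StoredResults(["a", "b", "c"], [(0, 1), (2, 3), (4, 5)])
dropped = results.reconcile(current_ids=["a", "c"])  # "b" was deleted
print(dropped, results.ids)  # 1 ['a', 'c']
```

Reconciling at load time would cover both cases discussed above: data deleted between computation and loading, and data deleted between loading and a method call (by reconciling again before the call).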