voxel51 / fiftyone-brain

Open source AI/ML capabilities for the FiftyOne ecosystem
https://fiftyone.ai/brain.html
Apache License 2.0

Error handling improvements #90

Closed · brimoor closed this 2 years ago

brimoor commented 2 years ago

Best tested with https://github.com/voxel51/fiftyone/pull/1444.

Manual low-dimensional representations

compute_visualization(..., points=points) can now be used to provide your own manually computed low-dimensional representation for use in interactive embeddings plots. For example, users may compute their own embeddings and then perform their own UMAP reduction with customized parameters that we do not currently expose.

import numpy as np

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset("quickstart").clone()

# Your own manually computed 2D representation, one point per sample
points = np.random.randn(len(dataset), 2)

results = fob.compute_visualization(dataset, points=points, brain_key="manual")
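
For reference, here is a minimal sketch of the customized-UMAP workflow described above. The zoo model name and UMAP parameters are illustrative choices, and this assumes the umap-learn package is installed:

import fiftyone.zoo as foz
import fiftyone.brain as fob
import umap  # provided by the `umap-learn` package

dataset = foz.load_zoo_dataset("quickstart").clone()

# Compute your own embeddings with a model of your choice
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
embeddings = dataset.compute_embeddings(model)

# Perform your own UMAP reduction with customized parameters
points = umap.UMAP(n_neighbors=5, min_dist=0.5).fit_transform(embeddings)

# Provide the manual low-dimensional representation to the Brain
results = fob.compute_visualization(dataset, points=points, brain_key="custom_umap")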

Ensuring requirements

Introduces a BrainMethod.ensure_requirements() method that is called prior to any expensive computations to verify that the necessary packages are installed.

This is currently only relevant for compute_visualization() when using the UMAP backend (the default). Previously, the embeddings would be computed only for an error to be raised if UMAP was not installed. Now, the error is raised immediately.
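
For illustration, here is a minimal sketch of this pattern (the class and method bodies are hypothetical, not the actual fiftyone-brain internals):

import importlib.util

class VisualizationMethodSketch(object):
    """Hypothetical sketch of a BrainMethod with an upfront requirements check."""

    def ensure_requirements(self):
        # Fail fast if `umap-learn` is missing, *before* any embeddings
        # are computed
        if importlib.util.find_spec("umap") is None:
            raise ImportError(
                "You must install the `umap-learn` package to use the "
                "UMAP backend"
            )

    def run(self, samples):
        self.ensure_requirements()  # runs prior to expensive computations
        ...  # compute embeddings, then the UMAP reduction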

Graceful handling of missing embeddings

Updates the default behavior of compute_similarity() and compute_visualization() to replace any uncomputable embeddings with zero vectors. Previously, an error would be raised, but only after attempting to compute all embeddings.

Now, informative warnings are printed if embeddings could not be computed, but the user still gets results that they can work with. The idea here is that typical errors are sparse, and it is better to give the user a result with 99% good data and 1% dummy data so they can get their work done than to require that everything be computable before they can get anything to work with. Also, when viewing embeddings visualization plots, the lasso is a convenient way to isolate the broken samples, since they will likely be separate from any "real data" clusters.

The user can pass skip_failures=False to insist that all embeddings must be computable.
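
For example, a minimal snippet (assuming dataset as above; the brain key is arbitrary):

import fiftyone.brain as fob

# Raises an error if any embedding cannot be computed
results = fob.compute_visualization(
    dataset, brain_key="strict_viz", skip_failures=False
)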

Example graceful handling of bad data:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset("quickstart").clone()

dataset.set_values("validity", ["good"] * len(dataset))

# Give 50 samples nonexistent images so embedding computation will fail
bad_view = dataset.limit(50)
bad_view.set_values("filepath", ["/non/existent.png"] * len(bad_view))
bad_view.set_values("validity", ["bad"] * len(bad_view))

# Warnings are printed but results are still returned
# Bad data is clearly visible
results = fob.compute_visualization(dataset, brain_key="img_viz")
plot = results.visualize(labels="validity")
plot.show()

# Warnings are printed but results are still returned
# Bad data is clearly visible
results = fob.compute_visualization(dataset, patches_field="ground_truth", brain_key="gt_viz")
plot = results.visualize()
plot.show()

# Warnings are printed but results are still returned
fob.compute_similarity(dataset, brain_key="img_sim")
brimoor commented 2 years ago

@benjaminpkane it looks like the tests are currently failing because the latest eta:develop isn't being used. Can you address this?

benjaminpkane commented 2 years ago

Not sure about the tests. They pass locally. Merging.