glue-viz / glue

Linked Data Visualizations Across Multiple Files
http://glueviz.org

Incompatible Subsets use too much memory #2405

Closed jfoster17 closed 8 months ago

jfoster17 commented 1 year ago

Describe the bug

When an image viewer is open, creating an Incompatible Subset (that is, a subset defined over attributes not present in the reference_data of the image viewer) will create an extra array that is the full size of reference_data.

To Reproduce

From the glue terminal:

import numpy as np
from glue.core import Data

# dc and session are provided by the glue terminal
im = Data(label='data1', x=np.arange(100_000_000).reshape((10000, 10000)))
catalog = Data(label='catalog', c=[1, 3, 2], d=[4, 3, 3])
dc.append(im)
dc.append(catalog)

from glue.viewers.image.qt.data_viewer import ImageViewer
viewer = session.application.new_data_viewer(ImageViewer)
viewer.add_data(im)

glue should be using about 1 GB of memory at this point. Creating a subset that can be shown:

dc.new_subset_group(subset_state=im.id['x'] > 50_000_000, label='A')

does not appreciably increase the memory usage of glue. However, creating a subset that cannot be shown:

dc.new_subset_group(subset_state=catalog.id['c'] > 2, label='B')

causes memory usage to roughly double, to ~2 GB. Each additional Incompatible Subset grows the memory used further.
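
For reference, a quick way to watch the growth from the glue terminal is to check the process RSS before and after creating another incompatible subset (this is just an illustration and assumes psutil is installed; it is not part of glue):

import os
import psutil

def rss_gb():
    # resident set size of the current glue process, in GB
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print(f"before: {rss_gb():.2f} GB")
dc.new_subset_group(subset_state=catalog.id['c'] > 2, label='C')
print(f"after:  {rss_gb():.2f} GB")  # jumps by roughly the size of reference_data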

Expected behavior

Incompatible subsets should not take up more memory than normal subsets.

Additional context

ImageSubsetArray is called when a new subset tries to show itself on an Image Viewer. Its __call__ method creates a broadcast array of np.nan that is the full size of the Image Viewer's reference_data. This array is then passed to the make_image function of mpl_scatter_density.base_image_artist at line 190 (self.set_data(array)), which causes the broadcast array to be fully materialized in memory.
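
The broadcast itself is cheap; it is the materialization triggered by set_data that hurts. A small standalone illustration in plain NumPy (not glue's actual code path):

import numpy as np

shape = (10000, 10000)                      # same size as reference_data above
view = np.broadcast_to(np.nan, shape)       # zero-copy view of a single nan
print(view.nbytes / 1e9, "GB (virtual)")    # ~0.8 GB reported, almost nothing allocated

full = np.array(view)                       # forcing a real array, as set_data effectively does
print(full.nbytes / 1e9, "GB (allocated)")  # ~0.8 GB of actual memory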

I think the problem would essentially be solved by returning just the portion of the large nan array defined by bounds (when we don't trigger the IncompatibleAttribute case, we get a mask with the shape of bounds), but I confess I don't really understand why we are making a potentially giant nan array here in the first place, so I wanted to open a discussion before trying to fix this.
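
To make the idea concrete, here is a rough sketch of the kind of return value I have in mind for the incompatible case; the bounds format (a (lo, hi, n) tuple per axis) and where this would plug into ImageSubsetArray.__call__ are assumptions on my part, not verified against the current source:

import numpy as np

def incompatible_fill(bounds):
    # return an all-nan array shaped like the requested sub-region only,
    # instead of broadcasting nan to the full reference_data shape
    shape = tuple(n for (lo, hi, n) in bounds)
    return np.full(shape, np.nan)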