scverse / squidpy

Spatial Single Cell Analysis in Python
https://squidpy.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

sparse representation of list of coords per observation #312

Closed hspitzer closed 3 years ago

hspitzer commented 3 years ago

Type of the feature

Description

We can extract the centroids of each segmented object in a Visium spot. This yields a variable-length list of coordinates per spot, currently saved in .obsm as a pandas DataFrame whose cells contain lists. This representation could be improved by using a sparse matrix of shape obs x segmented objects.

This representation could then be generalised to any case where we have sub-spot or sub-cellular information that we'd like to store and access.
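
For concreteness, a minimal sketch of the list-per-spot layout described above. The data, the spot names and the column name `segmentation_centroid` are assumptions made up for illustration; they are not taken from squidpy's actual output.

```python
import pandas as pd

# Hypothetical data: 3 spots with 2, 0 and 3 segmented objects respectively.
centroids = [
    [(10.0, 12.5), (11.0, 14.0)],
    [],
    [(30.0, 31.0), (32.5, 29.0), (33.0, 33.0)],
]
obs_names = ["spot_0", "spot_1", "spot_2"]

# One object-dtype column holding a variable-length list of (x, y) tuples per spot.
# The column name is an assumption for this sketch.
centroid_col = pd.Series(centroids, index=obs_names, dtype=object)
df = centroid_col.to_frame(name="segmentation_centroid")

# Number of segmented objects per spot, recovered from the list lengths.
n_objects = df["segmentation_centroid"].map(len)
print(n_objects.tolist())
```

A frame like this could be stored under a key in `.obsm`, but every access has to go through Python lists, which is what motivates a more compact layout.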

michalk8 commented 3 years ago

I will start working on this next week; I still have to think about how best to approach it.

ivirshup commented 3 years ago

@michalk8, spatialpandas may be a good project to look into for this.

michalk8 commented 3 years ago

Right now, I can't find a clean solution to this:

  1. scipy sparse matrices are 2D only, so I can't have a tensor of shape (n_obs, max_segments, 2) (the last dimension is for the x, y coordinates)
    • sparse matrices do not support custom dtypes, so I can't throw the coords in a dataclass/namedtuple
  2. sparse.{COO,DOK} can have 3 dims and I can pass a custom class if I want, but there are 2 issues
    • anndata doesn't have a registered dispatcher (minor issue)
    • can't specify default values to be instances of my type
  3. spatialpandas: way too heavy dependencies (e.g. pyarrow), not really worth it
  4. just having a dataframe of shape n_obs, max_segments
    • if not sparse, wastes a lot of space (custom sparse dtypes afaik not possible)
  5. custom impl.: most likely will not match the performance of pandas/jagged numpy arrays (though I haven't tried)
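
To illustrate the 2D-only constraint from point 1, here is a rough sketch of one possible workaround: one scipy CSR matrix per coordinate axis, each of shape (n_obs, max_segments). This is only an illustration under assumed toy data, not something implemented in squidpy, and the variable names are made up.

```python
import numpy as np
from scipy import sparse

# Hypothetical per-spot centroids (x, y); the second spot has no objects.
centroids = [
    [(10.0, 12.5), (11.0, 14.0)],
    [],
    [(30.0, 31.0), (32.5, 29.0), (33.0, 33.0)],
]
n_obs = len(centroids)
max_segments = max(len(spot) for spot in centroids)

# Collect COO-style triplets: row = spot index, col = object index within the spot.
rows, cols, xs, ys = [], [], [], []
for i, spot in enumerate(centroids):
    for j, (x, y) in enumerate(spot):
        rows.append(i)
        cols.append(j)
        xs.append(x)
        ys.append(y)

# scipy sparse matrices are 2D, so x and y go into two separate matrices.
shape = (n_obs, max_segments)
X = sparse.csr_matrix((xs, (rows, cols)), shape=shape)
Y = sparse.csr_matrix((ys, (rows, cols)), shape=shape)

# Stored entries per row give the number of objects per spot.
n_objects = np.diff(X.indptr)
print(n_objects.tolist())
```

Note that the implicit fill value is 0, so once densified a real coordinate of 0 cannot be told apart from a missing object. That is the same default-value problem mentioned under point 2.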

As far as I'm concerned, at least for centroid positions, what we currently have is sufficiently flexible and efficient. As for other sub-spot info, I think it will most likely be just scalars (no 2D coordinates), so a sparse pandas dataframe would be the easiest way to go (although cluster info, e.g. from Tangram, might require more work as far as the sparse values go). In any case, I'd close this for now and revisit it in the future if needed (feel free to re-open if you disagree).
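
For the scalar case, a sparse pandas dataframe could look roughly like the sketch below. The feature values and spot names are invented for illustration, and using NaN as the fill value is an assumption, chosen so that "no object" stays distinct from a genuine 0.

```python
import numpy as np
import pandas as pd

# Hypothetical scalar feature per segmented object (e.g. an area), padded with NaN
# up to the maximum number of objects per spot.
dense = pd.DataFrame(
    [
        [52.0, 40.5, np.nan],
        [np.nan, np.nan, np.nan],
        [61.0, 58.0, 47.5],
    ],
    index=["spot_0", "spot_1", "spot_2"],
)

# Convert every column to a sparse dtype; NaN entries are not stored explicitly.
sparse_df = dense.astype(pd.SparseDtype("float64", fill_value=np.nan))

# Fraction of explicitly stored values (5 of 9 cells in this toy example).
density = sparse_df.sparse.density
print(sparse_df.dtypes)
```

The savings only matter when most spots have far fewer objects than the maximum; for nearly full tables a dense frame is simpler.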