Closed Mr-Milk closed 1 month ago
Thanks for writing this up.
A few initial thoughts:
WSIData from LazySlide makes sense, especially if there is considerable engineering on the data side.
SpatialData to really see where the bottleneck is. Why would this solution "[..]not fully integrate with the SpatialData ecosystem"? Is it 'just' the lack of a greedy DataTree made from the image data by default? If so, it doesn't seem too concerning.
labels slot. The docs say values can only be integers. What's the drawback of using tables?
SpatialData would indeed be preferable, we shouldn't forget to write a few .to_{x} functions to quickly convert data to e.g. torch geometric data. I'm curious whether storing neighbor graphs in obsp is as performant as keeping everything natively in torch geometric format.

Thanks for sharing your thoughts, I agree with most of the points.
labels, the reason why I don't use tables is that the tables slot is still not lazy at the current stage; if the feature matrix is big, it will take a long time to initialize a data object. I'm considering the images slot for easy access to the data.
WSIData into datasets instead of implementing class methods; this way we will have more extensibility for future unseen use cases. I'm considering adopting Hugging Face datasets, which can be easily adapted to different deep-learning frameworks (tf, torch, jax, ...) by calling .with_format('tf'), for example.
However, if we think of the feature matrix for WSI as gene expression in spatial omics, it actually makes more sense to save it in tables.
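To make the "datasets instead of class methods" idea concrete, here is a minimal sketch of a standalone map-style dataset wrapper. The class name TileFeatureDataset and its fields are hypothetical and not part of WSIData or LazySlide; the only assumption is that per-tile features and level-0 tile coordinates can be pulled out of the data object as two aligned sequences. It follows the __len__ / __getitem__ protocol that map-style datasets use.

```python
class TileFeatureDataset:
    """Map-style dataset over per-tile features (hypothetical wrapper)."""

    def __init__(self, features, coords):
        # features[i] is the feature vector of tile i,
        # coords[i] is its (x, y) origin in level-0 image coordinates.
        if len(features) != len(coords):
            raise ValueError("features and coords must have the same length")
        self.features = features
        self.coords = coords

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x, y = self.coords[idx]
        return {"feature": self.features[idx], "x": x, "y": y}


# Toy stand-in for data that would be extracted from a WSIData object.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
coords = [(0, 0), (1024, 0), (2048, 0)]
dataset = TileFeatureDataset(features, coords)
```

Keeping the wrapper outside the data class means a new target format can be added without touching WSIData itself.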
The current design of the data structure to represent WSI contains several drawbacks. This issue is to discuss if there are better solutions or if the current one is already acceptable.
Another thing to discuss here is whether we should ship WSIData as a separate package, think of anndata and scanpy.
Design
We use a WSIData class that abstracts two things:
SpatialData
Image
WSIData interacts with the WSI using the reader. However, the WSI can be loaded as a DataTree, which is the pyramid structure in xarray and is lazy, but this raises a problem when saving the SpatialData on disk: the DataTree will need to make a copy of the uncompressed slide data, which could result in excessive disk space usage. Currently, WSIData will not create a DataTree representation of the slide file by default.
wsi_thumbnails, which is small on disk and in memory, is attached to the SpatialData object when the slide is read for the first time.
Pros
Cons
SpatialData file
Tissue contours
Attach in the shapes slot; the tissue holes should be passed to the holes parameter when constructing a Polygon.
Tiles
Attach in the shapes slot.
Tiles are saved as polygons; the tiles table should also record the x and y in image coordinates at level 0.
A tile_spec should be recorded.
Feature Matrix
This is always a 2D array; attach in the images slot.
Tile neighbors sparse matrix
Save in the obsp slot of an AnnData object in the tables slot of SpatialData.
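
The tile layout and neighbor matrix described above can be sketched in plain Python. This is an illustration under stated assumptions, not the LazySlide implementation: the TileSpec fields, the 4-connected neighborhood, and the helper names are all hypothetical. The neighbor graph is returned as COO-style (row, col) index lists, which is the form a sparse matrix constructor such as scipy.sparse.coo_matrix accepts before the result is placed in obsp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileSpec:
    """Hypothetical stand-in for the tile_spec record; field names are assumptions."""
    tile_px: int       # tile edge length in pixels at the level tiles are read from
    stride_px: int     # step between tile origins, in the same units as tile_px
    level: int         # pyramid level the tiles are read from
    downsample: float  # scale factor from that level to level 0


def tile_origins(width0, height0, spec):
    """Return tile top-left corners (x, y) in level-0 coordinates, row-major."""
    size0 = int(spec.tile_px * spec.downsample)
    step0 = int(spec.stride_px * spec.downsample)
    xs = range(0, width0 - size0 + 1, step0)
    ys = range(0, height0 - size0 + 1, step0)
    return [(x, y) for y in ys for x in xs]


def neighbor_pairs(origins, step0):
    """Return 4-connected neighbors as COO-style (rows, cols) index lists."""
    index = {xy: i for i, xy in enumerate(origins)}
    rows, cols = [], []
    for (x, y), i in index.items():
        for dx, dy in ((step0, 0), (-step0, 0), (0, step0), (0, -step0)):
            j = index.get((x + dx, y + dy))
            if j is not None:
                rows.append(i)
                cols.append(j)
    return rows, cols


# Toy slide: 4096 x 2048 pixels at level 0, 256 px tiles read at a 4x downsampled level.
spec = TileSpec(tile_px=256, stride_px=256, level=1, downsample=4.0)
origins = tile_origins(width0=4096, height0=2048, spec=spec)
step0 = int(spec.stride_px * spec.downsample)
rows, cols = neighbor_pairs(origins, step0)
```

Recording x and y at level 0 together with the tile spec lets a consumer recover each tile's footprint at any pyramid level without re-reading the slide.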