Do you observe this behavior for all samples, or just TENX95?
Thanks for responding! I tested TENX105 (which is smaller in size) and there it worked fine, so I think the problem is related to the size of the data. Perhaps this should be viewed as a SpatialData problem rather than one for HEST. If you know of a way to set the chunking behaviour for this type of data, that would be nice, but maybe this is something SpatialData should tackle on their side when they write zarr.
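For anyone hitting the same wall, this is roughly the kind of rechunking I mean. A minimal sketch (untested), assuming `st` is the loaded HESTData object, that `to_spatial_data()` returns a dask-backed SpatialData object, and that the image key is `'fullres'` (a guess; inspect `sdata.images` for the real name):

```python
# Rough sketch: shrink the dask/zarr chunks of the image before writing so
# each compressed buffer stays well under the ~2 GiB codec limit.
sdata = st.to_spatial_data()

# 'fullres' is a hypothetical key; check sdata.images for the actual one.
img = sdata.images['fullres']
img.data = img.data.rechunk((3, 4096, 4096))  # (channels, y, x)

sdata.write('TENX95.zarr')
```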
In the meantime, I have decided to work with the hest library itself to retrieve the cell masks. If it's okay, I'll post the code here for future users showing how to use the library to get segmentation masks.
Read the files:

```python
from hest import iter_hest

data_dir = ...  # directory containing the downloaded HEST data
idx = 'TENX95'

for st in iter_hest(data_dir, id_list=[idx], load_transcripts=True):
    ...
```
Or a manual version:

```python
from hest.HESTData import read_HESTData

root_dir = ...
idx = 'TENX95'

st = read_HESTData(
    adata_path=f'{root_dir}/st/{idx}.h5ad',
    img=f'{root_dir}/wsis/{idx}.tif',
    metrics_path=f'{root_dir}/metadata/{idx}.json',
    cellvit_path=f'{root_dir}/cellvit_seg/{idx}_cellvit_seg.parquet',
    tissue_contours_path=f'{root_dir}/tissue_seg/{idx}_contours.geojson',
    xenium_cell_path=f'{root_dir}/xenium_seg/{idx}_xenium_cell_seg.parquet',
    xenium_nucleus_path=f'{root_dir}/xenium_seg/{idx}_xenium_nucleus_seg.parquet',
    transcripts_path=f'{root_dir}/transcripts/{idx}_transcripts.parquet')
```
Next we can get the shapes. The `st.shapes` list is of the order `[cellvit, xenium_cell, xenium_nucleus]`, where each entry is a dataframe of Shapely polygons. However, if some of these three files are not passed or do not exist, they won't be in the list. Therefore, a more robust way is to query them using `get_shapes`:

```python
# st.shapes is a list of [cellvit, xenium_cell, xenium_nucleus]
# We can query it like this: (which segmentation, coordinate_system)
shapes = st.get_shapes('xenium_cell', 'he').shapes
```
Next, I extract the masks per patch using an ROI. For example:

```python
# Get the WSI and crop it to the ROI
wsi = st.wsi
start = 7000
end = 9048
roi = wsi.numpy()[start:end, start:end, :]

# Flag transcripts that fall inside the ROI
transcripts = st.transcript_df
condition_x = (transcripts['he_x'] >= start) & (transcripts['he_x'] < end)
condition_y = (transcripts['he_y'] >= start) & (transcripts['he_y'] < end)
valid_entries = condition_x & condition_y

# Keep only cells whose transcripts all fall inside the ROI, i.e. cells whose
# transcript count inside the ROI equals their total transcript count
counts_in_roi = transcripts[valid_entries].groupby('cell_id').size()
total_counts = transcripts.groupby('cell_id').size()
valid_cell_ids = counts_in_roi[counts_in_roi == total_counts.loc[counts_in_roi.index]].index
filtered_df = transcripts[transcripts['cell_id'].isin(valid_cell_ids)]
```
Next, filter the masks/shapes so that only cells within the ROI remain:

```python
# Filter the shapes down to the cells that lie within the ROI
# (this assumes the shapes dataframes are indexed by cell_id)
valid_cell_ids_set = set(valid_cell_ids)
shapes = st.get_shapes('xenium_cell', 'he').shapes
filtered_xenium_cell = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
shapes = st.get_shapes('xenium_nucleus', 'he').shapes
filtered_xenium_nuc = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
shapes = st.get_shapes('cellvit', 'he').shapes
filtered_cellvit = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
```
We can plot the result to see that we did a good job:
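For example, such an overlay can be produced with matplotlib. A minimal sketch, assuming the `roi`, `start`, and `filtered_xenium_cell` variables from the snippets above:

```python
import matplotlib.pyplot as plt
import numpy as np
from shapely.geometry import Polygon as ShapelyPolygon

# Overlay the filtered cell boundaries on the cropped H&E ROI
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(roi)
for poly in filtered_xenium_cell['geometry']:
    if isinstance(poly, ShapelyPolygon):
        x, y = poly.exterior.xy
        # Shift each polygon into the ROI's coordinate frame before plotting
        ax.plot(np.array(x) - start, np.array(y) - start, color='lime', linewidth=0.5)
ax.set_axis_off()
plt.show()
```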
Finally, we can convert the filtered shapes to a mask tensor. Here it is just a binary mask, but you can assign whatever cell-type labels you want.

```python
import numpy as np
from shapely.geometry import Polygon as ShapelyPolygon
from skimage.draw import polygon

def create_segmentation_mask(image_shape, xcdf, unique_ids, start=7000):
    # Create an empty 2D mask matching the image height and width
    mask = np.zeros(image_shape[:2], dtype=np.int32)
    # Rasterize each polygon; replace the `1` below with a label derived
    # from unique_ids if you want per-cell or per-cell-type masks
    for poly, unique_id in zip(xcdf['geometry'], unique_ids):
        if isinstance(poly, ShapelyPolygon):
            # Extract the polygon's exterior x/y coordinates as numpy arrays
            x, y = np.array(poly.exterior.xy[0]), np.array(poly.exterior.xy[1])
            # Shift the coordinates into the ROI's frame
            x = x - start
            y = y - start
            # Clip the coordinates to the bounds of the image
            x = np.clip(x, 0, image_shape[1] - 1)
            y = np.clip(y, 0, image_shape[0] - 1)
            # Get the pixels inside the polygon with skimage.draw.polygon
            rr, cc = polygon(y, x, shape=image_shape[:2])
            # Mark the pixels inside the polygon
            mask[rr, cc] = 1
    return mask
```
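For example, a hypothetical call for the ROI above (using the shapes dataframe's index as the IDs):

```python
cell_mask = create_segmentation_mask(
    roi.shape[:2],                # (H, W) of the cropped ROI
    filtered_xenium_cell,         # filtered shapes from above
    filtered_xenium_cell.index,   # IDs; swap in cell-type labels if needed
    start=start,
)
print(cell_mask.shape, cell_mask.max())  # e.g. (2048, 2048) 1
```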
And we see it worked. I hope someone finds this useful in the future.
@schoonhovenrichard, see the fix introduced in #63. By default, we set the argument "full-res" to `False`. Writing a SpatialData object can take a lot of time and is probably not required for some basic viz. Let me know!
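For reference, the call would then look something like this (a sketch; the exact keyword spelling may differ from the quoted "full-res"):

```python
# With the downsampled image (the default after #63), the zarr buffers
# stay under the 2 GiB codec limit when writing.
sdata = st.to_spatial_data(fullres=False)
sdata.write('TENX95.zarr')
```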
@guillaumejaume Thanks for pointing this out. That would indeed probably resolve the errors I was getting for the larger files. Thanks again for the quick responses, and feel free to close the issue if you want.
I want to experiment with training standard segmentation networks to annotate cells (semantic segmentation) on your Xenium data, and I found the following tutorial using a SpatialData object (https://spatialdata.scverse.org/en/stable/tutorials/notebooks/notebooks/examples/densenet.html) as an initial step towards defining a PyTorch dataset object. Therefore, I tried to use the `to_spatial_data()` function. However, when I run it and write the resulting object to zarr, it throws the error:

```
ValueError: Codec does not support buffers of > 2147483647 bytes
```

I understand that this error ultimately comes from SpatialData, but since this function is supported here, I was hoping you might have experience with how to convert and store this Xenium data. Is there a way to, for example, make the 'chunks' that the zarr writer receives smaller to avoid this error?

Alternatively, is there a better way to do a simple test of cell segmentation on the HEST Xenium data? Perhaps I missed a tutorial notebook where I can skip SpatialData altogether and use the hest library directly. From my experience, I ultimately need a tensor of masks of the same size as the image (in practice, both the masks and the image will be patched to a smaller size) to use as targets for training semantic segmentation networks, but I am not sure how to create those from the Xenium data.
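For concreteness, this is roughly the kind of dataset object I am after (just a sketch; all names and shapes are placeholders):

```python
import torch
from torch.utils.data import Dataset

class PatchSegmentationDataset(Dataset):
    """Yield (image patch, mask patch) pairs from one image/mask array pair.

    `image` is an (H, W, 3) uint8 array (e.g. an H&E crop) and `mask` an
    (H, W) integer array of per-pixel labels with the same height and width.
    """

    def __init__(self, image, mask, patch_size=256):
        assert image.shape[:2] == mask.shape
        self.image, self.mask, self.ps = image, mask, patch_size
        h, w = mask.shape
        # Non-overlapping grid of top-left patch corners
        self.coords = [(r, c)
                       for r in range(0, h - patch_size + 1, patch_size)
                       for c in range(0, w - patch_size + 1, patch_size)]

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, i):
        r, c = self.coords[i]
        img = self.image[r:r + self.ps, c:c + self.ps]
        tgt = self.mask[r:r + self.ps, c:c + self.ps]
        # HWC uint8 -> CHW float in [0, 1]; the mask stays integer-typed
        img = torch.from_numpy(img.copy()).permute(2, 0, 1).float() / 255.0
        return img, torch.from_numpy(tgt.copy()).long()
```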
Thanks in advance for any pointers on how to use this data for this purpose.