Do you observe this behavior for all samples, or just TENX95?
Thanks for responding! I tested TENX105 (which is smaller in size) and there it worked fine, so I think the problem is related to the size of the data. Perhaps this should be viewed as a SpatialData problem rather than one for HEST. If you know of a way to set the chunking behaviour for this type of data, that would be nice, but maybe this is something SpatialData should tackle on their side when they write zarr.
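For anyone hitting the same wall, this is roughly the kind of rechunking I mean. A minimal sketch (untested), assuming `st` is the loaded HESTData object, that `to_spatial_data()` returns a dask-backed SpatialData object, and that the image key is `'fullres'` (a guess; inspect `sdata.images` for the real name):

```python
# Rough sketch: shrink the dask/zarr chunks of the image before writing so
# each compressed buffer stays well under the ~2 GiB codec limit.
sdata = st.to_spatial_data()

# 'fullres' is a hypothetical key; check sdata.images for the actual one.
img = sdata.images['fullres']
img.data = img.data.rechunk((3, 4096, 4096))  # (channels, y, x)

sdata.write('TENX95.zarr')
```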
In the meantime, I have decided to work with the hest library itself to retrieve the cell masks. If it's okay, I'll post the code here for future users showing how to use the library to get segmentation masks.
Read the files:

```python
from hest import iter_hest

data_dir = ...  # directory containing the downloaded HEST data
idx = 'TENX95'

for st in iter_hest(data_dir, id_list=[idx], load_transcripts=True):
    ...
```
Or a manual version:

```python
from hest.HESTData import read_HESTData

root_dir = ...
idx = 'TENX95'

st = read_HESTData(
    adata_path=f'{root_dir}/st/{idx}.h5ad',
    img=f'{root_dir}/wsis/{idx}.tif',
    metrics_path=f'{root_dir}/metadata/{idx}.json',
    cellvit_path=f'{root_dir}/cellvit_seg/{idx}_cellvit_seg.parquet',
    tissue_contours_path=f'{root_dir}/tissue_seg/{idx}_contours.geojson',
    xenium_cell_path=f'{root_dir}/xenium_seg/{idx}_xenium_cell_seg.parquet',
    xenium_nucleus_path=f'{root_dir}/xenium_seg/{idx}_xenium_nucleus_seg.parquet',
    transcripts_path=f'{root_dir}/transcripts/{idx}_transcripts.parquet')
```
Next we can get the shapes. The `st.shapes` list is of the order `[cellvit, xenium_cell, xenium_nucleus]`, where each entry is a dataframe of Shapely polygons. However, if some of these three files are not passed or do not exist, they won't be in the list. Therefore, a more robust way is to query them using `get_shapes`:

```python
# st.shapes is a list of [cellvit, xenium_cell, xenium_nucleus]
# We can query it like this: (which segmentation, coordinate_system)
shapes = st.get_shapes('xenium_cell', 'he').shapes
```
Next, I extract the masks per patch using an ROI. For example:

```python
# Get the WSI and crop it to the ROI
wsi = st.wsi
start = 7000
end = 9048
roi = wsi.numpy()[start:end, start:end, :]

# Flag transcripts that fall inside the ROI
transcripts = st.transcript_df
condition_x = (transcripts['he_x'] >= start) & (transcripts['he_x'] < end)
condition_y = (transcripts['he_y'] >= start) & (transcripts['he_y'] < end)
valid_entries = condition_x & condition_y

# Keep only cells whose transcripts all fall inside the ROI, i.e. cells whose
# transcript count inside the ROI equals their total transcript count
counts_in_roi = transcripts[valid_entries].groupby('cell_id').size()
total_counts = transcripts.groupby('cell_id').size()
valid_cell_ids = counts_in_roi[counts_in_roi == total_counts.loc[counts_in_roi.index]].index
filtered_df = transcripts[transcripts['cell_id'].isin(valid_cell_ids)]
```
Next, filter the masks/shapes so that only cells within the ROI remain:

```python
# Filter the shapes down to the cells that lie within the ROI
# (this assumes the shapes dataframes are indexed by cell_id)
valid_cell_ids_set = set(valid_cell_ids)
shapes = st.get_shapes('xenium_cell', 'he').shapes
filtered_xenium_cell = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
shapes = st.get_shapes('xenium_nucleus', 'he').shapes
filtered_xenium_nuc = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
shapes = st.get_shapes('cellvit', 'he').shapes
filtered_cellvit = shapes.loc[shapes.index.isin(valid_cell_ids_set)]
```
We can plot the result to see that we did a good job:
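For example, such an overlay can be produced with matplotlib. A minimal sketch, assuming the `roi`, `start`, and `filtered_xenium_cell` variables from the snippets above:

```python
import matplotlib.pyplot as plt
import numpy as np
from shapely.geometry import Polygon as ShapelyPolygon

# Overlay the filtered cell boundaries on the cropped H&E ROI
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(roi)
for poly in filtered_xenium_cell['geometry']:
    if isinstance(poly, ShapelyPolygon):
        x, y = poly.exterior.xy
        # Shift each polygon into the ROI's coordinate frame before plotting
        ax.plot(np.array(x) - start, np.array(y) - start, color='lime', linewidth=0.5)
ax.set_axis_off()
plt.show()
```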
Finally, we can convert the filtered shapes to a mask tensor. Here it is just a binary mask, but you can assign whatever cell-type labels you want.

```python
import numpy as np
from shapely.geometry import Polygon as ShapelyPolygon
from skimage.draw import polygon

def create_segmentation_mask(image_shape, xcdf, unique_ids, start=7000):
    # Create an empty 2D mask matching the image height and width
    mask = np.zeros(image_shape[:2], dtype=np.int32)
    # Rasterize each polygon; replace the `1` below with a label derived
    # from unique_ids if you want per-cell or per-cell-type masks
    for poly, unique_id in zip(xcdf['geometry'], unique_ids):
        if isinstance(poly, ShapelyPolygon):
            # Extract the polygon's exterior x/y coordinates as numpy arrays
            x, y = np.array(poly.exterior.xy[0]), np.array(poly.exterior.xy[1])
            # Shift the coordinates into the ROI's frame
            x = x - start
            y = y - start
            # Clip the coordinates to the bounds of the image
            x = np.clip(x, 0, image_shape[1] - 1)
            y = np.clip(y, 0, image_shape[0] - 1)
            # Get the pixels inside the polygon with skimage.draw.polygon
            rr, cc = polygon(y, x, shape=image_shape[:2])
            # Mark the pixels inside the polygon
            mask[rr, cc] = 1
    return mask
```
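For example, a hypothetical call for the ROI above (using the shapes dataframe's index as the IDs):

```python
cell_mask = create_segmentation_mask(
    roi.shape[:2],                # (H, W) of the cropped ROI
    filtered_xenium_cell,         # filtered shapes from above
    filtered_xenium_cell.index,   # IDs; swap in cell-type labels if needed
    start=start,
)
print(cell_mask.shape, cell_mask.max())  # e.g. (2048, 2048) 1
```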
And we see it worked. I hope someone finds this useful in the future.
@schoonhovenrichard, see the fix introduced in #63. By default, we set the argument "full-res" to `False`. Writing a SpatialData object can take a lot of time and is probably not required for some basic viz. Let me know!
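For reference, the call would then look something like this (a sketch; the exact keyword spelling may differ from the quoted "full-res"):

```python
# With the downsampled image (the default after #63), the zarr buffers
# stay under the 2 GiB codec limit when writing.
sdata = st.to_spatial_data(fullres=False)
sdata.write('TENX95.zarr')
```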
@guillaumejaume Thanks for pointing this out. That would indeed probably resolve the errors I was getting for the larger files. Thanks again for the quick responses, and feel free to close the issue if you want.
I want to experiment with training standard segmentation networks to annotate cells (semantic segmentation) on your Xenium data, and I found the following tutorial using a SpatialData object (https://spatialdata.scverse.org/en/stable/tutorials/notebooks/notebooks/examples/densenet.html) as an initial step towards defining a PyTorch dataset object. Therefore, I tried to use the `to_spatial_data()` function. However, when I run it and write the resulting object to zarr, it throws the error:

```
ValueError: Codec does not support buffers of > 2147483647 bytes
```

I understand that this error ultimately comes from SpatialData, but since this function is supported here, I was hoping you might have experience with how to convert and store this Xenium data. Is there a way to, for example, make the 'chunks' that the zarr writer receives smaller to avoid this error?

Alternatively, is there a better way to do a simple test of cell segmentation on the HEST Xenium data? Perhaps I missed a tutorial notebook where I can skip SpatialData altogether and use the hest library directly. From my experience, I ultimately need a tensor of masks of the same size as the image (in practice, both the masks and the image will be patched to a smaller size) to use as targets for training semantic segmentation networks, but I am not sure how to create those from the Xenium data.
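For concreteness, this is roughly the kind of dataset object I am after (just a sketch; all names and shapes are placeholders):

```python
import torch
from torch.utils.data import Dataset

class PatchSegmentationDataset(Dataset):
    """Yield (image patch, mask patch) pairs from one image/mask array pair.

    `image` is an (H, W, 3) uint8 array (e.g. an H&E crop) and `mask` an
    (H, W) integer array of per-pixel labels with the same height and width.
    """

    def __init__(self, image, mask, patch_size=256):
        assert image.shape[:2] == mask.shape
        self.image, self.mask, self.ps = image, mask, patch_size
        h, w = mask.shape
        # Non-overlapping grid of top-left patch corners
        self.coords = [(r, c)
                       for r in range(0, h - patch_size + 1, patch_size)
                       for c in range(0, w - patch_size + 1, patch_size)]

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, i):
        r, c = self.coords[i]
        img = self.image[r:r + self.ps, c:c + self.ps]
        tgt = self.mask[r:r + self.ps, c:c + self.ps]
        # HWC uint8 -> CHW float in [0, 1]; the mask stays integer-typed
        img = torch.from_numpy(img.copy()).permute(2, 0, 1).float() / 255.0
        return img, torch.from_numpy(tgt.copy()).long()
```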
Thanks in advance for any pointers on how to use this data for this purpose.