scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License
230 stars 43 forks source link

check on `c_coords` should be one the output data in `map_raster` #597

Open giovp opened 3 months ago

giovp commented 3 months ago

You are right, the check on c_coords should be one the output data. I merged already because I need to merge #594, which was opened against the current PR. Please can you make a patch for fixing that behavior?

Originally posted by @LucaMarconato in https://github.com/scverse/spatialdata/issues/588#issuecomment-2182817806

LucaMarconato commented 3 months ago

CC @ArneDefauw

ArneDefauw commented 3 months ago

I've added PR for this issue, see https://github.com/scverse/spatialdata/pull/599

I wonder if support should also be added to spatialdata for relabeling across chunks to avoid label collisions if going from image->labels

LucaMarconato commented 3 months ago

Thanks! That's an interesting API! It could be implemented making use of the new map_raster() right?

ArneDefauw commented 3 months ago

yes, indeed. Would you prefer adding it as a parameter to map_raster()? Say a boolean flag, relabel_blocks, ignored if output type is not a labels layer or if blockwise is False. Or would you prefer wrapping map_raster in a seperate function, which will take care of relabeling chunks if output layer is a labels layer?

Wouldn't be too difficult to implement, I can prepare a PR if you would be interested.

LucaMarconato commented 3 months ago

I think I would wrap it in a separate function, @giovp @melonora wdyt?

giovp commented 3 months ago

that sounds really useful @ArneDefauw ! I think it depends a bit, if it's just a single argument, then maybe can be done in map_raster? as in, we have already a very extensive list of functions/methods in the repo, but I also understand on striking a balance of composability and fleixiblity

LucaMarconato commented 3 months ago

Good point! If using map_raster() for that use case would feel unnatural, or if the functionality would be "too hidden" I would go for a separate function with an obvious name.

LucaMarconato commented 3 months ago

Developing more on this, I guess that signature of the separate function would be

relabel(labels: DataArray | DataTree, relabel: dict[str: str])

while the usage with map_raster() would read as

python
map_raster(data=data, func=lambda x: x, relabel: dict[str, str])

So pretty much equivalent.

I propose to:

ArneDefauw commented 3 months ago

I agree, I think we can first try to wrap it in map_raster.

@LucaMarconato , I don't think I understand the type of relabel (relabel: dict[str:str]) you propose. I think relabel can be a boolean. I was planning to implement relabeling via use of a bit shift. This is how it was done in squidpy, and is also how I typically implement it to avoid collisions when I do not know which labels will be predicted for each block. I.e. something like: ...

        num_blocks = x.numblocks
        shift = int(np.prod(num_blocks[0] * num_blocks[1] * num_blocks[2]) - 1).bit_length()

...

        block_num = block_id[0] * (num_blocks[1] * num_blocks[2]) + block_id[1] * (num_blocks[2]) + block_id[2]
        mask: NDArray = labels > 0
        labels[mask] = (labels[mask] << shift) | block_num
giovp commented 3 months ago

I think it sounds great @ArneDefauw ! I agree on the type of relabel, I think maybe one thing would have to be checked if the label image would be annotated by a table, whether the segmentation mask ids match

LucaMarconato commented 3 months ago

I believe we are thinking of different relabeling use cases.

my interpretation

What I was thinking is of having the relabeling operation as a function that takes as input a mapping between the old labels values and the new label values, with the extra assumption that the mapping is injective. The type would be dict[int, int] (rather than dict[str, str] as I mentioned erroneously before).

The idea is that we could have the unique values [0, 1, 3, 5] in the array, and if the dict is {0: 0, 1: 1, 3: 2, 5: 3}, now we obtain as unique values [0, 1, 2, 3].

The relabeling function is straightforward to implement, and I guess the most common use case would be to give a set of indices without gaps, like in the example above (where the gap between 1, 3 and 3, 5, disappears).

your interpretation

I haven't understood the use case for your code, could you please explain it? What are the properties of the new relabeling? Furthermore, I have two comments. 1) With the code above I think that if you have two blocks that contain pixels with the same values (let's say the label 0x000000ff); I think that after the relabeling, this number could become, say, 0x00000fff in one block, but could be a different number, for instance 0x0000f0ff in another block due to the different bits added with | block_num (in my example block_num is 0x00000f00 and 0x0000f000 respectively. 2) Does your code guarantee that the relabeling is injective? I could imagine that if block_num is something like 0xffffffff, then all the labels would collapse to that number.

final comment

I think it sounds great @ArneDefauw ! I agree on the type of relabel, I think maybe one thing would have to be checked if the label image would be annotated by a table, whether the segmentation mask ids match

I agree, if the relabeling is performed we could also take care of ensuring the annotating table still match.

giovp commented 3 months ago

The relabelling is needed because the segmentation masks ids are unique within each block, but they are not connected in the global mask, hence you need to relabel based on the connected segmentation masks across blocks, you can see how it is done in squidpy here: https://github.com/scverse/squidpy/blob/7c5c60c47316f3f75f2d73aa684d13bccfbc2662/src/squidpy/im/_segment.py#L105-L135

I am not sure there is a realistic use case when the user wants to pass a set of segmentation masks ids used to relabel the existing ids. In that case however, it could simply be a lambda function + map_blocks, but the relabelling approach from @ArneDefauw would still be needed.

ArneDefauw commented 3 months ago

Now I understand the confusion @LucaMarconato , we are indeed thinking of different use cases :).

I think the following use cases can be isolated.

1) labels_layer -> labels_layer. Relabeling an existing labels layer. Your use case is providing some utility function (or wrap it in map_raster) to relabel a labels layer. Indeed, relabeling to sequential labels would be the most common use case. A bit similar to the scikit image utility function skimage.segmentation.relabel_sequential, but then implemented for dask arrays. I don't think dask-image implements such function, so there could be a use case for it. I don't know if people would like more fine grained control over this relabeling apart from a sequential relabeling, so I am not sure if we should support passing a dict.

2) labels_layer -> labels_layer. A func is passed to map_raster that does not alter id's of the labels. E.g. expansion of the masks, or filtering of masks (for example filtering out cells that are too small or too big). I can not think of an operation that would need to alter id's of the labels, so relabeling as I propose in use case 3 would probably not be necessary if you have a single labels layer, but we can make it optional, even in this case.

3) image_layer -> labels_layer (segmentation), multiple labels_layers -> labels_layer ( merging of different labels layers to a single labels layer).

a) Segmentation. If your callable converts image layers to labels layers, you are basically doing segmentation, and you need a way to handle the predictions done by e.g. cellpose on each block. Cellpose, or any segmentation algorithm will predict labels ranging from 1 to say 100 on each block. So different cells in different blocks can be assigned the same label, so hence you need a way to make the labels unique across blocks.

'Does your code guarantee that the relabeling is injective? I could imagine that if block_num is something like 0xffffffff, then all the labels would collapse to that number.'' No, collision between labels in different chunks is a theoretical possibility. I think a worst case scenario would be that you have max_uint16+1 chunks (I assume we work in uint32).

max_uint16 = np.iinfo(np.uint16).max
shift=int(max_uint16-1).bit_length() # shift is 16 in this case

And if you then would have labels ranging up to max_uint16+1 in a block, then they would be mapped to 0 if you do max_uint16<<16. But is this not merely theoretical? max_unint16=65535, which is a huge amount of chunks. I think solution would be to put a sanity assert on the number of blocks and the max label in a block. We can check if collisions would happend, and throw an error message.

As @giovp mentioned, after this relabeling, you still need a way to connect them in the global mask. I found that squidpy's way of connecting them, using dask-image functions like label_adjancency_graph, leads to issues for dense cell regions . So either we choose another method, e.g. as in https://github.com/gletort/NeubiasPasteur2023_AdvancedCellPose/blob/main/distributed_segmentation.py (this method worked better); or we let the user connect the predicted mask by the algorithm of choice. I.e., we let the user solve the chunking artifacts on the border of the chunks.

b) Merging labels layers, i.e. [ label_layer_1, label_layer_2 ] -> merged_label_layer is currently not supported by map_raster because we do not allow passing more than one DataArray. But you would want to handle relabeling in same way as in a.

LucaMarconato commented 3 months ago

Ok thank you for the explanation, now it's clear.

I don't know if people would like more fine grained control over this relabeling apart from a sequential relabeling, so I am not sure if we should support passing a dict.

I would default for the sequential labeling then. We are thinking of having a package/some code that deals with image feature extraction. This function could live in such a package and be implemented, as Giovanni suggested, by passing the correct lambda function to map_raster(). (CC @melonora. TLDR; since it's a long issue: you can skip all the text above, in short, we could have a function in the raster manipulation package to relabel existing labels objects. You can stop reading now :D ).

LucaMarconato commented 3 months ago

But is this not merely theoretical? max_unint16=65535, which is a huge amount of chunks. I think solution would be to put a sanity assert on the number of blocks and the max label in a block. We can check if collisions would happend, and throw an error message.

You are right, it would be ok.

LucaMarconato commented 3 months ago

As @giovp mentioned, after this relabeling, you still need a way to connect them in the global mask. I found that squidpy's way of connecting them, using dask-image functions like label_adjancency_graph, leads to issues for dense cell regions . So either we choose another method, e.g. as in gletort/NeubiasPasteur2023_AdvancedCellPose@main/distributed_segmentation.py (this method worked better); or we let the user connect the predicted mask by the algorithm of choice. I.e., we let the user solve the chunking artifacts on the border of the chunks.

If I am getting the problem right, you are trying to have a way to detect cells that overlapped to a block border, and make sure that their label value is the same globally.

In such a case, in SOPA, the external segmentation algorithms return polygons/multipolygons (or when they return raster, the segmentation is converted to polygons/multipolygons), and then a conflict resolution strategy is applied to the borders. IIRC basically with the same short code shown here (or with some equivalent code): https://github.com/scverse/spatialdata/blob/330caa09aaedf8b3d71233a67ffc206031b241b5/src/spatialdata/_core/operations/vectorize.py#L251.

Practically speaking, I would not worry about connecting adjacent pixels from different blocks here. The user can always do this manually or rely on strategies such as the one implemented in SOPA, to achieve that.

Please let me know what you think about this.

giovp commented 3 months ago

This function could live in such a package and be implemented, as Giovanni suggested, by passing the correct lambda function to map_raster(). (CC @melonora. TLDR; since it's a long issue: you can skip all the text above, in short, we could have a function in the raster manipulation package to relabel existing labels objects. You can stop reading now :D ).

mmh I don't think this requires a separate package, since this processing step might be required right after a map_raster operation. Yes, it could also be needed in other methods that are not currently implemented, but my understanding from @ArneDefauw is that this is desirable in some instances of map_raster usage and so it could be added directly there

LucaMarconato commented 3 months ago

@giovp I meant the relabeling of raster types that I described under "my intepretation" here https://github.com/scverse/spatialdata/issues/597#issuecomment-2210892148 (sorry for the confusion).

Instead the fix described under "your interpretation" should indeed be implemented here, and I am now convinced that the byte shifting approach would be good.

In conclusion, @ArneDefauw if you could implement the bit shifting approach that would be amazing! And to regard to the "relabeling of segmentation masks as of my interpretation", I'd skip the implementation for now.

ArneDefauw commented 3 months ago

@giovp I meant the relabeling of raster types that I described under "my intepretation" here #597 (comment) (sorry for the confusion).

Instead the fix described under "your interpretation" should indeed be implemented here, and I am now convinced that the byte shifting approach would be good.

In conclusion, @ArneDefauw if you could implement the bit shifting approach that would be amazing! And to regard to the "relabeling of segmentation masks as of my interpretation", I'd skip the implementation for now.

Ok, I will implement the bit shifting approach! Should not be too difficult to implement, i can recycle some code here and there :)

Re connecting adjacent pixels from different blocks. I have not tested the method implemented by SOPA, but in my opinion having to convert to a shapes layer to fix these chunking artifacts is an unnecessary conversion step; these chunking arifacts can be fixed working on raster data only, and prevents having to go back and forth between a shapes layer and a labels layer. Also, I wonder if a z dimension is supported if you convert to a shapes layer to fix these artefacts. We can skip the implementation for now, but I would be happy to pick this up at a later time, if required necessary. I would then propose to put it in a separate function map_segment, which would take care of scaling any Callable that converts images to labels using map_overlap, while taking care of relabeling (bit shift) and fixing chunking artefacts.

So if I understand correctly, for now we do not implement the optional sequential relabeling?

LucaMarconato commented 3 months ago

Ok, I will implement the bit shifting approach! Should not be too difficult to implement, i can recycle some code here and there :)

Thanks a lot!

I would then propose to put it in a separate function map_segment, which would take care of scaling any Callable that converts images to labels using map_overlap,

Nice, let's keep this open for a follow up PR. I'd go for the name map_labels at that point, but let's think about this later on.

So if I understand correctly, for now we do not implement the optional sequential relabeling?

Maybe let's have this just as internal function but I would not expose this in a separate API now.