scverse / spatialdata

An open and interoperable data framework for spatial omics data
https://spatialdata.scverse.org/
BSD 3-Clause "New" or "Revised" License

fix for to_polygons when using processes instead of threads in dask #756

Open ArneDefauw opened 2 weeks ago

ArneDefauw commented 2 weeks ago

I noticed that when dask is configured to use 'processes' instead of 'threads', spatialdata.to_polygons fails because of this line:

https://github.com/scverse/spatialdata/blob/27bb4a7579d8ff7cc8f6dd9b782226cb984ceb20/src/spatialdata/_core/operations/vectorize.py#L212

The result of dask.compute() is lost when using processes, which is expected behaviour, since worker processes do not share memory with the calling process.
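To illustrate the failure mode in isolation, here is a minimal sketch that does not use spatialdata at all. The helper names `square_via_side_effect` and `square_via_return` are hypothetical and only stand in for the general pattern; the sketch assumes the problem is a task that writes into an object owned by the caller instead of returning its result.

```python
import dask


def square_via_side_effect(values, scheduler):
    """Anti-pattern (hypothetical helper): tasks append to a list owned by the caller."""
    collected = []

    def _task(v):
        # Mutates whichever copy of `collected` lives in the executing process.
        collected.append(v * v)

    tasks = [dask.delayed(_task)(v) for v in values]
    dask.compute(*tasks, scheduler=scheduler)  # return values are ignored here
    return sorted(collected)


def square_via_return(values, scheduler):
    """Scheduler-independent variant (hypothetical helper): tasks return their result."""

    def _task(v):
        return v * v

    tasks = [dask.delayed(_task)(v) for v in values]
    # dask.compute returns a tuple with one entry per task, in task order.
    return list(dask.compute(*tasks, scheduler=scheduler))


if __name__ == "__main__":
    # The guard matters: process-based schedulers may re-import the main module in workers.
    for scheduler in ("threads", "processes"):
        print(scheduler, square_via_side_effect([1, 2, 3], scheduler), square_via_return([1, 2, 3], scheduler))
```

With the threads scheduler both helpers should return the same squares. With the processes scheduler the side-effect variant is expected to come back empty, because each worker mutates its own copy of the list, while the return-based variant keeps working.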

Note that using 'processes' instead of 'threads' considerably speeds up vectorizing labels for large masks, because the function we try to parallelize does not release the GIL.
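The effect of the GIL can be checked with a rough, self-contained timing sketch. The function `count_primes` is a hypothetical stand-in for any pure-Python, CPU-bound task that never releases the GIL; it is not the vectorization code, and the timings will vary by machine.

```python
import time

import dask


def count_primes(limit):
    """Pure-Python CPU-bound work (hypothetical example) that holds the GIL throughout."""
    count = 0
    for n in range(2, limit):
        # Trial division; all() on an empty range is True, so 2 and 3 count as prime.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_run(scheduler, limits):
    """Run one task per limit under the given scheduler; return (results, seconds)."""
    tasks = [dask.delayed(count_primes)(limit) for limit in limits]
    start = time.perf_counter()
    # dask.config.set also works as a context manager, which scopes the scheduler choice.
    with dask.config.set(scheduler=scheduler):
        results = dask.compute(*tasks)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    limits = [200_000] * 4
    for scheduler in ("threads", "processes"):
        results, seconds = timed_run(scheduler, limits)
        print(f"{scheduler}: {seconds:.2f}s, results={results}")
```

On a multi-core machine the processes run would be expected to finish faster for this kind of workload, although process start-up and pickling overhead can outweigh the gain for small tasks.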

Example:

import os
from pathlib import Path

import pooch
from pooch import Pooch

from spatialdata import read_zarr

BASE_URL = "https://objectstor.vib.be/spatial-hackathon-public/sparrow/public_datasets"

def _get_registry(path: str | Path | None = None) -> Pooch:
    return pooch.create(
        path=pooch.os_cache("sparrow") if path is None else path,
        base_url=BASE_URL,
        version="0.0.1",
        registry={
            "transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip": "e1f36061e97e74ad131eb709ca678658829dc4385a444923ef74835e783d63bc",
        },
    )

registry = _get_registry(path=None)  # set path to download the data somewhere else
unzip_path = registry.fetch("transcriptomics/vizgen/mouse/_sdata_2D.zarr.zip", processor=pooch.Unzip())
sdata = read_zarr(os.path.commonpath(unzip_path))
sdata.path = None
import dask
from spatialdata import to_polygons

dask.config.set(scheduler="processes")
gdf = to_polygons(sdata["segmentation_mask_full"])
# takes about 3 min locally on a Mac M2

dask.config.set(scheduler="threads")
gdf = to_polygons(sdata["segmentation_mask_full"])
# takes about 8 min locally on a Mac M2

"segmentation_mask_full' contains the masks from a merscope experiment, around 300k labels.

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.89%. Comparing base (27bb4a7) to head (d7c91a1).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #756      +/-   ##
==========================================
- Coverage   91.89%   91.89%   -0.01%
==========================================
  Files          45       45
  Lines        6919     6918       -1
==========================================
- Hits         6358     6357       -1
  Misses        561      561
```

| [Files with missing lines](https://app.codecov.io/gh/scverse/spatialdata/pull/756?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scverse) | Coverage Δ | |
|---|---|---|
| [src/spatialdata/\_core/operations/vectorize.py](https://app.codecov.io/gh/scverse/spatialdata/pull/756?src=pr&el=tree&filepath=src%2Fspatialdata%2F_core%2Foperations%2Fvectorize.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scverse#diff-c3JjL3NwYXRpYWxkYXRhL19jb3JlL29wZXJhdGlvbnMvdmVjdG9yaXplLnB5) | `93.75% <100.00%> (-0.04%)` | :arrow_down: |