gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis (Xenium / MERSCOPE / CosMx / PhenoCycler / MACSima / Hyperion) that scales to millions of cells
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License

Scaling sopa to run many images #103

[Open] josenimo opened this issue 1 month ago

josenimo commented 1 month ago

Dear @quentinblampey

I am in the process of changing the provided Snakefile to run many samples (images) at once. I am replacing the input and output paths to include the {sample} wildcard. However, I run into an issue with get_input_resolve, since it handles all the patches itself. Do you have any suggestions here? Should I create my own get_input_resolve that includes the {sample} wildcard?

I tried replacing the paths with relative paths containing wildcards, but then the ChildIOException starts appearing. Any thoughts are welcome; my current Snakefile is below:

Snakefile:

```python
configfile: "ome_tif.yaml"

from utils import WorkflowPaths, Args

paths = WorkflowPaths(config)
args = Args(paths, config)

SAMPLES = ["1", "2", "3", "4"]

localrules: all

rule all:
    input:
        expand("data/quantification/{sample}_quant.csv", sample=SAMPLES)
    shell:
        """
        echo 🎉 Successfully run sopa
        echo → SpatialData output directory: {paths.sdata_path}
        echo → Explorer output directory: {paths.explorer_directory}
        echo → Open the result in the explorer: 'open {paths.explorer_experiment}'
        """

rule prepare_image:
    input:
        raw_image=expand("data/input/{sample}.ome.tif", sample=SAMPLES)
        # raw_image=paths.data_path if config["read"]["technology"] != "uniform" else []
    output:
        projection="data/projections/{sample}_projection.ome.tif"
        # projection="data/projections/small_image_projection.ome.tif"
    conda:
        "sopa"
    params:
        nuclei_channels=config["projection"]["nuclei_channels"],
        nuclear_scaling_min_max_quantiles=config["projection"]["nuclear_scaling_min_max_quantiles"],
        nuclear_projection_quantile=config["projection"]["nuclear_projection_quantile"],
        membrane_channels=config["projection"]["membrane_channels"],
        membrane_scaling_min_max_quantiles=config["projection"]["membrane_scaling_min_max_quantiles"],
        membrane_projection_quantile=config["projection"]["membrane_projection_quantile"]
    shell:
        """
        python scripts/project.py \
            --input {input.raw_image} \
            --output {output.projection} \
            --channels_nucleus {params.nuclei_channels} \
            --nucleus_min_max_scaling_quantiles {params.nuclear_scaling_min_max_quantiles} \
            --nuclear_quantile {params.nuclear_projection_quantile} \
            --channels_membrane {params.membrane_channels} \
            --membrane_min_max_scaling_quantiles {params.membrane_scaling_min_max_quantiles} \
            --membrane_quantile {params.membrane_projection_quantile}
        """

rule to_spatialdata:
    input:
        projected_image="data/projections/{sample}_projection.ome.tif"
    output:
        zarr_object="data/zarrs/{sample}.zarr"
    conda:
        "sopa"
    resources:
        mem_mb=128_000,
    params:
        args_reader=str(args["read"])
    shell:
        """
        sopa read {input.projected_image} --sdata-path {output.zarr_object} {params.args_reader}
        """

checkpoint patchify_cellpose:
    input:
        zarr_object="data/zarrs/{sample}.zarr"
    output:
        patches_file="data/zarrs/{sample}.zarr/.sopa_cache/patches_file_image",
        patches=touch("data/zarrs/{sample}.zarr/.sopa_cache/patches"),
    params:
        args_patchify=str(args["patchify"].where(contains="pixel")),
    conda:
        "sopa"
    shell:
        """
        sopa patchify image {input.zarr_object} {params.args_patchify}
        """

rule patch_segmentation_cellpose:
    input:
        zarr_object="data/zarrs/{sample}.zarr",
        patches_file="data/zarrs/{sample}.zarr/.sopa_cache/patches_file_image",
        patches="data/zarrs/{sample}.zarr/.sopa_cache/patches"
    output:
        # not sure if this is correct, but to signal the order of events
        cellpose_boundaries="data/zarrs/{sample}.zarr/.sopa_cache/cellpose_boundaries/",
        cellpose_tmp_dir="data/zarrs/{sample}.zarr/.sopa_cache/cellpose_boundaries/{wildcards.index}.parquet"
    conda:
        "sopa"
    params:
        args_cellpose=str(args["segmentation"]["cellpose"]),
    shell:
        """
        sopa segmentation cellpose {input.zarr_object} --patch-dir {output.cellpose_tmp_dir} --patch-index {wildcards.index} {params.args_cellpose}
        """

def get_input_resolve(name, dirs=False):
    def _(wildcards):
        with getattr(checkpoints, f"patchify_{name}").get(**wildcards).output.patches_file.open() as f:
            return paths.cells_paths(f.read(), name, dirs=dirs)
    return _

rule resolve_cellpose:
    input:
        zarr_object="data/zarrs/{sample}.zarr",
        cellpose_tmp_dir="data/zarrs/{sample}.zarr/.sopa_cache/cellpose_boundaries/"
    output:
        touch("data/zarrs/{sample}.zarr/.sopa_cache/cellpose_boundaries_done"),
    conda:
        "sopa"
    shell:
        """
        sopa resolve cellpose {input.zarr_object} --patch-dir {input.cellpose_tmp_dir}
        """

rule rasterize:
    input:
        zarr_object="data/zarrs/{sample}.zarr",
        # to ensure this runs after segmentation is done
        cellpose_boundaries="data/zarrs/{sample}.zarr/.sopa_cache/cellpose_boundaries_done"
    output:
        mask_tif="data/masks/{sample}_mask.tif"
    conda:
        "sopa"
    shell:
        """
        mkdir -p ./data/masks
        python scripts/rasterize.py \
            --input {input.zarr_object} \
            --output {output.mask_tif}
        """

rule quantify:
    input:
        raw_image=expand("data/input/{sample}.ome.tif", sample=SAMPLES),
        mask_tif="data/masks/{sample}_mask.tif",
        markers="data/input/markers.csv"
    output:
        quantification="data/quantification/{sample}_quant.csv"
    conda:
        "sopa"
    params:
        math=config["quantify"]["math"],
        quantile=config["quantify"]["quantile"],
    shell:
        """
        mkdir -p ./data/quantification
        python scripts/quant.py \
            --image {input.raw_image} \
            --label {input.mask_tif} \
            --markers {input.markers} \
            --output {output.quantification} \
            --math {params.math} \
            --quantile {params.quantile}
        """
```
quentinblampey commented 1 month ago

Hello,

Have you updated utils.py? I think that adding a wildcard to self.sdata_path here should do the trick, as everything else depends on this variable. But I'm not sure whether you'll still get a ChildIOException... Let me know how it goes!

josenimo commented 1 month ago

I have not updated utils.py, but will do that :) Do you mean hacking utils.py, or just listing all the data paths in the Snakefile and running it through that? (The latter seems the simplest.)

To be explicit:

```python
SAMPLES = ["/data/input/1.ome.tif", "/data/input/2.ome.tif", "/data/input/3.ome.tif", "/data/input/4.ome.tif"]

rule prepare_image:
    input:
        raw_image=expand("{sample}", sample=SAMPLES)
```
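As an aside, expand with a bare "{sample}" pattern simply substitutes each value in turn, so the input above is equivalent to passing the list directly:

```python
# With no other text in the pattern, expand just returns the substituted values:
expand("{sample}", sample=SAMPLES) == SAMPLES  # True
```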
quentinblampey commented 1 month ago

I meant something like self.sdata_path = Path("data/zarrs/{sample}.zarr") in utils.py:L43, but I'm not sure it would work
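A minimal sketch of that idea, assuming WorkflowPaths derives its other paths from self.sdata_path (the attribute names besides sdata_path are illustrative, since utils.py is not shown in this thread):

```python
from pathlib import Path

class WorkflowPaths:
    def __init__(self, config):
        # Keep a literal {sample} placeholder in the path; Snakemake substitutes
        # it per-job when the string is used in an input/output directive.
        self.sdata_path = Path("data/zarrs/{sample}.zarr")

        # Anything derived from sdata_path then inherits the wildcard
        # (illustrative attribute names).
        self.sopa_cache = self.sdata_path / ".sopa_cache"
        self.patches_file_image = self.sopa_cache / "patches_file_image"
```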

josenimo commented 1 month ago

Hey @quentinblampey,

So I have tried a couple of things. First I tried your suggestion, passing a {sample} wildcard to the paths creator, but Snakemake just creates one set of paths and thus fails; I am not sure Snakemake is compatible with this kind of object-creation setup. I also tried replacing the paths object with a .yaml file that the functions could read. I managed to make this work for my homebrewed functions, but did not dare to modify the sopa functions to accept a yaml file instead of a path directly. Not sure where to go from here; at the moment I am running the analysis one image at a time. The only idea I have left is to run the sopa Python functions directly in the Snakefile, but I haven't given it much thought yet.

Happy to hear your comments.

quentinblampey commented 1 month ago

Hey, sorry to hear it's still not working. I have to say I still find Snakemake confusing sometimes, so I don't really know how to make this work without a lot of trial and error.

One question: why do you want to run everything at once? I personally prefer to run Sopa on one slide at a time; this way, if one slide fails, it will not make the other images fail.
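For that one-slide-at-a-time pattern, a small driver script can still cover a whole batch by launching one independent snakemake run per sample, so a failing slide doesn't take the others down. A rough sketch, assuming the pipeline takes its input via a data_path config key as in the sopa docs (the paths and flags here are illustrative):

```python
import subprocess
from pathlib import Path

SAMPLES = ["1", "2", "3", "4"]

for sample in SAMPLES:
    data_path = Path("data/input") / f"{sample}.ome.tif"
    # One independent snakemake invocation per slide: a failure only
    # aborts this sample's run, not the whole batch.
    result = subprocess.run(
        [
            "snakemake",
            "--configfile", "ome_tif.yaml",
            "--config", f"data_path={data_path}",
            "--cores", "8",
        ]
    )
    if result.returncode != 0:
        print(f"Sample {sample} failed, continuing with the next one")
```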

Also, I personally use InquirerPy to prompt for and execute the pipeline, which makes it very easy to run multiple slides. See this example, which asks you to choose a config and an existing data directory. With a similar idea, you could also use a multi-select to run multiple snakemake commands at once. I have also made this prompt available to other users at my institute, so it's pretty easy for them to run Sopa.
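A rough sketch of that multi-select idea with InquirerPy (this is not the linked prompt script; the directory layout and snakemake flags are assumptions):

```python
import subprocess
from pathlib import Path

from InquirerPy import inquirer

# Let the user pick any number of input images, then run the
# pipeline once per selection.
choices = [str(p) for p in Path("data/input").glob("*.ome.tif")]
selected = inquirer.checkbox(
    message="Select the samples to process:",
    choices=choices,
).execute()

for data_path in selected:
    subprocess.run(
        ["snakemake", "--configfile", "ome_tif.yaml",
         "--config", f"data_path={data_path}", "--cores", "8"]
    )
```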