gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis (Xenium / Visium HD / MERSCOPE / CosMx / PhenoCycler / MACSima / ...) that scales to millions of cells
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License
123 stars 14 forks source link

snakemake pipeline on subset of FOVs #65

Closed cstrlln closed 4 months ago

cstrlln commented 4 months ago

Hello, I was trying to test sopa with a single FOV from my dataset in order to see everything is working, to do so I created separate file versions of each of the CosMX output files for a single FOV i.e. FOV99 in this case. I'm not sure what the right naming or strategy would be to subset a dataset prior to input in sopa or if there is a way to select which FOV specifically to use with the snakemake pipeline. This is how the beginning of each file looks like:

This is the transcripts file

tx_99 fov cell_ID cell x_local_px y_local_px x_global_px y_global_px z 1: 99 0 c_1_99_0 85 43 24039.4 22226.24 8 2: 99 0 c_1_99_0 121 21 24075.4 22248.24 9 3: 99 0 c_1_99_0 259 32 24213.4 22237.24 9 4: 99 0 c_1_99_0 205 67 24159.4 22202.24 7 5: 99 0 c_1_99_0 334 87 24288.4 22182.24 9


712596: 99 2270 c_1_99_2270 1438 4239 25392.4 18030.24 8 712597: 99 2270 c_1_99_2270 1454 4239 25408.4 18030.24 7 712598: 99 2270 c_1_99_2270 1441 4240 25395.4 18029.24 9 712599: 99 2270 c_1_99_2270 1451 4241 25405.4 18028.24 8 712600: 99 2270 c_1_99_2270 1461 4243 25415.4 18026.24 9 target CellComp 1: TNFSF13 None 2: Ptk6 None 3: Mfap5 None 4: Clec4d None 5: Tnfsf8 None


712596: Krt8 Nuclear 712597: Pdgfra Nuclear 712598: Ldha Nuclear 712599: Cd33 Nuclear 712600: Ppia Nuclear

This is the FOV locations file

fov_99 Slide X_mm Y_mm Z_mm ZOffset_mm ROI FOV Order Run_Tissue_name 1: 1 2.881258 2.678565 0.021493 -2 0 99 33 mw_mus_p1_04

When I try to run sopa with snakemake I get the following errors:

(sopa-snake) carlos@dhcp-32-058 sopa % snakemake --config data_path=/Users/carlos/projects/sopa_test_1/data_fov99 --configfile=/Users/carlos/projects/sopa_test_1/config_files/cellpose_baysor_localmodel.yaml --cores 2 --use-conda SpatialData object path set to default: /Users/carlos/projects/sopa_test_1/data_fov99.zarr To change this behavior, provide --config sdata_path=... when running the snakemake pipeline Building DAG of jobs... Using shell: /bin/bash Provided cores: 2 Rules claiming more threads will be scaled down. Job stats: job count


aggregate 1 all 1 explorer 1 image_write 1 patchify_baysor 1 patchify_cellpose 1 report 1 resolve_baysor 1 resolve_cellpose 1 to_spatialdata 1 total 10

Select jobs to execute...

[Tue May 14 15:34:52 2024] rule to_spatialdata: input: /Users/carlos/projects/sopa_test_1/data_fov99 output: /Users/carlos/projects/sopa_test_1/data_fov99.zarr/.zgroup jobid: 4 reason: Missing output files: /Users/carlos/projects/sopa_test_1/data_fov99.zarr/.zgroup resources: tmpdir=/var/folders/2n/b_m3klds26l3_d3qlqd7qtlm0000gq/T, mem_mb=128000, mem_mib=122071

Activating conda environment: sopa-snake ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /Users/carlos/mambaforge/envs/sopa-snake/lib/python3.10/site-packages/pandas/core/indexes/base.p │ │ y:3805 in get_loc │ │ │ │ 3802 │ │ """ │ │ 3803 │ │ casted_key = self._maybe_cast_indexer(key) │ │ 3804 │ │ try: │ │ ❱ 3805 │ │ │ return self._engine.get_loc(casted_key) │ │ 3806 │ │ except KeyError as err: │ │ 3807 │ │ │ if isinstance(casted_key, slice) or ( │ │ 3808 │ │ │ │ isinstance(casted_key, abc.Iterable) │ │ │ │ ╭───────────────────────────────────── locals ──────────────────────────────────────╮ │ │ │ casted_key = 'X_mm' │ │ │ │ key = 'X_mm' │ │ │ │ self = Index(['Slide', 'Y_mm', 'Z_mm', 'ZOffset_mm', 'ROI', 'FOV', 'Order', │ │ │ │ │ 'Run_Tissue_name'], │ │ │ │ │ dtype='object') │ │ │ ╰───────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ in pandas._libs.index.IndexEngine.get_loc:167 │ │ │ │ in pandas._libs.index.IndexEngine.get_loc:196 │ │ │ │ in pandas._libs.hashtable.PyObjectHashTable.get_item:7081 │ │ │ │ in pandas._libs.hashtable.PyObjectHashTable.get_item:7089 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ KeyError: 'X_mm'

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ /Users/carlos/projects/sopa/sopa/cli/app.py:98 in read │ │ │ │ 95 │ │ io, technology │ │ 96 │ ), f"Technology {technology} unknown. Currently available: xenium, merscope, cosmx, │ │ 97 │ │ │ ❱ 98 │ sdata = getattr(io, technology)(data_path, *kwargs) │ │ 99 │ io.write_standardized(sdata, sdata_path, delete_table=True) │ │ 100 │ │ 101 │ │ │ │ ╭──────────────────────────────────────── locals ─────────────────────────────────────────╮ │ │ │ config_path = None │ │ │ │ data_path = '/Users/carlos/projects/sopa_test_1/data_fov99' │ │ │ │ io = <module 'sopa.io' from '/Users/carlos/projects/sopa/sopa/io/init.py'> │ │ │ │ kwargs = {} │ │ │ │ Path = <class 'pathlib.Path'> │ │ │ │ sdata_path = PosixPath('/Users/carlos/projects/sopa_test_1/data_fov99.zarr') │ │ │ │ technology = 'cosmx' │ │ │ ╰─────────────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /Users/carlos/projects/sopa/sopa/io/reader/cosmx.py:54 in cosmx │ │ │ │ 51 │ image_models_kwargs, imread_kwargs = _default_image_kwargs(image_models_kwargs, imre │ │ 52 │ │ │ 53 │ dataset_id = _infer_dataset_id(path, dataset_id) │ │ ❱ 54 │ fov_locs = _read_cosmx_fov_locs(path, dataset_id) │ │ 55 │ fov_id, fov = _check_fov_id(fov) │ │ 56 │ │ │ 57 │ protein_dir_dict = {} │ │ │ │ ╭───────────────────────────────────── locals ─────────────────────────────────────╮ │ │ │ dataset_id = '/Users/carlos/projects/sopa_test_1/data_fov99/F099' │ │ │ │ fov = None │ │ │ │ image_models_kwargs = {'chunks': (1, 1024, 1024), 'scale_factors': [2, 2, 2, 2]} │ │ │ │ imread_kwargs = {} │ │ │ │ path = PosixPath('/Users/carlos/projects/sopa_test_1/data_fov99') │ │ │ │ read_proteins = False │ │ │ ╰──────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /Users/carlos/projects/sopa/sopa/io/reader/cosmx.py:163 in _read_cosmx_fov_locs │ │ │ │ 160 │ │ │ 161 │ pixel_size = 0.120280945 # size of a pixel in microns │ │ 162 │ │ │ ❱ 163 │ fov_locs["xmin"] = fov_locs["X_mm"] 1e3 / pixel_size │ │ 164 │ fov_locs["xmax"] = 0 # will be filled when reading the images │ │ 165 │ │ │ 166 │ fov_locs["ymin"] = 0 # will be filled when reading the images │ │ │ │ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │ │ │ dataset_id = '/Users/carlos/projects/sopa_test_1/data_fov99/F099' │ │ │ │ fov_file = PosixPath('/Users/carlos/projects/sopa_test_1/data_fov99/F099_fov_positions_fi… │ │ │ │ fov_locs = │ │ Slide Y_mm Z_mm ZOffset_mm ROI FOV Order │ │ │ │ Run_Tissue_name │ │ │ │ X_mm │ │ │ │ 2.881258 1 2.678565 0.021493 -2 0 99 33 │ │ │ │ mw_mus_p1_04 │ │ │ │ path = PosixPath('/Users/carlos/projects/sopa_test_1/data_fov99') │ │ │ │ pixel_size = 0.120280945 │ │ │ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /Users/carlos/mambaforge/envs/sopa-snake/lib/python3.10/site-packages/pandas/core/frame.py:4102 │ │ in getitem │ │ │ │ 4099 │ │ if is_single_key: │ │ 4100 │ │ │ if self.columns.nlevels > 1: │ │ 4101 │ │ │ │ return self._getitem_multilevel(key) │ │ ❱ 4102 │ │ │ indexer = self.columns.get_loc(key) │ │ 4103 │ │ │ if is_integer(indexer): │ │ 4104 │ │ │ │ indexer = [indexer] │ │ 4105 │ │ else: │ │ │ │ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │ │ │ is_mi = False │ │ │ │ is_single_key = True │ │ │ │ key = 'X_mm' │ │ │ │ self = │ │ Slide Y_mm Z_mm ZOffset_mm ROI FOV Order │ │ │ │ Run_Tissue_name │ │ │ │ X_mm │ │ │ │ 2.881258 1 2.678565 0.021493 -2 0 99 33 │ │ │ │ mw_mus_p1_04 │ │ │ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /Users/carlos/mambaforge/envs/sopa-snake/lib/python3.10/site-packages/pandas/core/indexes/base.p │ │ y:3812 in get_loc │ │ │ │ 3809 │ │ │ │ and any(isinstance(x, slice) for x in casted_key) │ │ 3810 │ │ │ ): │ │ 3811 │ │ │ │ raise InvalidIndexError(key) │ │ ❱ 3812 │ │ │ raise KeyError(key) from err │ │ 3813 │ │ except TypeError: │ │ 3814 │ │ │ # If we have a listlike key, _check_indexing_error will raise │ │ 3815 │ │ │ # InvalidIndexError. Otherwise we fall through and re-raise │ │ │ │ ╭───────────────────────────────────── locals ──────────────────────────────────────╮ │ │ │ casted_key = 'X_mm' │ │ │ │ key = 'X_mm' │ │ │ │ self = Index(['Slide', 'Y_mm', 'Z_mm', 'ZOffset_mm', 'ROI', 'FOV', 'Order', │ │ │ │ │ 'Run_Tissue_name'], │ │ │ │ │ dtype='object') │ │ │ ╰───────────────────────────────────────────────────────────────────────────────────╯ │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ KeyError: 'X_mm' [Tue May 14 15:34:55 2024] Error in rule to_spatialdata: jobid: 4 input: /Users/carlos/projects/sopa_test_1/data_fov99 output: /Users/carlos/projects/sopa_test_1/data_fov99.zarr/.zgroup conda-env: sopa-snake shell:

    sopa read /Users/carlos/projects/sopa_test_1/data_fov99 --sdata-path /Users/carlos/projects/sopa_test_1/data_fov99.zarr --technology "cosmx"

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2024-05-14T153451.958303.snakemake.log


This is the structure of the folder I use for sending to sopa

image

quentinblampey commented 4 months ago

Hello @cstrlln,

Did you manually subset these files, or did you export this FOV directly from AtomX? According to the error, it seems that your F099_fov_positions_file.csv.gz file is missing the column "X_mm", even though it indeed seems to contain "Y_mm" and "Z_mm". Can you check if it contains this column?

Actually, you don't need to subset your data. With Sopa you can decide to work on one single FOV you want. With the snakemake pipeline, you need to update the config file as below to read only the FOV 99. So you can keep the full data, and this will only run the pipeline on FOV 99:

read:
  technology: cosmx
  kwargs:
    fov: 99
cstrlln commented 4 months ago

Thanks for the feedback. I do have still the X_mm column, only did a manual subset based on the FOV. I wonder if the csv writing did some weird thing with that column? Any other way I can check the input, maybe in a stepwise manner?

That's good to know for the fov for snakemake. I do envisage doing manual subset datasets though because it is more portable for sharing with colleagues, as our in house toy dataset. We have several FOVs that will not be usable so not worth to keep moving around that data.

quentinblampey commented 4 months ago

Actually, I think the issue is that there is a missing "FOV" column in the F099_fov_positions_file.csv.gz file. Normally, a file location should look like the one below, and we use the FOV column as an index. In your case, since your second column is "X_mm", it is used as an index, and therefore it's not inside the columns anymore.

Slide,FOV,X_mm,Y_mm
3,1,0.722208817050781,13.7198104736064
3,2,1.23412451267148,13.7198104736064
3,3,1.74604020829218,13.7198104736064
3,4,2.25795590391288,13.7198104736064

TL;DR: can you make sure your F099_fov_positions_file.csv.gz file looks like this?

Slide,FOV,X_mm,Y_mm
1,99,2.881258,2.678565
cstrlln commented 4 months ago

So it has to be in that specific order? the cosmx file has this headings

image

Because I have subsetted the FOV locations files and is one single line now, would that be an issue?

quentinblampey commented 4 months ago

Indeed, I was assuming that the FOV column was always the second one. I don't understand why the columns are shuffled, I'm surprised that the CosMX exports are so inconsistent. Anyway, I should update the reader to deal with this.

I'll soon make a new release to handle this, I'll let you know!

cstrlln commented 4 months ago

Yes, again cosmx changing data outputs.

So that I can test this before a new release, could you point me to the expected order of columns for the required files? I don't mind re-organizing for this one.

Carlos

cstrlln commented 4 months ago

Also, if you let me know what files, I can send you recent versions of the output files to have as reference.

quentinblampey commented 4 months ago

It's only the FOV column, it should be the second column, like this:

Slide,FOV,X_mm,Y_mm
1,99,2.881258,2.678565

The order for the other columns doesn't matter

"Also, if you let me know what files I can send you recent versions of the output files to have as reference."

Sorry, I'm not sure to understand what you want?

quentinblampey commented 4 months ago

Really sorry for all this mess with CosMX data. The reader is still quite new, but I try to make it more stable, and I will try to add details on the docs for using the right export module

cstrlln commented 4 months ago

No worries, clearly CosMX is really poor at documenting outputs.

What I was proposing is I can send you examples of current versions of CosMX outputs so that you can have as reference.

quentinblampey commented 4 months ago

Ah, yes, it would be great, thanks! I would need the _fov_positions_file, and the _tx_file. Here is my email address in case you need it: quentin.blampey@centralesupelec.fr

quentinblampey commented 4 months ago

This should be fixed in sopa==1.0.12, can you try?

cstrlln commented 4 months ago

great! will give it a try in the next couple days.

cstrlln commented 4 months ago

Thanks, it is working now!

quentinblampey commented 4 months ago

Nice!

Do you mind telling me which export module you used? (description or screenshot) I could add this to the docs to help future users!

cstrlln commented 4 months ago

Of course. It was not necessarily a module. The way to get the data in flat files is within a study, click on Export and then selecet the "Flat CSV Files" section

Screen Shot 2024-05-22 at 3 11 07 PM f)

Screen Shot 2024-05-22 at 3 11 29 PM

cstrlln commented 4 months ago

Just in case, here are the files as of May 2024

Ah, yes, it would be great, thanks! I would need the _fov_positions_file, and the _tx_file. Here is my email address in case you need it: quentin.blampey@centralesupelec.fr

F099_mod_tx_file.csv.gz F099_fov_positions_file.csv.gz

quentinblampey commented 4 months ago

Thanks for these details, I'll update the docs accordingly!