gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis (Xenium / Visium HD / MERSCOPE / CosMx / PhenoCycler / MACSima / ...) that scales to millions of cells
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License

[Bug] Aggregator.compute_table raises ValueError: negative axis 1 index: -1 #101

Closed · pakiessling closed 2 months ago

pakiessling commented 2 months ago

I am running into a weird issue with our Xenium data, while our MERFISH data runs through without problems:


aggregator = sopa.segmentation.Aggregator(sdata, image_key="morphology_focus", shapes_key="cellpose_boundaries")
aggregator.compute_table(gene_column="feature_name", average_intensities=False)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 2
      1 aggregator = sopa.segmentation.Aggregator(sdata, image_key="morphology_focus", shapes_key="cellpose_boundaries")
----> 2 aggregator.compute_table(gene_column="feature_name", average_intensities=False)

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/sopa/segmentation/aggregate.py:247, in Aggregator.compute_table(self, gene_column, average_intensities, expand_radius_ratio, min_transcripts, min_intensity_ratio, save_table)
    245         log.warn("sdata.table is already existing. Transcripts are not count again.")
    246     else:
--> 247         self.table = count_transcripts(self.sdata, gene_column, shapes_key=self.shapes_key)
    249 if does_count and min_transcripts > 0:
    250     self.filter_cells(self.table.X.sum(axis=1) < min_transcripts)

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/sopa/segmentation/aggregate.py:421, in count_transcripts(sdata, gene_column, shapes_key, points_key, geo_df)
    418     geo_df = to_intrinsic(sdata, geo_df, points_key)
    420 log.info(f"Aggregating transcripts over {len(geo_df)} cells")
--> 421 return _count_transcripts_aligned(geo_df, points, gene_column)

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/sopa/segmentation/aggregate.py:452, in _count_transcripts_aligned(geo_df, points, value_key)
    446 X_partitions = []
    448 with ProgressBar():
    449     points.map_partitions(
    450         partial(_add_coo, X_partitions, geo_df, gene_column=value_key, gene_names=gene_names),
    451         meta=(),
--> 452     ).compute()
    454 for X_partition in X_partitions:
    455     adata.X += X_partition

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/dask/base.py:376, in DaskMethodsMixin.compute(self, **kwargs)
    352 def compute(self, **kwargs):
    353     """Compute this dask collection
    354 
    355     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    374     dask.compute
    375     """
--> 376     (result,) = compute(self, traverse=False, **kwargs)
    377     return result

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/dask/base.py:662, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    659     postcomputes.append(x.__dask_postcompute__())
    661 with shorten_traceback():
--> 662     results = schedule(dsk, keys, **kwargs)
    664 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/sopa/segmentation/aggregate.py:474, in _add_coo(X_partitions, geo_df, partition, gene_column, gene_names)
    471 joined = geo_df.sjoin(points_gdf)
    472 cells_indices, column_indices = joined.index, joined[gene_column].cat.codes
--> 474 X_partition = coo_matrix(
    475     (np.full(len(cells_indices), 1), (cells_indices, column_indices)),
    476     shape=(len(geo_df), len(gene_names)),
    477 )
    479 X_partitions.append(X_partition)

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/scipy/sparse/_coo.py:99, in _coo_base.__init__(self, arg1, shape, dtype, copy)
     96 if dtype is not None:
     97     self.data = self.data.astype(dtype, copy=False)
---> 99 self._check()

File /work/rwth1209/enviroments/sopa2/lib/python3.10/site-packages/scipy/sparse/_coo.py:208, in _coo_base._check(self)
    205     raise ValueError(f'axis {i} index {idx.max()} exceeds '
    206                      f'matrix dimension {self.shape[i]}')
    207 if idx.min() < 0:
--> 208     raise ValueError(f'negative axis {i} index: {idx.min()}')

ValueError: negative axis 1 index: -1

I don't see anything wrong with the points table or the shapes at first glance. Any idea what could cause this?

quentinblampey commented 2 months ago

Hi @pakiessling, very interesting issue! When aggregating the transcripts, I use the categorical codes from pandas to get the column index of each gene. These codes should always be non-negative, but what I didn't know is that the code -1 is used when a gene name is None or NaN. For instance:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["gene_a", "gene_b", "gene_c", None, np.nan]).astype("category").cat.codes.values
array([ 0,  1,  2, -1, -1], dtype=int8)
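
That -1 then propagates into the sparse matrix construction, which is exactly where your traceback ends. A minimal reproduction of the mechanism with scipy alone (outside of Sopa):

>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> rows = np.array([0, 1, 2])
>>> cols = np.array([0, 1, -1])  # -1 as produced by cat.codes for a NaN gene name
>>> coo_matrix((np.ones(3), (rows, cols)), shape=(3, 2))
Traceback (most recent call last):
  ...
ValueError: negative axis 1 index: -1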

Can you check that you indeed have some None/NaN gene names under sdata["transcripts"]["feature_name"]? And does it work when you delete the rows corresponding to these transcripts?
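
For a quick check, something like this should work (a sketch, assuming the transcripts are a lazy dask-backed points element, as usual in spatialdata):

>>> # count the transcripts whose gene name is missing (computed lazily)
>>> sdata["transcripts"]["feature_name"].isna().sum().compute()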

This should be a pretty easy fix on my side.

pakiessling commented 2 months ago

You are right, several "feature_name" entries are indeed NaN, mostly for the control probes:

deprecated_codeword          30772
unassigned_codeword          10516
predesigned_gene              6954
negative_control_codeword     5788
genomic_control_probe          403
negative_control_probe         169
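
For reference, a breakdown like this can be obtained along these lines (the codeword_category column name is an assumption and may differ between Xenium software versions):

tx = sdata["transcripts"].compute()
# count the NaN feature names per codeword category
print(tx.loc[tx["feature_name"].isna(), "codeword_category"].value_counts())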

10X recently updated the Xenium software, so I wonder if they changed something with the formatting.

I was actually not able to drop the entries. I tried:

df = sdata["transcripts"].compute().dropna(subset=["feature_name"]).copy()
sdata.points["transcripts"] = sd.models.PointsModel.parse(df, feature_key="feature_name")

> ValueError: cannot reindex on an axis with duplicate labels

How do I do this 😅

quentinblampey commented 2 months ago

Ok, good to know, I'll update the aggregation function

Meanwhile, you can try this:

sdata["transcripts"] = sdata["transcripts"].dropna(subset=["feature_name"])
pakiessling commented 2 months ago

Thank you! This indeed fixed the issue.

Tangentially related: Sopa currently seems to put all the MERSCOPE / Xenium control probes into the table's expression matrix, while spatialdata_io, if I recall correctly, puts these control probes into an .obsm entry.

It is not a big problem, but it is a bit annoying in a case like my Xenium run, where the final table now contains 5000 proper genes and 3188 control codewords that I will need to move.

quentinblampey commented 2 months ago

I pushed a fix to the dev branch if you want to test it out; it will be released in a new version in the next few weeks!

Regarding the control probes: for now, they should indeed be removed manually. I might add an option to remove them during the aggregation, but it adds a new layer of complexity (a new argument in the API and the CLI, plus new parameters in the Snakemake config). If you think it is really important, I will consider adding this option, but otherwise I prefer to avoid adding extra complexity to Sopa.
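
For reference, a manual clean-up can look like this (a sketch, not part of Sopa's API; the name pattern is an assumption and should be adapted to your panel):

adata = sdata.table  # the AnnData produced by the aggregation

# flag control probes by name (hypothetical pattern, adapt to your panel)
is_control = adata.var_names.str.contains(
    "NegControl|BLANK|Deprecated|Unassigned", case=False, regex=True
)

# keep the control counts aside (similar to spatialdata_io), then drop them
control_counts = adata[:, is_control].X.copy()
adata = adata[:, ~is_control].copy()
adata.obsm["control_probes"] = control_counts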

quentinblampey commented 2 months ago

Actually, I could do something smarter: saving a regex pattern in the attrs of the transcripts SpatialElement in the reader. This way, I don't need to add an extra parameter; I can just search inside the attrs. I'll add this to the future-features list.
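
Sketch of the idea (hypothetical attrs key and pattern, not the final implementation):

import re

# reader side: remember what control-probe names look like for this technology
sdata["transcripts"].attrs["control_probes_regex"] = r"^(NegControl|BLANK|Deprecated)"

# aggregation side: no new user-facing argument, just look inside the attrs
gene_names = ["GENE1", "NegControlProbe_00042", "BLANK_0012"]  # example
pattern = sdata["transcripts"].attrs.get("control_probes_regex")
if pattern is not None:
    gene_names = [g for g in gene_names if not re.match(pattern, g)]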