Closed pakiessling closed 2 months ago
Hi @pakiessling,
Very interesting issue. When aggregating the transcripts, I use the categorical codes from pandas to get the column index of each gene. These should always be non-negative, but what I didn't know is that the code -1 is used when a gene name is None or NaN. For instance:
>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(["gene_a", "gene_b", "gene_c", None, np.nan]).astype("category").cat.codes.values
array([ 0,  1,  2, -1, -1], dtype=int8)
Can you check that you indeed have some None/NaN gene names under sdata["transcripts"]["feature_name"]?
And does it work when you delete the rows corresponding to these transcripts?
This should be a pretty easy fix on my side
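To illustrate why a -1 code is a problem during aggregation, here is a minimal sketch. It is not Sopa's actual implementation: the `cell_index` array and the dense count matrix are assumptions made for the example. It masks out the transcripts whose code is -1 before counting, since NumPy would otherwise silently treat -1 as the last column.

```python
import numpy as np
import pandas as pd

# Toy transcripts: two of them have a missing gene name (None / NaN).
genes = pd.Series(
    ["gene_a", "gene_b", "gene_c", None, np.nan, "gene_a"]
).astype("category")

# Hypothetical cell assignment of each transcript (row of the count matrix).
cell_index = np.array([0, 0, 1, 1, 0, 1])

codes = genes.cat.codes.to_numpy()  # missing names get the code -1
valid = codes >= 0                  # keep only transcripts with a real gene name

n_cells = int(cell_index.max()) + 1
n_genes = len(genes.cat.categories)
counts = np.zeros((n_cells, n_genes), dtype=int)

# Add one count per (cell, gene) pair, ignoring the -1 codes.
np.add.at(counts, (cell_index[valid], codes[valid]), 1)

counts_df = pd.DataFrame(counts, columns=genes.cat.categories)
print(counts_df)
```

Without the `valid` mask, the two unnamed transcripts would be counted under `gene_c` (the last column) instead of being ignored.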
You are right, several "feature_name" appear to be NaN, mostly in the control probes:
deprecated_codeword 30772
unassigned_codeword 10516
predesigned_gene 6954
negative_control_codeword 5788
genomic_control_probe 403
negative_control_probe 169
10X recently updated the Xenium software, so I wonder if they changed something with the formatting.
I was actually not able to drop the entries. I tried
df = sdata["transcripts"].compute().dropna(subset=["feature_name"]).copy()
sdata.points["transcripts"] = sd.models.PointsModel.parse(df,feature_key="feature_name")
> ValueError: cannot reindex on an axis with duplicate labels
How do I do this 😅
OK, good to know, I'll update the aggregation function.
In the meantime, you can try this:
sdata["transcripts"] = sdata["transcripts"].dropna(subset=["feature_name"])
Thank you! This indeed fixed the issue.
Tangentially related, Sopa right now seems to put all the Merscope / Xenium control probes into the matrix of the table while spatialdata_io puts these control probes into an .obsm table if I recall correctly.
It is not a big problem but a bit annoying in a case like my Xenium run where my final table now contains 5000 proper genes and 3188 control codewords that I will need to move.
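For anyone in the same situation, the control columns can be separated by name. A minimal sketch, assuming the counts are available as a pandas DataFrame with one column per feature and that control features can be recognised by a name prefix. The prefixes below are illustrative guesses, not an official list, and should be checked against your own panel; `split_controls` is a hypothetical helper, not part of Sopa or spatialdata_io.

```python
import re

import numpy as np
import pandas as pd

# Assumed name prefixes of control features; verify against your own panel.
CONTROL_PATTERN = re.compile(r"^(?:NegControl|Unassigned|Deprecated|BLANK)", re.IGNORECASE)


def split_controls(counts: pd.DataFrame):
    """Return (gene_counts, control_counts), split on the column names."""
    is_control = np.array(
        [bool(CONTROL_PATTERN.match(str(name))) for name in counts.columns]
    )
    return counts.loc[:, ~is_control], counts.loc[:, is_control]


counts = pd.DataFrame(
    {
        "GENE1": [3, 0],
        "NegControlProbe_0001": [0, 1],
        "GENE2": [1, 2],
        "UnassignedCodeword_0042": [0, 0],
    }
)
gene_counts, control_counts = split_controls(counts)
print(list(gene_counts.columns))
print(list(control_counts.columns))
```

With an AnnData table, the same boolean mask could be used to subset the variables and to keep the control counts in `.obsm` instead of `.X`.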
I pushed a fix in the dev branch if you want to test it out, and it will be released as a new version in the next few weeks!
Regarding the control probes: for now, they should indeed be removed manually. I might add an option to remove them during the aggregation, but it adds a new layer of complexity (a new argument in the API and in the CLI, plus new parameters in the Snakemake config). If you think it is really important, I will consider adding this option, but otherwise I'd prefer to avoid adding extra complexity to Sopa.
Actually, I could do something smarter: saving a regex pattern in the attrs of the transcripts SpatialElement in the reader. This way, I don't need to add an extra parameter; I can just search inside the attrs.
I'll add this to the future features list.
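The idea can be sketched as follows. This is only an illustration of the approach described above, not the eventual Sopa implementation: the attrs key `control_pattern`, the two helper functions, and the pattern itself are all hypothetical, and a plain pandas DataFrame stands in for the transcripts element.

```python
import pandas as pd

# Hypothetical attrs key; the name is an assumption made for this sketch.
CONTROL_KEY = "control_pattern"


def read_transcripts(df: pd.DataFrame, control_pattern: str) -> pd.DataFrame:
    """Stand-in for a reader: attach the technology-specific regex to attrs."""
    df = df.copy()
    df.attrs[CONTROL_KEY] = control_pattern
    return df


def drop_controls(df: pd.DataFrame, feature_key: str = "feature_name") -> pd.DataFrame:
    """Stand-in for the aggregation step: filter controls if a pattern was saved."""
    pattern = df.attrs.get(CONTROL_KEY)
    if pattern is None:
        return df  # no pattern stored by the reader, keep everything
    is_control = df[feature_key].astype(str).str.contains(pattern, regex=True)
    return df[~is_control]


transcripts = read_transcripts(
    pd.DataFrame({"feature_name": ["GENE1", "NegControlProbe_0001", "GENE2", "BLANK_0007"]}),
    control_pattern=r"^(?:NegControl|BLANK)",  # illustrative pattern only
)
kept = drop_controls(transcripts)
print(kept["feature_name"].tolist())
```

Because the pattern travels with the data, the aggregation step needs no extra argument: a reader that stores nothing simply leaves all features in place.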
I am running into a weird issue with our Xenium data while Merfish data runs through without problem:
I don't see anything wrong with the point table or the shapes at first glance. Any idea what could cause this?