gustaveroussy / sopa

Technology-invariant pipeline for spatial omics analysis that scales to millions of cells (Xenium / Visium HD / MERSCOPE / CosMx / PhenoCycler / MACSima / etc)
https://gustaveroussy.github.io/sopa/
BSD 3-Clause "New" or "Revised" License
143 stars 16 forks source link

[Feature] Assigning transcript (points) to cells #132

Open pakiessling opened 1 month ago

pakiessling commented 1 month ago

Hi Quentin,

I saw in #87 that you were making the choice to not update the points dataframe with the newly assigned cell id. I can see the value in not modifying the original data when it comes to something like the image, however I do not really see what it adds in the case of the cell assignment. So far I never wanted to go back to the original assignment (or If I would like to I still have the sopa input) but I really need the point-to-polygon assignment for analysis I want to do. Maybe just adding an additional column with the new cell_id would be a compromise?

quentinblampey commented 1 month ago

Hi @pakiessling,

Have you tried this _map_transcript_to_cell function? It should do exactly what you want.

If that works and if it's useful, then I could make it more visible in the docs!

pakiessling commented 1 month ago

Hi, yes I ended up using this very fast & smooth.

It is very useful for things like subcellular analysis.

Would be cool to have this more visible!

quentinblampey commented 1 month ago

Nice! Okay, I'll make it more visible!

pakiessling commented 1 month ago

@quentinblampey I actually have a question about _map_transcript_to_cell I used it lie this:

_map_transcript_to_cell(
    sdata,
    "cellpose_boundaries_row_nr",
    sdata.points["my_points],
    "cellpose_boundaries",
)

I assumed that the values it added were the row numbers of the geopandas dataframe. But:

sdata.points["my_points].compute().cellpose_boundaries_row_nr.value_counts()
cellpose_boundaries_row_nr
0        1331235
70150       3643
87365       3521
47877       3453
71058       3285
          ...   
92583         10
29033         10
65535         10
57416         10
44450         10

This doesnt match the values in the table:

sc.pp.calculate_qc_metrics(sdata.tables["table"],inplace=True)
sdata.tables["table"].obs.iloc[0]
total_counts                                        141
sdata.tables["table"].obs.iloc[70150]
total_counts                                       1048

Do you know what could be happening there? This is fresh after Sopa Snakemake pipeline has been run. So everything should be correctly aligned?

quentinblampey commented 1 month ago

The IDs are shifted by 1 because the ID 0 is already used for transcripts that are not assigned to any cell. Can you look at the counts of the cell at iloc 70151 instead?

I would prefer to use NA for the transcripts that are not assigned, this way we would not need to shift the indices. But this is the expected Baysor input. I can add an argument to choose how we handle the "non-assigned transcripts", what do you think?

pakiessling commented 1 month ago

Ok I understand now.

In my data I actually have to substract one from the added number. 70149 corresponds to the correct row in the table and shapes.

I think NA would be less confusing. My favorite would actually be NA + the index (cell_id) from shapes and tables if that would be possible

quentinblampey commented 1 month ago

70149 corresponds to the correct row in the table and shapes.

Yes, sorry, it's 70149 indeed.

Okay, I'll add the possibility to have NA + cell index!

pakiessling commented 1 month ago

Thanks a lot!