Some cells are missing from output

smorabit commented 2 years ago

I am trying to use CellTrek to map my single-cell RNA-seq data onto Visium ST coordinates. I have a dataset of 500k+ cells for my single-cell RNA-seq.

I ran the traint function with default arguments, and the co-embedding looks reasonable. However, when I ran celltrek with the default arguments, only a fraction of the cells in my single-cell dataset were mapped. I also tried downsampling my scRNA data so that celltrek only had to map a subset, but many cells are still missing from the output. I am not sure why this is happening. What settings should I use in the celltrek function if I want coordinates returned for every input cell? Alternatively, if cells are missing from the output because they map poorly to the ST dataset, does celltrek return any metric for checking prediction confidence? Let me know if you need any more info to answer my question.

Thanks, Sam

simoncmo commented 2 years ago

Hi CellTrek team,

I'd like to piggyback on smorabit's question, since I have made a similar observation and would really like to understand it for a downstream analysis I'm developing.

To be more specific, I found that CellTrek keeps only about 50% of the cells in my case, BUT duplicates cells within that 50% (sometimes up to 4 times), and so ends up with a higher cell count than the original scRNA object.

To give a little more detail, here's a quick summary of the cell counts I have after the run:

- From A and B: the total number of cells in the CellTrek result is larger than in the scRNA object.
- From C and D: only 49% of the cells in the scRNA object were kept in the CellTrek result.
- From E: CellTrek 'created' ~8k cells.

For the last part (E), I investigated further and realized that CellTrek duplicates original cells and appends `.1, .2, .3, .4` to the IDs.

For instance, this ID got duplicated 4 times (5 copies including the original ID).

Note: celltrek_id_orig is the ID without the .1, .2, .3, .4 suffix. in_scRNA/Group indicates whether celltrek_id is present in the original scRNA object (Group: orig = from scRNA, extend = duplications).

Lastly, I did a quick count of how many cell IDs were duplicated and how often, to visualize the duplications in cell IDs.

- Note that the total here is 5090, the number of scRNA cell IDs kept in the CellTrek result.
- Multiplying each cell by the number of times it appears gives 12946, the total number of cells in the CellTrek result.
- Apart from the 1728 cells that are 'Unique' (not 'Extended'), the rest of the cells were duplicated a varying number of times.
- For instance, the AAACCGAAGCTATGAC.1 cell from the table above falls into the last category (5 times) in this plot.
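
The bookkeeping described above (strip the appended suffix, then count how often each original ID appears) can be reproduced with a short script. This is a minimal sketch on made-up IDs, not the code used for the tables and plot in this comment; the names `original_ids`, `celltrek_ids`, and `strip_dup_suffix` are hypothetical. An ID that already exists in the scRNA object is left untouched, so an original ID that happens to end in a `.<number>` pattern is not mistaken for a duplicate.

```python
import re
from collections import Counter

def strip_dup_suffix(cell_id, original_ids):
    """Map an output ID back to its original scRNA ID.

    IDs already present in the scRNA object are returned unchanged;
    otherwise a trailing ".<number>" suffix is removed.
    """
    if cell_id in original_ids:
        return cell_id
    return re.sub(r"\.\d+$", "", cell_id)

# Toy inputs (hypothetical IDs, for illustration only)
original_ids = {"AAAC", "CCCG", "GGGT", "TTTA"}
celltrek_ids = ["AAAC", "AAAC.1", "AAAC.2", "CCCG", "GGGT", "GGGT.1"]

# How many times does each original cell appear in the output?
copies = Counter(strip_dup_suffix(i, original_ids) for i in celltrek_ids)

# How many cells appear once, twice, three times, ...?
distribution = Counter(copies.values())

kept_fraction = len(copies) / len(original_ids)   # share of scRNA cells kept
extra_cells = len(celltrek_ids) - len(copies)     # cells "created" by duplication

print(dict(copies))
print(dict(distribution))
print(kept_fraction, extra_cells)
```

On real data, the same two counts correspond to the "kept" total (5090 here) and the difference between the CellTrek total and that number.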

Overall

From what I can read in the paper and on GitHub, I cannot find a good explanation for this effect. I was wondering:

- Is this intended, or is it an artifact of the script?
- If it's intended, how can it be explained biologically?

Thank you; I would really appreciate comments from the CellTrek team on this matter.

Thank you!! Simon

WandeRum commented 2 years ago

Hi Simon and Sam. Thank you for sharing your questions, and thanks for such a detailed investigation.

We provide several parameters in the CellTrek function: top_spot=5 means one cell can be mapped to at most 5 spots, spot_n=5 means one spot can contain at most 5 cells, and dist_thresh=0.55 sets a distance threshold for mapping. We do this for two reasons:

1) In many cases we observed biologically and molecularly similar regions in different areas of the tissue (less global spatial structure, more local structure). For example, two spatially distinct ductal structures in human breast tissue can show almost the same expression and histological patterns. We therefore allow some redundancy in the cell mapping.
2) We made the default parameters fairly strict, which yields fewer mapped cells, to avoid overestimating the number of cells spatially. This is why only part of the cells are mapped.

You can decrease top_spot to reduce cell redundancy, increase intp_pnt to get more augmented spots, and increase dist_thresh to allow looser mapping. Keep in mind that as more cells are mapped, false positives may also increase. Hope this helps. We are working on making these parameters more intuitive in our next tutorial.
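
To make the interaction of the three limits concrete, here is a toy sketch in Python. It is not CellTrek's algorithm or API: `toy_map` is a hypothetical greedy assignment over a made-up cell-to-spot distance matrix, and it only borrows the parameter names `dist_thresh`, `top_spot`, and `spot_n` with the meanings given above. It shows how one cell can be placed several times while other cells are dropped, and why loosening only the distance threshold may still leave cells unmapped.

```python
def toy_map(dist, dist_thresh, top_spot, spot_n):
    """Greedy toy assignment (illustration only, not CellTrek's method).

    dist[c][s] is the distance between cell c and spot s.
    A (cell, spot) pair is kept if its distance is within dist_thresh,
    the cell has fewer than top_spot placements so far, and the spot
    holds fewer than spot_n cells so far. Closest pairs are tried first.
    """
    n_cells, n_spots = len(dist), len(dist[0])
    candidates = sorted(
        (dist[c][s], c, s)
        for c in range(n_cells)
        for s in range(n_spots)
        if dist[c][s] <= dist_thresh
    )
    per_cell = [0] * n_cells
    per_spot = [0] * n_spots
    kept = []
    for _, c, s in candidates:
        if per_cell[c] < top_spot and per_spot[s] < spot_n:
            kept.append((c, s))
            per_cell[c] += 1
            per_spot[s] += 1
    return kept

# Made-up distances: 4 cells (rows) x 2 spots (columns)
dist = [
    [0.10, 0.20],  # cell 0: close to both spots
    [0.30, 0.90],  # cell 1: close to spot 0 only
    [0.40, 0.80],  # cell 2: moderately close to spot 0
    [0.90, 0.95],  # cell 3: far from everything
]

strict = toy_map(dist, dist_thresh=0.55, top_spot=2, spot_n=2)
loose = toy_map(dist, dist_thresh=2.0, top_spot=2, spot_n=2)
no_dup = toy_map(dist, dist_thresh=2.0, top_spot=1, spot_n=2)

print(strict)  # cell 0 placed twice; cells 2 and 3 missing
print(loose)   # cell 2 now placed; cell 3 still missing (spots are full)
print(no_dup)  # every cell placed exactly once
```

In this toy setup the two spots can hold only four placements in total, so as long as cell 0 occupies two of them, raising `dist_thresh` alone cannot place every cell; lowering `top_spot` frees the capacity.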

simoncmo commented 2 years ago

Hi WandeRum,

Thank you for the prompt reply and explanation! Yeah, that makes sense to me, and it sounds like it will take some effort to find the correct parameter to use for a given sample.

I also wonder how mapping the same scRNA cell to multiple locations could affect downstream analysis. My first thought is that one might need to be cautious when interpreting downstream results, especially those sensitive to the number of cells.

I also wonder how much one can read into potential interactions between cells mapped close to each other.

Also, is there a metric that tells how confident the mapping of each cell is in the CellTrek result?

This could be helpful for identifying optimal parameter settings for the downstream analysis.

Thanks!

Simon

cadyyuheng commented 2 years ago

Hi @WandeRum

To follow up on this question, is there a way to force assign all single cells to a spatial location? I tried to change dist_thresh=0.55 to dist_thresh=2, but still didn't get all my single cells with at least one position assigned. What would be a setting so that all cells could be assigned? dist_thresh=10? dist_thresh=20? dist_thresh=100?
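
One possible reason a larger dist_thresh alone does not place every cell is capacity. If spot_n works as described earlier in this thread (at most spot_n cells per spot), the number of placements is bounded by the number of spatial points times spot_n, whatever the distance threshold. The sketch below is plain arithmetic under that assumption; the point count is a hypothetical value, not a CellTrek default, and whether interpolated points from intp_pnt count toward it is also an assumption.

```python
import math

def max_placements(n_points, spot_n):
    """Upper bound on cell placements if each point holds at most spot_n cells."""
    return n_points * spot_n

def points_needed(n_cells, spot_n):
    """Minimum number of points required to place every cell at least once."""
    return math.ceil(n_cells / spot_n)

n_cells = 500_000   # order of magnitude of the dataset in the first post
n_points = 10_000   # hypothetical number of spatial points (assumption)
spot_n = 5          # value quoted earlier in the thread

capacity = max_placements(n_points, spot_n)
needed = points_needed(n_cells, spot_n)

print(capacity, capacity >= n_cells)
print(needed)
```

Under these assumed numbers, either spot_n or the number of points would have to grow before every cell could receive a position, and any cell mapped to more than one spot reduces the room left for others.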
