Mismatch in celltype and cell_id for brain data

Tijs-dot commented 10 months ago

Hi,

Thanks for creating this repository. I am trying to use your single-cell brain data. I am using files obtained from https://www.proteinatlas.org/about/download . Specifically, I use these three:

RNA single cell type data: https://www.proteinatlas.org/about/download#:~:text=RNA%20single%20cell%20type%20data
RNA single cell type tissue cluster data: https://www.proteinatlas.org/about/download#:~:text=RNA%20single%20cell%20type%20tissue%20cluster%20data
RNA single cell read count data: https://www.proteinatlas.org/about/download#:~:text=RNA%20single%20cell%20read%20count%20data

I create a Seurat object from the count matrix and process it with SCTransform. I then want to assign each cell the cell type listed in the rna_single_cell_cluster_description.tsv file by matching on cluster. So I match the clusters from these two files:

cell_id  cluster  umap_x    umap_y
1        15       -1.78285  10.8605
2        29       4.7399    2.35851
3        10       14.4604   -0.62549
4        24       6.90615   6.53858
...

Tissue  Cluster  Cell type           Cell type group  Cell count
Brain   c-0      Excitatory neurons  Neuronal cells   4644
Brain   c-1      Excitatory neurons  Neuronal cells   4370
Brain   c-2      Excitatory neurons  Neuronal cells   3732
Brain   c-3      Excitatory neurons  Neuronal cells   3724
...
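For readers following along, the cluster join described above can be sketched roughly as below. This is a minimal, language-agnostic illustration in plain Python, not the workflow used in the thread (which is in R/Seurat). It assumes that an integer cluster `n` in the per-cell file corresponds to the label `c-n` in the description file, which the excerpts suggest but do not confirm. The description rows are copied from the excerpt above; the per-cell rows are made-up toy values, and the helper names `read_tsv` and `annotate_cells` are hypothetical.

```python
import csv
import io

# Description rows copied from the excerpt above (truncated to two clusters).
CLUSTER_DESCRIPTION_TSV = (
    "Tissue\tCluster\tCell type\tCell type group\tCell count\n"
    "Brain\tc-0\tExcitatory neurons\tNeuronal cells\t4644\n"
    "Brain\tc-1\tExcitatory neurons\tNeuronal cells\t4370\n"
)

# TOY per-cell rows, made up for illustration (not real HPA data).
CELL_CLUSTERS_TSV = (
    "cell_id\tcluster\tumap_x\tumap_y\n"
    "1\t0\t0.0\t0.0\n"
    "2\t1\t1.0\t1.0\n"
    "3\t99\t2.0\t2.0\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def annotate_cells(cells, descriptions, tissue="Brain"):
    """Return {cell_id: cell type or None}, joining on the cluster label."""
    # Restrict to one tissue, since the description file has a Tissue column
    # and the same cluster label could appear under several tissues.
    cell_type_by_cluster = {
        row["Cluster"]: row["Cell type"]
        for row in descriptions
        if row["Tissue"] == tissue
    }
    # ASSUMPTION: integer cluster n in the per-cell file maps to label "c-n".
    return {
        row["cell_id"]: cell_type_by_cluster.get("c-" + row["cluster"])
        for row in cells
    }


annotation = annotate_cells(
    read_tsv(CELL_CLUSTERS_TSV), read_tsv(CLUSTER_DESCRIPTION_TSV)
)
# Cells whose cluster has no description row are worth inspecting explicitly.
unmatched = [cell_id for cell_id, cell_type in annotation.items() if cell_type is None]
print(annotation)
print("unmatched cell_ids:", unmatched)
```

Listing the unmatched cells explicitly makes a silent key mismatch (for example `15` versus `c-15`, or rows from another tissue) visible before any plotting.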

When I plot the dataset using the included UMAP coordinates, I get the same plot as shown on the Human Protein Atlas website. Annotating the cells with the cell types matched from the second file also reproduces what is shown on the website. However, when I look at the expression of marker genes such as GAD1 and NEUROD6, it is scattered throughout the UMAP and is not concentrated in the clusters that are supposed to be inhibitory or excitatory neurons. Could this be because I am making a mistake in the files I use when matching cells to their annotation?

Below I have added screenshots of the plot with the annotations from the file (first) and of GAD1 expression (second).

It would be great if you can help me with this! Best, Tijs

maxkarlsson commented 10 months ago

Hi Tijs!

I would recommend sending this question to contact@proteinatlas.org; they will make sure it reaches the people responsible for the cluster annotation. This repo describes the analysis made for the original paper, but there have been multiple HPA versions since then.

However, high dropout rates are common in scRNA-seq due to transcriptional bursting and technical bias, so this could explain the "scattered" expression in the plot. Sometimes Seurat's VlnPlot() function does a better job of showing where most of the expression is.
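To illustrate the dropout point: with many zero counts, individual cells can look "scattered" even when a cluster is clearly enriched for a gene, so a per-cluster summary is often easier to read. The sketch below computes the fraction of cells per cluster with a nonzero count for one gene. It is a minimal plain-Python illustration with made-up toy numbers, not HPA data and not what VlnPlot() computes internally; the function name `detection_rate_by_cluster` is hypothetical.

```python
from collections import defaultdict


def detection_rate_by_cluster(counts, clusters):
    """Fraction of cells per cluster with a nonzero count for one gene."""
    total = defaultdict(int)
    detected = defaultdict(int)
    for count, cluster in zip(counts, clusters):
        total[cluster] += 1
        if count > 0:
            detected[cluster] += 1
    return {cluster: detected[cluster] / total[cluster] for cluster in total}


# TOY data (made up): raw counts of one marker gene in eight cells.
marker_counts = [0, 3, 0, 1, 0, 0, 0, 1]
cell_clusters = ["inhibitory"] * 4 + ["excitatory"] * 4

rates = detection_rate_by_cluster(marker_counts, cell_clusters)
print(rates)
```

In this toy example half of the "inhibitory" cells have zero counts, yet the detection rate is still twice that of the "excitatory" cluster, which is the kind of difference a per-cluster view brings out.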

I hope that helps.

Best regards, Max

Tijs-dot commented 10 months ago

Hi Max, I had made a mistake in the normalization; it looks as it should now :) Thanks!

maxkarlsson commented 10 months ago

Hi Tijs!

Happy to hear that you found a solution!

Best, Max

maxkarlsson / HPA-SingleCellType

Mismatch in celltype and cell_id for brain data #2