stemangiola / CuratedAtlasQueryR

Tidy R query API for the harmonised and curated CELLxGENE single-cell atlas.
https://stemangiola.github.io/CuratedAtlasQueryR/
GNU General Public License v3.0

integrate cell types and reclassify #13

Open stemangiola opened 2 years ago

stemangiola commented 2 years ago

1) Divide cells based on macro clusters (e.g. B cells, CD8 T cells, monocytes). This is not always trivial: we have some high-confidence annotations, but some cells cannot be easily classified as T cells, B cells or monocytes.

stemangiola commented 2 years ago

I'll start by proposing a small number of transcriptomic markers; please extend this list if you can.

@ConnieLWS could you please add your gene list here?

This is the current gene list but it's still being refined:

Tcell.sig <- c("CD3G", "CD4", "CAMK4", "CD2", "CD3D", "CD3E")
Bcell.sig <- c("CD79A", "BANK1", "BLK", "CD19", "CD22", "CD79B", "CPNE5", "FCRL1")
Monocyte.sig <- c("CD68", "CD14", "S100A9", "NKG7")
DC.sig <- c("FCER1A", "CLEC4C", "CIITA", "BCL11A")
NK.sig <- c("GNLY", "KLRF1", "NKG7", "KLRD1", "PRF1")

FYI @goknurginer
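One way such a short marker list could be used to split cells into macro clusters is sketched below. It is a minimal illustration in base R, not the method used in this project: `score_signatures()` is a hypothetical helper, the expression matrix is simulated, and the "mean z-score per signature, pick the best" rule is an assumption made for the example.

```r
# Marker signatures copied from the list above.
signatures <- list(
  Tcell    = c("CD3G", "CD4", "CAMK4", "CD2", "CD3D", "CD3E"),
  Bcell    = c("CD79A", "BANK1", "BLK", "CD19", "CD22", "CD79B", "CPNE5", "FCRL1"),
  Monocyte = c("CD68", "CD14", "S100A9", "NKG7"),
  DC       = c("FCER1A", "CLEC4C", "CIITA", "BCL11A"),
  NK       = c("GNLY", "KLRF1", "NKG7", "KLRD1", "PRF1")
)

# Hypothetical helper: score every cell by the mean z-scored abundance of each
# signature and label it with the best-scoring signature.
# `expr` is assumed to be a genes x cells matrix of log-normalised abundance.
score_signatures <- function(expr, signatures) {
  genes <- intersect(unique(unlist(signatures)), rownames(expr))
  # scale() works column-wise, so transpose to z-score each gene across cells
  z <- t(scale(t(expr[genes, , drop = FALSE])))
  z[is.na(z)] <- 0  # zero-variance genes would otherwise give NaN
  scores <- sapply(signatures, function(sig) {
    colMeans(z[intersect(sig, genes), , drop = FALSE])
  })
  data.frame(
    cell = colnames(expr),
    macro_cluster = colnames(scores)[max.col(scores, ties.method = "first")],
    scores,
    row.names = NULL
  )
}

# Simulated data (assumption): 20 cells per type, each with its own markers raised.
set.seed(1)
genes <- unique(unlist(signatures))
truth <- rep(names(signatures), each = 20)
expr <- matrix(
  rnorm(length(genes) * length(truth), sd = 0.3),
  nrow = length(genes),
  dimnames = list(genes, paste0("cell", seq_along(truth)))
)
for (type in names(signatures)) {
  sig <- signatures[[type]]
  expr[sig, truth == type] <- expr[sig, truth == type] + 3
}

result <- score_signatures(expr, signatures)
table(truth, predicted = result$macro_cluster)
```

On real data the per-signature scores are worth inspecting before trusting the label, since cells that "cannot be easily classified" will tend to have several similar scores.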

ConnieLWS commented 2 years ago

Do you want tissue-specific marker genes for immune cells? If so, which tissue types would you like to focus on first?

stemangiola commented 2 years ago

> Do you want tissue-specific marker genes for immune cells? If so, which tissue types would you like to focus on first?

No, just a very small list of generic markers that would cluster the integrated 11M cells across all tissues. After we divide the cells into major macro clusters, we will integrate them separately using all genes.

stemangiola commented 2 years ago

We should "validate" our small gene signature on the high-confidence cell types, for example using boxplots of the scaled gene-transcript abundance.

To obtain the high-confidence cells, you can do:

metadata |> filter(confidence_class == 1)
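The validation idea can be sketched as follows. This is a toy example in base R under stated assumptions: the `metadata` and `abundance` objects are simulated stand-ins, the `cell_type` column name is hypothetical, and only `confidence_class` comes from the snippet above. `subset()` is used as a base-R equivalent of the `filter()` call.

```r
set.seed(42)

# Simulated stand-ins (assumptions): one metadata row per cell, and a
# genes x cells matrix of log-normalised abundance for two markers.
metadata <- data.frame(
  cell = paste0("cell", 1:200),
  cell_type = rep(c("B", "T"), each = 100),
  confidence_class = sample(1:3, 200, replace = TRUE),
  stringsAsFactors = FALSE
)
abundance <- matrix(
  rnorm(2 * 200),
  nrow = 2,
  dimnames = list(c("CD79A", "CD3E"), metadata$cell)
)
is_b <- metadata$cell_type == "B"
abundance["CD79A", is_b] <- abundance["CD79A", is_b] + 2
abundance["CD3E", !is_b] <- abundance["CD3E", !is_b] + 2

# Keep only the high-confidence cells
high_conf <- subset(metadata, confidence_class == 1)

# z-score each gene across the retained cells
scaled <- t(scale(t(abundance[, high_conf$cell, drop = FALSE])))

# One boxplot panel per marker, split by annotated cell type
op <- par(mfrow = c(1, nrow(scaled)))
for (gene in rownames(scaled)) {
  boxplot(
    scaled[gene, ] ~ high_conf$cell_type,
    main = gene, xlab = "cell type", ylab = "scaled abundance"
  )
}
par(op)

# Numeric summary of the same comparison: median per cell type and marker
medians <- sapply(rownames(scaled), function(gene) {
  tapply(scaled[gene, ], high_conf$cell_type, median)
})
medians
```

A marker that validates well should show a clearly higher median in its own cell type than in the others.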

stemangiola commented 2 years ago

In the meantime, while @multimeric adds a couple of features we need, let's start with the MNN (scater) integration method using 10-50 genes, and start with 100K cells (we have 11M immune cells in total).
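For readers unfamiliar with MNN, the core idea on a small marker set can be sketched in base R. This is a toy illustration, not the packaged implementation a real analysis would use: `find_mnn_pairs()` is a hypothetical helper, the two batches are simulated, and the single average correction vector is a deliberate simplification of the locally smoothed correction that MNN methods apply.

```r
# Hypothetical helper: find mutual nearest neighbour (MNN) pairs between two
# batches. `a` and `b` are cells x genes matrices over the same marker genes.
find_mnn_pairs <- function(a, b, k = 5) {
  # Squared Euclidean distance between every cell in `a` and every cell in `b`
  d <- outer(rowSums(a^2), rowSums(b^2), "+") - 2 * tcrossprod(a, b)
  # k nearest cells in `b` for each cell in `a`, and the reverse
  nn_ab <- apply(d, 1, function(x) order(x)[seq_len(k)])
  nn_ba <- apply(d, 2, function(x) order(x)[seq_len(k)])
  a_to_b <- matrix(FALSE, nrow(a), nrow(b))
  a_to_b[cbind(rep(seq_len(nrow(a)), each = k), as.vector(nn_ab))] <- TRUE
  b_to_a <- matrix(FALSE, nrow(a), nrow(b))
  b_to_a[cbind(as.vector(nn_ba), rep(seq_len(nrow(b)), each = k))] <- TRUE
  # A pair is "mutual" when each cell is among the other's k nearest neighbours
  pairs <- which(a_to_b & b_to_a, arr.ind = TRUE)
  colnames(pairs) <- c("a", "b")
  pairs
}

# Simulated data (assumption): two cell types measured on 10 marker genes,
# with the second batch shifted by a constant batch effect.
set.seed(7)
n_genes <- 10
centres <- rbind(rep(c(3, 0), each = 5), rep(c(0, 3), each = 5))
make_batch <- function(n, offset) {
  type <- sample(1:2, n, replace = TRUE)
  centres[type, ] + matrix(rnorm(n * n_genes, sd = 0.3), n) + offset
}
batch1 <- make_batch(150, offset = 0)
batch2 <- make_batch(150, offset = 1)

pairs <- find_mnn_pairs(batch1, batch2, k = 5)

# Simplification: one average correction vector for the whole batch
shift <- colMeans(
  batch1[pairs[, "a"], , drop = FALSE] - batch2[pairs[, "b"], , drop = FALSE]
)
batch2_corrected <- sweep(batch2, 2, shift, "+")

# Centroid gap between batches before and after correction
gap_before <- mean(abs(colMeans(batch1) - colMeans(batch2)))
gap_after <- mean(abs(colMeans(batch1) - colMeans(batch2_corrected)))
c(before = gap_before, after = gap_after)
```

The full distance matrix makes this sketch quadratic in the number of cells, which is one reason to prototype on a subsample and a short gene list before scaling up.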

stemangiola commented 2 years ago

@ConnieLWS @multimeric FYI

"A unified analysis of atlas single cell data"

https://www.biorxiv.org/content/10.1101/2022.08.06.503038v1.full

multimeric commented 2 years ago

Here are some I think I'll try to benchmark, based on Connie's literature review:

stemangiola commented 2 years ago

> Here are some I think I'll try to benchmark, based on Connie's literature review:

Great,

multimeric commented 2 years ago

You don't think we have scope for 2 Python tools?

stemangiola commented 2 years ago

> You don't think we have scope for 2 Python tools?

Potentially, but the goal at this stage is to get a "minimum viable product", so we have to use our time parsimoniously. If you find yourself waiting for computation (which we should avoid by testing on small chunks of data), you can work on your figure for the paper (in the to-do list).

multimeric commented 2 years ago

Currently I have no data set to test these tools on anyway.

stemangiola commented 2 years ago

> Currently I have no data set to test these tools on anyway.

You can first implement the tool with dummy data (the dataset queries in the README file). This initial dataset selection should not be a bottleneck.

ConnieLWS commented 1 year ago

Tested initial classification using 27 marker genes. The gene signature is still being refined.

Tcell.sig <- c("CD3G", "CD4", "CAMK4", "CD2", "CD3D", "CD3E")
Bcell.sig <- c("CD79A", "BANK1", "BLK", "CD19", "CD22", "CD79B", "CPNE5", "FCRL1")
Monocyte.sig <- c("CD68", "CD14", "S100A9", "NKG7")
DC.sig <- c("FCER1A", "CLEC4C", "CIITA", "BCL11A")
NK.sig <- c("GNLY", "KLRF1", "NKG7", "KLRD1", "PRF1")
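A quick sanity check on the list above, as a small base-R sketch: it confirms the count of 27 entries and flags any gene that appears in more than one signature. Gathering the vectors into a named `signatures` list is a convenience for the example, not something taken from the thread.

```r
# Signatures copied from the list above, gathered into one named list
signatures <- list(
  Tcell    = c("CD3G", "CD4", "CAMK4", "CD2", "CD3D", "CD3E"),
  Bcell    = c("CD79A", "BANK1", "BLK", "CD19", "CD22", "CD79B", "CPNE5", "FCRL1"),
  Monocyte = c("CD68", "CD14", "S100A9", "NKG7"),
  DC       = c("FCER1A", "CLEC4C", "CIITA", "BCL11A"),
  NK       = c("GNLY", "KLRF1", "NKG7", "KLRD1", "PRF1")
)

all_markers <- unlist(signatures, use.names = FALSE)
n_entries <- length(all_markers)           # 27 entries in total
n_distinct <- length(unique(all_markers))  # 26 distinct genes

# Genes listed in more than one signature, and the signatures that share them
shared <- unique(all_markers[duplicated(all_markers)])  # "NKG7"
shared_in <- names(Filter(function(sig) any(shared %in% sig), signatures))
shared_in                                  # "Monocyte" "NK"
```

Because NKG7 sits in both the monocyte and NK signatures, it cannot by itself separate those two groups; the other markers in each signature have to carry that distinction.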

Initial testing was performed on 2 samples (~10k cells each) from one dataset:

Image

Image