cell type annotations

grst commented 5 years ago

We need to assign a cell type to each single cell, in order to select interesting populations and to be able to test the data integration tools.

possible methodologies:

SingleR uses reference profiles (bulk RNA-seq/array) of pure immune cells to annotate cells. It does not scale to large datasets, but could be run in small chunks.
Schelker: semi-supervised.
- run unsupervised clustering
- use marker genes to calculate a score for each cell
- use high scoring cells intersected with clusters for training a decision tree
- use decision tree to annotate cell types.
Scanpy/Seurat tutorial:
- Unsupervised clustering (Louvain)
- all-against-all DE
- assign cell type to the clusters based on the marker genes identified through DE (hope that the marker genes are recovered)
10x provides single cell data of FACS-purified cells (bottom of page). Could potentially be interesting for annotation.
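
The first steps of the semi-supervised idea above (score each cell with marker genes, then intersect high-scoring cells with clusters) could be sketched roughly as follows. This is only an illustration on toy data, not the code from Schelker et al.: the function names, the mean-expression score, the 0.75 quantile cutoff and the choice of CD3E/CD19 as markers are all assumptions.

```python
import numpy as np

def marker_score(expr, gene_names, markers):
    """Mean expression of the given marker genes for each cell (assumed score)."""
    idx = [gene_names.index(g) for g in markers]
    return expr[:, idx].mean(axis=1)

def select_training_cells(expr, gene_names, clusters, markers_by_type, quantile=0.75):
    """Return {cell_type: indices of high-scoring cells inside the best-scoring cluster}."""
    clusters = np.asarray(clusters)
    training = {}
    for cell_type, markers in markers_by_type.items():
        score = marker_score(expr, gene_names, markers)
        # cluster whose cells have the highest mean score for this type
        best = max(np.unique(clusters), key=lambda c: score[clusters == c].mean())
        # cells above the (arbitrary) score quantile
        high = score >= np.quantile(score, quantile)
        training[cell_type] = np.flatnonzero(high & (clusters == best))
    return training

# toy data: 6 cells x 3 genes, cells 2 and 5 suffer from marker dropout
genes = ["CD3E", "CD19", "ACTB"]
expr = np.array([
    [2, 0, 1], [3, 0, 1], [0, 0, 1],   # cluster 0 (T-cell like)
    [0, 2, 1], [0, 3, 1], [0, 0, 1],   # cluster 1 (B-cell like)
], dtype=float)
clusters = [0, 0, 0, 1, 1, 1]
training = select_training_cells(expr, genes, clusters,
                                 {"T cell": ["CD3E"], "B cell": ["CD19"]})
```

The selected cells would then serve as the training set for the decision tree, which is expected to recover the cells that were missed because of dropout.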

@Hoohm, @FFinotello, do you have further ideas how to do this properly?

FFinotello commented 5 years ago

Maybe scmap? But you need to build a reference first. https://www.nature.com/articles/nmeth.4644

grst commented 5 years ago

sounds interesting. That would basically make the batch effect removal unnecessary? That could work well with the 10x reference samples.
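
To make the idea of mapping cells onto reference profiles concrete, here is a minimal sketch of a plain correlation-based assignment. It is not scmap itself and does not use its API. The function name `map_to_reference`, the `min_corr` cutoff of 0.7 and the toy profiles are assumptions for illustration.

```python
import numpy as np

def map_to_reference(cells, reference, labels, min_corr=0.7):
    """Label each cell with the best Pearson-correlated reference profile."""
    # center and scale rows so that the dot product equals the Pearson correlation
    c = cells - cells.mean(axis=1, keepdims=True)
    r = reference - reference.mean(axis=1, keepdims=True)
    c_norm = np.linalg.norm(c, axis=1, keepdims=True)
    r_norm = np.linalg.norm(r, axis=1, keepdims=True)
    c_norm[c_norm == 0] = 1.0  # constant profiles get correlation 0
    r_norm[r_norm == 0] = 1.0
    corr = (c / c_norm) @ (r / r_norm).T
    best = corr.argmax(axis=1)
    # cells whose best correlation is below the cutoff stay unassigned
    return [labels[j] if corr[i, j] >= min_corr else "unassigned"
            for i, j in enumerate(best)]

# toy reference: one profile per FACS-purified population (4 genes)
reference = np.array([[5, 0, 0, 1], [0, 5, 0, 1]], dtype=float)
labels = ["CD8 T", "B"]
cells = np.array([[4, 0, 1, 1], [0, 3, 0, 1], [1, 1, 1, 1]], dtype=float)
assigned = map_to_reference(cells, reference, labels)
```

Leaving low-correlation cells unassigned is a design choice here: cell types that are missing from the reference should not be forced onto the nearest profile.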

FFinotello commented 5 years ago

I would give it a try but I am not sure whether it is better to consider single data sets or the full compendium. It depends on the preprocessing steps performed internally by the tool. I should have a look at the code.

grst commented 5 years ago

New resource of (bulk) reference profiles: https://dice-database.org/downloads from https://www.cell.com/cell/fulltext/S0092-8674(18)31331-X

grst commented 5 years ago

Poster from ECCB about cell type classification 2085_001.pdf

grst commented 5 years ago

List of marker genes validated with co-expression in TCGA https://jitc.biomedcentral.com/track/pdf/10.1186/s40425-017-0215-8

grst commented 5 years ago

Moving forward, I see the following possibilities:

Use scmap to map all cells to the 10x single-cell reference profiles of FACS-purified cells. Downsides:
- limited to the reference profiles available (should do in our case)
- apparently hard to distinguish e.g. NK cells from CD8+ cells (at least with the naive correlation-based approach used in the 10x paper by Zheng, Bielas et al.; might be better with scmap)
- reference profiles are from PBMC, maybe cells from the microenvironment don't match properly.
Use dca or k-NN imputation, then simply assign each cell to the type whose marker genes it expresses (above a threshold). Rationale: if sc data were not sparse, the task would be easy, since marker genes are expressed in a certain cell type and in that cell type only. But as sc data is sparse, marker expression alone would only detect a small subset of the cells. -> use imputation.
Implement something along the lines of Schelker et al. (2017). Basically same assumption as imputation. Use cells that do express markers as training set. Train a classifier (e.g. random forest) and use it to assign the other cells.
- Note: exclude the marker gene as feature, as the cells we want to annotate don't express it.
- Hierarchical approach might be interesting: i.e. identify all T cells first, then classify subtypes. This mitigates problems with overlapping marker genes (e.g. NK cells and CD8+ cells)
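
A rough sketch of the classifier idea: cells that express a marker serve as the training set, and the marker genes themselves are dropped from the features. To keep the example dependency-free, a nearest-centroid classifier stands in for the random forest mentioned above. The function name, the one-marker-per-type simplification and the toy data are assumptions.

```python
import numpy as np

def annotate_by_seed_cells(expr, gene_names, marker_by_type):
    """Train on marker-positive cells, predict every cell without using the markers."""
    gene_names = list(gene_names)
    types = list(marker_by_type)
    marker_idx = [gene_names.index(marker_by_type[t]) for t in types]

    # features: every gene except the markers, since the cells we want
    # to annotate do not express them
    feature_mask = np.ones(len(gene_names), dtype=bool)
    feature_mask[marker_idx] = False
    features = expr[:, feature_mask]

    # seed cells express exactly one marker (assumes each type has at least one seed)
    positive = expr[:, marker_idx] > 0
    unambiguous = positive.sum(axis=1) == 1
    centroids = np.vstack([
        features[positive[:, k] & unambiguous].mean(axis=0)
        for k in range(len(types))
    ])

    # nearest centroid (Euclidean distance) as a stand-in for a random forest
    dist = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return [types[k] for k in dist.argmin(axis=1)]

genes = ["CD8A", "CD19", "GZMK", "CD79A"]
expr = np.array([
    [3, 0, 4, 0],   # CD8A-positive seed
    [2, 0, 5, 0],   # CD8A-positive seed
    [0, 0, 4, 1],   # marker dropout, T-cell-like otherwise
    [0, 2, 0, 5],   # CD19-positive seed
    [0, 0, 1, 4],   # marker dropout, B-cell-like otherwise
], dtype=float)
predicted = annotate_by_seed_cells(
    expr, genes, {"CD8+ T cell": "CD8A", "B cell": "CD19"})
```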

grst commented 5 years ago

On the other hand, when only interested in CD8+ T cells, it could be pragmatic to simply apply unsupervised clustering and extract all clusters that express CD8A/B for further analysis.

In that case all other cell types would not be annotated.
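
A minimal sketch of this pragmatic route, assuming cluster labels are already available from unsupervised clustering. The function name `clusters_expressing` and the `min_fraction` cutoff of 0.3 are arbitrary choices for illustration.

```python
import numpy as np

def clusters_expressing(expr, gene_names, clusters,
                        genes=("CD8A", "CD8B"), min_fraction=0.3):
    """Clusters in which at least min_fraction of cells express any of the genes."""
    gene_names = list(gene_names)
    clusters = np.asarray(clusters)
    idx = [gene_names.index(g) for g in genes]
    positive = (expr[:, idx] > 0).any(axis=1)
    return [c for c in np.unique(clusters).tolist()
            if positive[clusters == c].mean() >= min_fraction]

genes = ["CD8A", "CD8B", "CD4"]
expr = np.array([
    [1, 0, 0], [0, 2, 0], [0, 0, 0],   # cluster 0: 2 of 3 cells positive
    [0, 0, 3], [0, 0, 1], [0, 0, 2],   # cluster 1: no positive cell
    [0, 0, 0], [1, 0, 0],              # cluster 2: 1 of 2 cells positive
], dtype=float)
clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])
selected = clusters_expressing(expr, genes, clusters)
keep = np.isin(clusters, selected)   # boolean mask of cells kept for further analysis
```

Selecting whole clusters rather than single cells means that cells without detected CD8A/B counts are still kept when their cluster passes the cutoff.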

FFinotello commented 5 years ago

@grst, you may get some false negatives due to dropout (false zero expression of CD8A/B).

You could check this potential issue in the sorted CD8+ cells from https://www.nature.com/articles/ncomms14049
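
Such a check could look roughly like this: given a count matrix of sorted CD8+ cells, compute the fraction of cells with no detected CD8A/B counts, i.e. the cells a purely marker-based filter would miss. The function name and the toy counts are assumptions. Loading the actual dataset is not shown.

```python
import numpy as np

def dropout_rate(counts, gene_names, genes=("CD8A", "CD8B")):
    """Fraction of cells with zero counts for all of the given genes."""
    gene_names = list(gene_names)
    idx = [gene_names.index(g) for g in genes]
    return float((counts[:, idx] == 0).all(axis=1).mean())

# toy counts standing in for sorted CD8+ cells (4 cells x 2 genes)
genes = ["CD8A", "CD8B"]
counts = np.array([[1, 0], [0, 0], [0, 2], [0, 0]])
rate_both = dropout_rate(counts, genes)                    # neither CD8A nor CD8B detected
rate_cd8a = dropout_rate(counts, genes, genes=("CD8A",))   # CD8A alone not detected
```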

grst commented 5 years ago

Will build something along the lines of Schelker et al. now.

The approach is hierarchical:

The first level of cell type annotation should be based on lineage markers that are specific (expressed in a group of cell types and only there).
The second level further divides these populations using cell type specific markers.
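
The two levels could be organised along these lines. This is a rule-based sketch only, without the classifier. The marker choices (CD3E, CD19, CD8A, CD4), the `HIERARCHY` layout and the function names are illustrative assumptions, not a vetted marker list.

```python
import numpy as np

# level 1: lineage markers; level 2: subtype markers within a lineage
HIERARCHY = {
    "T cell": {"markers": ["CD3E"],
               "subtypes": {"CD8+ T cell": ["CD8A"], "CD4+ T cell": ["CD4"]}},
    "B cell": {"markers": ["CD19"], "subtypes": {}},
}

def best_label(row, gene_names, markers_by_label):
    """Label with the highest mean marker expression, or None if nothing is expressed."""
    scores = {label: np.mean([row[gene_names.index(g)] for g in markers])
              for label, markers in markers_by_label.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else None

def annotate(expr, gene_names, hierarchy):
    """Return one (lineage, subtype) tuple per cell."""
    lineage_markers = {k: v["markers"] for k, v in hierarchy.items()}
    result = []
    for row in expr:
        lineage = best_label(row, gene_names, lineage_markers)
        if lineage is None:
            result.append(("unknown", None))
            continue
        # subtype markers are only compared within the assigned lineage
        subtypes = hierarchy[lineage]["subtypes"]
        subtype = best_label(row, gene_names, subtypes) if subtypes else None
        result.append((lineage, subtype))
    return result

genes = ["CD3E", "CD19", "CD8A", "CD4"]
expr = np.array([
    [2, 0, 3, 0], [1, 0, 0, 2], [0, 4, 0, 0], [0, 0, 0, 0], [2, 0, 0, 0],
], dtype=float)
annotation = annotate(expr, genes, HIERARCHY)
```

Because subtype markers are only evaluated inside a lineage, a marker shared with another lineage cannot cause a mislabel at the second level.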

Resources:

comprehensive list in Janeway's Immunobiology.
this figure from Wikipedia

grst commented 5 years ago

Francesca pointed me to moana (https://www.biorxiv.org/content/biorxiv/early/2018/10/30/456129.full.pdf).

It uses a hierarchical classifier similar to the one I planned to implement.
It makes use of kNN-smoothing to overcome technical noise.

However, models are dataset-specific and the training data is obtained from manual clustering (for which they also provide a framework).

==> I will stick to my marker-gene based approach.

(Could be interesting to make use of the kNN-smoothing though)
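
For reference, the basic idea behind smoothing over nearest neighbours can be sketched in a few lines. This is a simplified one-step variant, not the published kNN-smoothing algorithm: neighbours are found by Euclidean distance on log-transformed counts, and the function name `knn_smooth` and the toy counts are assumptions.

```python
import numpy as np

def knn_smooth(counts, k=1):
    """Replace each cell by the summed counts of itself and its k nearest neighbours."""
    counts = np.asarray(counts, dtype=float)
    logx = np.log1p(counts)
    # pairwise Euclidean distances between cells in log space
    dist = np.linalg.norm(logx[:, None, :] - logx[None, :, :], axis=2)
    np.fill_diagonal(dist, -1.0)   # guarantees each cell is its own first neighbour
    neighbours = np.argsort(dist, axis=1, kind="stable")[:, :k + 1]
    return counts[neighbours].sum(axis=1)

# toy counts, genes = [CD8A, CD19, GZMK]; cell 2 has CD8A dropout
counts = np.array([
    [2, 0, 5],
    [3, 1, 4],
    [0, 0, 5],
    [0, 6, 0],
    [0, 5, 0],
])
smoothed = knn_smooth(counts, k=1)
```

After smoothing, cell 2 borrows CD8A counts from its nearest neighbour, so a marker threshold would now pick it up.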

grst commented 5 years ago

For development of the classifier, see https://github.com/grst/single_cell_classification

grst / single_cell_data_integration

cell type annotations #12