grst closed 5 years ago
Maybe scmap? But you need to build a reference first. https://www.nature.com/articles/nmeth.4644
Sounds interesting. Would that basically make the batch effect removal unnecessary? That could work well with the 10x reference samples.
I would give it a try but I am not sure whether it is better to consider single data sets or the full compendium. It depends on the preprocessing steps performed internally by the tool. I should have a look at the code.
New resource of (bulk) reference profiles: https://dice-database.org/downloads from https://www.cell.com/cell/fulltext/S0092-8674(18)31331-X
Poster from ECCB about cell type classification 2085_001.pdf
List of marker genes validated with co-expression in TCGA https://jitc.biomedcentral.com/track/pdf/10.1186/s40425-017-0215-8
Moving forward, I see the following possibilities:
Use [scmap] to map all cells to 10x single-cell reference profiles of FACS-purified cells. Downsides:
Use dca or k-NN imputation. Then simply assign each cell to the type whose marker genes it expresses (above a threshold). Rationale: if sc data were not sparse, the task would be easy, because marker genes are expressed in one cell type and in that cell type only. But as sc data is sparse, marker expression alone would detect only a small subset of the cells. -> use imputation.
Implement something along the lines of Schelker et al. (2017). Basically the same assumption as for imputation. Use the cells that do express markers as the training set. Train a classifier (e.g. random forest) and use it to assign the other cells.
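To make the third option more concrete, here is a minimal sketch of the "seed with markers, then classify the rest" idea. It is not the method of Schelker et al.: a nearest-centroid rule stands in for the random forest so the example has no dependencies, and the marker table, the threshold and all function names are assumptions made up for illustration.

```python
import math

# Illustrative marker table (an assumption, not a validated marker list).
MARKERS = {
    "CD8 T cell": ["CD8A", "CD8B"],
    "B cell": ["CD19", "MS4A1"],
}


def seed_labels(cells, markers=MARKERS, threshold=1):
    """Label a cell only if it expresses markers of exactly one cell type."""
    labels = []
    for cell in cells:
        hits = [
            cell_type
            for cell_type, genes in markers.items()
            if any(cell.get(g, 0) >= threshold for g in genes)
        ]
        labels.append(hits[0] if len(hits) == 1 else None)
    return labels


def profile(cell, genes):
    """Log-transformed expression vector of one cell over a fixed gene order."""
    return [math.log1p(cell.get(g, 0)) for g in genes]


def fit_centroids(cells, labels, genes):
    """Average the profiles of the seeded cells of each type."""
    grouped = {}
    for cell, label in zip(cells, labels):
        if label is not None:
            grouped.setdefault(label, []).append(profile(cell, genes))
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(cells, markers=MARKERS, threshold=1):
    """Keep seed labels; assign every other cell to its nearest centroid."""
    genes = sorted({g for cell in cells for g in cell})
    labels = seed_labels(cells, markers, threshold)
    centroids = fit_centroids(cells, labels, genes)
    if not centroids:
        raise ValueError("no cell expresses any marker; nothing to train on")
    result = []
    for cell, label in zip(cells, labels):
        if label is None:
            vec = profile(cell, genes)
            label = min(
                centroids,
                key=lambda t: sum((a - b) ** 2 for a, b in zip(vec, centroids[t])),
            )
        result.append(label)
    return result


# Toy cells as {gene: count} dicts; the last two lost their markers to dropout.
cells = [
    {"CD8A": 3, "NKG7": 5},
    {"CD8B": 2, "NKG7": 4},
    {"MS4A1": 4, "CD79A": 6},
    {"CD19": 1, "CD79A": 5},
    {"NKG7": 6},
    {"CD79A": 4},
]
print(classify(cells))
```

The point of the sketch is that the classifier sees all genes, so a cell without detectable marker counts can still be assigned through the rest of its profile. With real data one would presumably swap the centroid rule for a proper classifier and hold out some seeded cells to estimate accuracy.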
On the other hand, when only interested in CD8+ T cells, it could be pragmatic to simply apply unsupervised clustering and extract all clusters that express CD8A/B for further analysis.
In that case all other cell types would not be annotated.
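A rough sketch of that pragmatic route, assuming cluster assignments already exist from whatever unsupervised method is used. The function name, the `{gene: count}` cell representation and the `min_fraction` cutoff are all hypothetical; the cutoff in particular would need tuning on real data.

```python
from collections import defaultdict


def cd8_clusters(cells, cluster_ids, genes=("CD8A", "CD8B"), min_fraction=0.25):
    """Return the ids of clusters in which at least `min_fraction` of the
    cells have a nonzero count for any of `genes`."""
    total = defaultdict(int)
    positive = defaultdict(int)
    for cell, cluster in zip(cells, cluster_ids):
        total[cluster] += 1
        if any(cell.get(g, 0) > 0 for g in genes):
            positive[cluster] += 1
    return sorted(c for c in total if positive[c] / total[c] >= min_fraction)


# Toy input: cluster 0 has two CD8A/B-positive cells out of four, cluster 1 none.
cells = [{"CD8A": 1}, {}, {"CD8B": 2}, {}, {"CD4": 3}, {"CD4": 1}]
cluster_ids = [0, 0, 0, 0, 1, 1]
print(cd8_clusters(cells, cluster_ids))
```

Deciding per cluster rather than per cell means that a cell with zero CD8A/B counts is still kept as long as enough of its cluster neighbours express the genes.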
@grst, you may make some FPs due to dropout (false null expression of CD8A/B).
You could check this potential issue in the sorted CD8+ cells from https://www.nature.com/articles/ncomms14049
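One way to quantify this, sketched under the assumption that the sorted CD8+ cells have already been loaded as `{gene: count}` dicts (loading is not shown, and the function name is made up): count how many cells show no CD8A or CD8B reads at all. Since every cell in a sorted CD8+ population should be positive, that fraction is a direct estimate of marker dropout.

```python
def marker_dropout_rate(cells, genes=("CD8A", "CD8B")):
    """Fraction of cells with zero counts for every gene in `genes`."""
    if not cells:
        raise ValueError("need at least one cell")
    missing = sum(1 for cell in cells if all(cell.get(g, 0) == 0 for g in genes))
    return missing / len(cells)


# Toy example: two of four "sorted CD8+" cells show no CD8A/B counts.
sorted_cd8 = [{"CD8A": 2}, {"CD8B": 1}, {}, {"GZMK": 3}]
print(marker_dropout_rate(sorted_cd8))
```

A high rate would argue against selecting individual cells by CD8A/B expression and in favour of selecting whole clusters or training a classifier.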
Will build something along the lines of Schelker et al. now.
The approach is hierarchical:
Resources:
Francesca pointed me to moana (https://www.biorxiv.org/content/biorxiv/early/2018/10/30/456129.full.pdf).
However, the models are dataset-specific and the training data is obtained from manual clustering (for which they also provide a framework).
==> I will stick to my marker-gene based approach.
(Could be interesting to make use of the kNN-smoothing though)
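For reference, the basic idea behind smoothing over nearest neighbours can be sketched as below. This is a deliberately simplified, single-pass illustration and not the published kNN-smoothing algorithm or moana's implementation: each cell's counts are summed with those of its `k` closest cells, with distances taken on log-transformed counts. The function name and the toy matrix are assumptions.

```python
import math


def knn_smooth(counts, k):
    """Sum each cell's counts with those of its k nearest cells.

    `counts` is a list of per-cell count vectors in a shared gene order.
    Neighbours are found by Euclidean distance on log1p-transformed counts.
    """
    logged = [[math.log1p(x) for x in cell] for cell in counts]
    smoothed = []
    for i, cell in enumerate(logged):
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(cell, other)), j)
            for j, other in enumerate(logged)
            if j != i
        )
        members = [i] + [j for _, j in ranked[:k]]
        smoothed.append(
            [sum(counts[j][g] for j in members) for g in range(len(cell))]
        )
    return smoothed


# Toy matrix, gene order: CD8A, NKG7, CD79A. The first cell has no CD8A counts.
counts = [
    [0, 5, 0],
    [2, 4, 0],
    [0, 0, 6],
    [0, 1, 5],
]
print(knn_smooth(counts, k=1))
```

After smoothing, the first cell borrows CD8A counts from its neighbour, so a marker threshold applied afterwards would catch it. The flip side is that smoothing can just as well spread counts into cells that truly lack the gene, so `k` should stay small relative to the size of the rarest population.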
For development of the classifier, see https://github.com/grst/single_cell_classification
We need to assign a cell type to each single cell, in order to select interesting populations and to be able to test the data integration tools.
Possible methodologies:
@Hoohm, @FFinotello, do you have further ideas how to do this properly?