cleaning of individual datasets.

grst commented 6 years ago

Datasets need to be cleaned and normalized before Scanorama integration (#2). Based on these tutorials, I identified the following steps:

before merging

[x] filtering and diagnostic plots
- max genes
- min genes
- max mitochondrial fraction
[x] doublet detection
- [x] https://github.com/AllonKleinLab/scrublet (Issue) resolved.
- [x] https://github.com/JonathanShor/DoubletDetection
[?] denoising (e.g. DCA)
[x] clustering/quality of clustering (silhouette score)
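
As a rough illustration of the filtering step above, here is a minimal, library-agnostic sketch in plain Python. The `CellQC` record, the `passes_qc` helper and all threshold values are hypothetical placeholders, not values agreed on in this issue; in practice the cutoffs would be read off the diagnostic plots for each dataset.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC summary (hypothetical record, for illustration only)."""
    barcode: str
    n_genes: int      # genes with at least one count in this cell
    n_counts: int     # total counts in this cell
    mito_counts: int  # counts coming from mitochondrial genes

    @property
    def frac_mito(self) -> float:
        # Guard against empty cells to avoid division by zero.
        return self.mito_counts / self.n_counts if self.n_counts else 0.0


def passes_qc(cell, min_genes=200, max_genes=6000, max_frac_mito=0.2):
    """Apply the three filters: min genes, max genes, max mitochondrial fraction.

    The default thresholds are assumptions, not recommendations.
    """
    return (min_genes <= cell.n_genes <= max_genes
            and cell.frac_mito <= max_frac_mito)


cells = [
    CellQC("AAAC", n_genes=1500, n_counts=5000, mito_counts=250),   # passes
    CellQC("AAAG", n_genes=120, n_counts=300, mito_counts=10),      # too few genes
    CellQC("AACT", n_genes=2500, n_counts=8000, mito_counts=3200),  # high mito fraction
    CellQC("AAGT", n_genes=9000, n_counts=60000, mito_counts=600),  # too many genes
]
kept = [c.barcode for c in cells if passes_qc(c)]
```

Keeping the thresholds as function arguments makes it easy to use different cutoffs per dataset before merging.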

after merging

[x] normalization (per-cell)
- retain raw counts in anndata object.
[x] filtering for highly variable genes (should that be done at all? before integration? ...)
[x] regress out confounders
- percent_mito
- n_genes
- n_counts
- cell_cycle
-> Color the tSNE/UMAP plot by each of these factors after each step; ideally, we observe clustering by cell type, not by other factors.
[x] preliminary cell type identification (i.e. visualize marker genes on UMAP plot)
- in a first step, use marker genes from MCPcounter
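
To make the normalization and regression steps above more concrete, here is a small sketch in plain Python. It is not the pipeline used in this repository: `normalize_per_cell`, `regress_out`, the `target_sum` of 10,000 and the dict layout with `"raw"` and `"X"` keys are all assumptions for illustration. The regression handles a single covariate only, whereas a real pipeline would typically fit all confounders jointly.

```python
import math


def normalize_per_cell(counts, target_sum=10_000.0):
    """Scale each cell (row) to `target_sum` total counts, then log1p-transform.

    Returns a new matrix; the input is left untouched so raw counts are retained.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized


def regress_out(values, covariate):
    """Residuals of a simple least-squares fit `values ~ a + b * covariate`."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = cov_xy / var_x if var_x else 0.0
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(covariate, values)]


# Rows are cells, columns are genes (toy data).
raw_counts = [[10, 0, 30], [1, 1, 2], [0, 0, 0]]
dataset = {"raw": raw_counts, "X": normalize_per_cell(raw_counts)}

# Expression of one gene that depends linearly on a confounder (e.g. n_counts).
n_counts = [1.0, 2.0, 3.0, 4.0]
expression = [2.0 * x + 1.0 for x in n_counts]
residuals = regress_out(expression, n_counts)
```

After regression the residuals of a purely confounder-driven gene are close to zero, which is the behaviour one would hope to see reflected in the tSNE/UMAP plots.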

grst commented 6 years ago

@Hoohm, any other steps you would consider?

Hoohm commented 6 years ago

I would not recommend imputation, as it is always predicated on the quality of the clustering and rarely helps much.

As first steps, that seems fine to me. What do you want to do after that?

grst commented 6 years ago

next step would be to feed everything into scanorama to remove batch effects.

Hoohm commented 6 years ago

I forgot about defining a method (or methods) to compare clustering "quality". Off the top of my head, I know about silhouette plots.
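
For reference, a minimal sketch of how a per-cell silhouette value can be computed, in plain Python with Euclidean distance. The function name `silhouette_scores` and the toy coordinates are made up for illustration; on real data one would more likely rely on an existing library implementation and run it on a low-dimensional embedding.

```python
import math


def silhouette_scores(points, labels):
    """Silhouette value s = (b - a) / max(a, b) for every point.

    a: mean distance to the other points in the same cluster
    b: smallest mean distance to the points of any other cluster
    Points in singleton clusters get a score of 0 by convention.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two well-separated groups of toy "cells" in a 2D embedding.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_scores(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_scores(points, [0, 1, 0, 1])   # labels cut across the groups
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

A mean silhouette close to 1 indicates compact, well-separated clusters, while values near or below 0 suggest that the labels do not reflect the structure in the data.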

mlist commented 6 years ago

Silhouette value is a good start. If you know the cell type of individual cells, you can use the ontology score we proposed:

https://dx.doi.org/10.1093%2Fbioinformatics%2Fbty553

Might not work so well with cancer cells though. Also see the other methods referenced in the paper, in particular kBET from the Theis group:

https://doi.org/10.1101/200345

Best, Markus

On Wed, 31 Oct 2018, 22:44, Patrick Roelli notifications@github.com wrote:

Forgot about defining a method (or methods) to compare clustering "quality". From the top of my head I know about Silhouette plots

grst commented 6 years ago

Split this up into before merge/after merge (see https://github.com/grst/single_cell_data_integration/issues/3#issuecomment-439397642)
