Pipeline stages - overview

grst commented 5 years ago

0. gene expression quantification (#8)

for all datasets ~~where no counts are provided~~, we ~~need to~~ do the preprocessing from FASTQ files ourselves.
This way, the data are processed consistently and all datasets share the same gene identifiers (#4).

1. consistent format (`01_process_counts`)

bring all datasets into a consistent format that can easily be loaded into scanpy
- generate an MTX file for each dataset
all datasets have consistent gene identifiers (ideally ENSEMBL of the same version)
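
A minimal sketch of what "one MTX file per dataset" could look like, using `scipy.io.mmwrite` and `scipy.io.mmread`. The file names (`matrix.mtx`, `genes.tsv`, `barcodes.tsv`), the cells x genes orientation, the helper names and the toy values are assumptions for illustration, not the layout actually used in `01_process_counts`.

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy import io, sparse


def write_mtx_dataset(out_dir, counts, gene_ids, barcodes):
    """Write a count matrix (cells x genes) plus gene and barcode lists.

    Hypothetical helper: file names and orientation are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    io.mmwrite(str(out_dir / "matrix.mtx"), sparse.csr_matrix(counts))
    (out_dir / "genes.tsv").write_text("\n".join(gene_ids) + "\n")
    (out_dir / "barcodes.tsv").write_text("\n".join(barcodes) + "\n")


def read_mtx_dataset(in_dir):
    """Read back what write_mtx_dataset wrote."""
    in_dir = Path(in_dir)
    counts = sparse.csr_matrix(io.mmread(str(in_dir / "matrix.mtx")))
    gene_ids = (in_dir / "genes.tsv").read_text().split()
    barcodes = (in_dir / "barcodes.tsv").read_text().split()
    return counts, gene_ids, barcodes


# Toy data: 3 cells x 4 genes, identified by ENSEMBL gene IDs.
toy_counts = np.array([[0, 2, 0, 1], [5, 0, 0, 0], [0, 0, 3, 0]])
toy_genes = [
    "ENSG00000000003",
    "ENSG00000000005",
    "ENSG00000000419",
    "ENSG00000000457",
]
toy_barcodes = ["cell_1", "cell_2", "cell_3"]

with tempfile.TemporaryDirectory() as tmp:
    write_mtx_dataset(tmp, toy_counts, toy_genes, toy_barcodes)
    loaded_counts, loaded_genes, loaded_barcodes = read_mtx_dataset(tmp)
```

On the scanpy side, `sc.read_mtx` should be able to load such a matrix file; whether a transpose is needed depends on the orientation chosen when writing.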

2. data cleaning (`02_data_cleaning`) (#5)

filter each dataset individually (min/max genes, percent_mito, ...)
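
As a rough illustration of the per-dataset filter, here is a NumPy-only sketch. The thresholds, the helper name `qc_mask` and the convention of recognising mitochondrial genes by an `MT-` symbol prefix are all assumptions; with ENSEMBL identifiers a lookup table would be needed instead.

```python
import numpy as np


def qc_mask(counts, gene_names, min_genes=2, max_genes=4, max_pct_mito=20.0):
    """Return a boolean mask of the cells that pass QC in one dataset.

    counts: dense array, cells x genes. All thresholds are hypothetical.
    """
    counts = np.asarray(counts, dtype=float)
    # Number of genes detected (count > 0) in each cell.
    n_genes = (counts > 0).sum(axis=1)
    # Assumption: mitochondrial genes carry an "MT-" symbol prefix.
    is_mito = np.array([g.upper().startswith("MT-") for g in gene_names])
    total = counts.sum(axis=1)
    # Guard against empty cells to avoid a division by zero.
    pct_mito = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1.0)
    return (n_genes >= min_genes) & (n_genes <= max_genes) & (pct_mito <= max_pct_mito)


toy_genes = ["CD3E", "CD8A", "MT-CO1", "MT-ND1", "GAPDH"]
toy_counts = np.array(
    [
        [3, 2, 0, 1, 4],  # 4 genes, 10% mito: kept
        [1, 0, 0, 0, 0],  # only 1 gene detected: dropped
        [1, 1, 5, 5, 0],  # ~83% mito: dropped
        [2, 1, 1, 1, 1],  # 5 genes, above max_genes: dropped
    ]
)
mask = qc_mask(toy_counts, toy_genes)
filtered = toy_counts[mask]
```

In scanpy the same kind of filter is usually built from `sc.pp.filter_cells` and the per-cell QC columns in `adata.obs`.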

3. data merging and confounder removal (#5)

regress out cell cycle
normalize
transformations (e.g. log)
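
A small NumPy sketch of these three operations, assuming library-size normalization, a `log1p` transform and a plain least-squares fit per gene for the regression. The order used here (normalize, log, then regress out) follows the common scanpy workflow and is an assumption, as are the function names, the `target_sum` value and the toy cell cycle score.

```python
import numpy as np


def normalize_total(counts, target_sum=1e4):
    """Scale every cell (row) so that its counts sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1.0) * target_sum


def regress_out(values, covariate):
    """Remove the linear effect of one per-cell covariate from every gene."""
    # Design matrix with an intercept column and the covariate.
    design = np.column_stack([np.ones(len(covariate)), covariate])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    # The residuals are what remains after subtracting the fitted values.
    return values - design @ coef


toy_counts = np.array([[1, 3, 0], [2, 2, 4], [0, 5, 5], [6, 1, 3]])
cell_cycle_score = np.array([0.1, 0.9, 0.4, 0.7])  # hypothetical per-cell score

normalized = normalize_total(toy_counts)
logged = np.log1p(normalized)
residuals = regress_out(logged, cell_cycle_score)
```

The residuals are centred per gene and uncorrelated with the covariate, which also means the matrix is dense and no longer contains exact zeros.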

4. batch effect removal (#7)

5. clustering, cell type identification, ...

cell type annotation (#12)
trajectory inference (PAGA, monocle2, MERLot, ...) (#15)

mlist commented 5 years ago

Have you seen this one? https://www.ncbi.nlm.nih.gov/pubmed/29608177

grst commented 5 years ago

Yes, from what I understood, scanorama (https://www.biorxiv.org/content/early/2018/07/17/371179) is actually a generalization of that approach. The limitations of MNN are that it

depends on the order in which the datasets are integrated
does not work well unless at least one cell population is shared across all datasets.
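
To make the second point concrete, here is a toy sketch of the pairing step only: finding mutual nearest neighbours between two batches with brute-force Euclidean distances. It is not the published MNN correction, and the function name, `k` and the coordinates are made up for illustration.

```python
import numpy as np


def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return sorted (i, j) pairs where a_i and b_j are among each other's k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dist = np.linalg.norm(batch_a[:, None, :] - batch_b[None, :, :], axis=2)
    nn_of_a = np.argsort(dist, axis=1)[:, :k]    # for each cell in a: nearest cells in b
    nn_of_b = np.argsort(dist, axis=0)[:k, :].T  # for each cell in b: nearest cells in a
    pairs = {
        (i, int(j))
        for i in range(len(batch_a))
        for j in nn_of_a[i]
        if i in nn_of_b[j]
    }
    return sorted(pairs)


# Two batches that share two populations (near the origin and near (5, 5)).
batch_a = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
batch_b = np.array([[0.1, 0.1], [5.2, 5.1], [9.0, 9.0]])
shared_pairs = mutual_nearest_neighbors(batch_a, batch_b)

# A batch with no population in common with batch_a still yields a pair,
# because the globally closest pair of cells is always mutual.
batch_c = np.array([[100.0, 100.0], [101.0, 100.0]])
spurious_pairs = mutual_nearest_neighbors(batch_a, batch_c)
```

In this toy setup `spurious_pairs` is not empty even though `batch_a` and `batch_c` have nothing in common, so a correction based on those pairs would align unrelated cells.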

mlist commented 5 years ago

ok, great.

Hoohm commented 5 years ago

This just came out. Seems interesting: https://www.biorxiv.org/content/early/2018/10/31/457879

grst commented 5 years ago

Claims to be even better and faster than scanorama: Harmony

https://www.biorxiv.org/content/biorxiv/early/2018/11/05/461954.full.pdf?%3Fcollection=

grst commented 5 years ago

It probably makes sense to merge the datasets at an earlier stage:

apply filtering (min/max genes, percent_mito, ...)
merge, using outer join, into single adata object
- at this point, zeros are still preserved in the data
- at this point, we have not lost any genes that contain only zeros
- if we merge after regressing out confounders, we won't have zeros any more and cannot do an outer join, losing genes that might be specific to a certain cell type only.
filter for highly variable genes
- this is a standard processing step that is required by most batch effect removal tools and will speed up downstream analysis.
- this filtering step needs to be done on the merged data, because some datasets contain only T cells: within those datasets, T cell markers would be removed as not variable, although they are highly relevant for the analysis.
regress out confounders
- hopefully this will work on the merged dataset as well; I don't see a reason why it should not.
last, apply batch effect removal tools
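
A toy sketch of the merge and the variable-gene argument, using pandas instead of AnnData so that it stays self-contained. The count tables, the gene symbols and the plain variance ranking are assumptions; the ranking is only a crude stand-in for a proper highly-variable-gene method such as `sc.pp.highly_variable_genes`.

```python
import numpy as np
import pandas as pd

# Hypothetical count tables (cells x genes) with partly different genes.
tcell_only = pd.DataFrame(
    [[5, 4, 1], [6, 5, 0]],
    index=["d1_cell1", "d1_cell2"],
    columns=["CD3E", "CD8A", "GAPDH"],
)
mixed = pd.DataFrame(
    [[0, 7, 1], [4, 0, 1]],
    index=["d2_cell1", "d2_cell2"],
    columns=["CD3E", "CD19", "GAPDH"],
)

# Outer join: keep the union of genes. A gene missing from one dataset
# becomes NaN, which is filled with a zero count. This is only sensible
# while the data are still raw counts.
merged = pd.concat([tcell_only, mixed], axis=0, join="outer").fillna(0).astype(int)


def top_variable_genes(counts, n_top):
    """Rank genes by the variance of log1p counts (crude stand-in for HVG selection)."""
    log_counts = np.log1p(counts)
    variances = log_counts.var(axis=0).sort_values(ascending=False)
    return list(variances.index[:n_top])


top_in_tcells = top_variable_genes(tcell_only, 1)
top_in_merged = top_variable_genes(merged, 3)
```

Within the T-cell-only table the marker `CD3E` barely varies and would be filtered out, while on the merged table it ranks among the most variable genes. With AnnData objects, `anndata.concat(..., join="outer")` plays the role of the `pd.concat` call.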

Will update the overview at the top of this issue shortly.

Hoohm commented 5 years ago

I approve of the early merging. This might help us with filtering downstream.

grst / single_cell_data_integration