grst / single_cell_data_integration

Pipeline stages - overview #3

Closed grst closed 5 years ago

grst commented 5 years ago

0. gene expression quantification (#8)

1. consistent format (01_process_counts)

2. data cleaning (02_data_cleaning) (#5)

filter each dataset individually (min/max genes, percent_mito, ...)

3. data merging and confounder removal (#5)

4. batch effect removal (#7)

5. clustering, cell type identification, ...
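
To make step 2 concrete, here is a minimal sketch of the per-dataset filter on a plain cells x genes count matrix. The function name `qc_mask`, the thresholds, and the assumption that mitochondrial genes carry an "MT-" prefix (the human gene symbol convention) are illustrative only and not taken from the pipeline code.

```python
import numpy as np


def qc_mask(counts, gene_names, min_genes, max_genes, max_pct_mito):
    """Return a boolean mask of the cells (rows) that pass the QC thresholds.

    Hypothetical helper: thresholds and the "MT-" prefix are assumptions.
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([g.upper().startswith("MT-") for g in gene_names])

    n_genes = (counts > 0).sum(axis=1)       # detected genes per cell
    total = counts.sum(axis=1)               # total counts per cell
    mito = counts[:, is_mito].sum(axis=1)    # mitochondrial counts per cell

    # Fraction of mitochondrial counts; empty cells get 0 instead of NaN.
    pct_mito = np.divide(mito, total, out=np.zeros_like(total), where=total > 0)

    return (n_genes >= min_genes) & (n_genes <= max_genes) & (pct_mito <= max_pct_mito)


# Toy data: 4 cells x 4 genes.
genes = ["CD3E", "CD8A", "MT-CO1", "GNLY"]
toy_counts = [
    [5, 3, 1, 0],  # 3 genes, ~11% mito: kept
    [1, 0, 0, 0],  # only 1 gene detected: dropped
    [1, 1, 8, 0],  # 80% mito: dropped
    [2, 2, 1, 4],  # 4 genes, above max_genes: dropped
]
mask = qc_mask(toy_counts, genes, min_genes=2, max_genes=3, max_pct_mito=0.2)
```

The mask is computed per dataset, so each dataset can get its own thresholds before anything is merged.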

mlist commented 5 years ago

Have you seen this one? https://www.ncbi.nlm.nih.gov/pubmed/29608177

grst commented 5 years ago

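yes, scanorama (https://www.biorxiv.org/content/early/2018/07/17/371179) is actually a generalization of that approach from what I understood. The limitation of MNN is that it

  • depends on the order in which the datasets are integrated
  • does not work well unless at least one cell population is shared across all datasets.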

mlist commented 5 years ago

ok, great.

On Thu, 25 Oct 2018 at 10:10 Gregor Sturm notifications@github.com wrote:

yes, scanorama (https://www.biorxiv.org/content/early/2018/07/17/371179) is actually a generalization of that approach from what I understood. The limitation of MNN is that it

  • depends on the order in which the datasets are integrated
  • does not work well unless at least one cell population is shared across all datasets.

Hoohm commented 5 years ago

This just came out. Seems interesting: https://www.biorxiv.org/content/early/2018/10/31/457879

grst commented 5 years ago

Claims to be even better and faster than scanorama: Harmony

https://www.biorxiv.org/content/biorxiv/early/2018/11/05/461954.full.pdf?%3Fcollection=

grst commented 5 years ago

It probably makes sense to merge the datasets at an earlier stage:

  1. apply filtering (min/max genes, percent_mito, ...)
  2. merge, using outer join, into single adata object
    • at this point, zeros are still preserved in the data
    • at this point, we have not lost any genes that contain only zeros
    • if we merge after regressing out confounders, we won't have zeros any more and cannot do an outer join, so we would lose genes that might be specific to a single cell type.
  3. filter for highly variable genes
    • this is a standard processing step that is required by most batch effect removal tools and will speed up downstream analysis processes.
    • this filtering step needs to run on the merged data: some datasets contain only T cells, so within those datasets T cell markers would be removed as not variable, although they are highly relevant for the analysis.
  4. regress out confounders
    • hopefully this will work out on the entire dataset as well. Don't see a reason why it should not, though.
  5. last, apply batch effect removal tools
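
A rough sketch of steps 2 and 3 on toy data, using pandas DataFrames (cells x genes) instead of adata objects to keep it self-contained. The dataset names, gene choices, counts, and the helper `top_variable_genes` are all made up for illustration, and ranking by variance of log-transformed counts is only a simplified stand-in for a proper highly-variable-gene method.

```python
import numpy as np
import pandas as pd

# Hypothetical, already filtered datasets (step 1 done).
# "tcells" contains only T cells and never measured CD19.
tcells = pd.DataFrame(
    {"CD3E": [5, 7, 6], "ACTB": [9, 2, 14]},
    index=["t1", "t2", "t3"],
)
mixed = pd.DataFrame(
    {"CD3E": [6, 0, 0], "CD19": [0, 8, 5], "ACTB": [8, 3, 12]},
    index=["m1", "m2", "m3"],
)

# Step 2: outer join on genes. Genes missing from a dataset are filled
# with zeros, which is only meaningful while the data are still raw counts.
merged = (
    pd.concat(
        [tcells, mixed],
        keys=["tcells", "mixed"],
        names=["dataset", "cell"],
        join="outer",
    )
    .fillna(0)
    .astype(int)
)


def top_variable_genes(counts, n_top):
    """Rank genes by variance of log1p counts (simplified stand-in for HVG selection)."""
    variances = np.log1p(counts).var(axis=0)
    return variances.sort_values(ascending=False).index[:n_top].tolist()


# Step 3: select variable genes on the merged data, not per dataset.
hvg_tcells_only = top_variable_genes(tcells, n_top=1)
hvg_merged = top_variable_genes(merged, n_top=2)

# Steps 4 and 5 (regressing out confounders, batch effect removal)
# would follow here and are not sketched.
```

In the T-cell-only dataset CD3E is nearly constant and is not selected, while on the merged data it is among the most variable genes. That is the reason for doing the HVG filtering after the merge.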

Will update the overview at the top of this issue shortly.

Hoohm commented 5 years ago

I approve of the early merging. This might also help us with filtering downstream.