Evaluation criteria - Githubissues

grst commented 3 years ago

Here's some ideas what I think we should evaluate, please add more if you have some!

Performance

Correlation with simulation
Correlation with gold standard (facs)
Spillover benchmark
Effect of data type (smartseq2 vs. 10x vs. ...)
Effect of missing cell-type in reference
Effect of discarding "weird" cell-types
Coarse vs. fine deconvolution (is it more robust to build a signature matrix with only a T cell signature, or with T cell CD4 and T cell CD8 and sum it up later)?
Balanced vs. unbalanced data
How many cells are required? --> subsampling

Other metrics

Runtime for training and deconvolution / scalability
- per size of single cell dataset
- per bulk sample
Memory usage
- per size of single cell dataset
- per bulk sample
"Usability metrics" see scib.

Other benchmarks

for inspiration: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02290-6

Summary figures

I really like how this is visualized in scib or the dynverse paper

FFinotello commented 3 years ago

I indeed love the dynverse resources: wonderful model :)

As for the scenarios, I would add:

Counts, TPM/normcounts
Tissue and disease context (e.g. blood signature vs. tumor mixture)
Various organisms (e.g., human, mouse)
Total counts (~sequencing depth)
Total number of different cell types
mRNA bias

Could you please elaborate the following aspects? Thanks!

Coarse vs. fine deconvolution
Balanced vs. unbalanced data

grst commented 3 years ago

Coarse vs. fine deconvolution

For instance, how well do we do when separating T-cells from B cells, versus separating regulatory CD4+ T cells from non-regulatory CD4+ T-cells

Balanced vs. unbalanced data

Do we get similar results when we have the same number of cells for each cell-type in the single-cell dataset compared to when we leave the original proportions?

FFinotello commented 3 years ago

Perfect! Then there is nothing else to add... for now ;)

grst commented 3 years ago

@kathireinisch, feeling daunted already? :stuck_out_tongue:

kathireinisch commented 3 years ago

@kathireinisch, feeling daunted already? 😛

yup haha! good thing i have more than a month to mentally prepare for all this :D

FFinotello commented 3 years ago

Do we get similar results when we have the same number of cells for each cell-type in the single-cell dataset compared to when we leave the original proportions?

Actually this (like the number of cells per cell type) can be both a feature o the (subsampled) single-cell dataset to build the reference matrix and simulated pseudobulk :)

kathireinisch commented 2 years ago

Since i don't know whether i can make it to the meeting next week due to an exam, i just ordered all thoughts/results/questions/... in a powerpoint presentation. i thought collecting everything here would be easier than creating multiple issues Intermediate_results.pptx It's a lot so i totally understand if you don't have the time to go through all of it in the next days, but i'd really appreciate feedback to at least a few of the issues 🙈 we could discuss the other things next week (if i can make it), but clarifying some things sooner than that would be really helpful! 😃

FFinotello commented 2 years ago

I am not sure I understand all content correctly. I think it would be better to discuss this live when you are available :)

mlist commented 2 years ago

Hey @kathireinisch it looks like you have covered a lot of ground here! I agree with @FFinotello though, it will be best to discuss this with you in person. Could you highlight which are the most pressing questions you have? It is also fine for me if you focus on your exam first and we get back to this afterwards.

kathireinisch commented 2 years ago

Okay that works, thank you! I guess the most important question is whether or not i'm currently using the correct single cell matrices (slide 4). In case i've been using the wrong ones i could rerun the pipeline again and we might already see some meaningful results. That would also solve the "effect of datatype" (slide 12) - should i use Maynard X_length_scaled and lambrechts raw counts for best comparability?

mlist commented 2 years ago

Thanks, we will discuss this with the group tomorrow and get back to you.

grst commented 2 years ago

In general

raw counts 10x equivalent of SmartSeq2 length scaled
10x CPM equivalent of SmartSeq2 TPM

I'm not sure if we can actually answer what's best. It likely depends on the method (some methods probably document what they expect) and could be another variable to be tested.

In this paper they tested different preprocessing strategies for 1st generation methods.

FFinotello commented 2 years ago

I agree with Gregor that, even without going into deep evaluation, we should at least test the effect of having counts vs. TPM in the input single-cell and bulk RNA-seq data (count-count, TPM-TPM, count-TPM, etc).

For 10x, you can't really compute TPM but CPM, as Gregor wrote

FFinotello commented 2 years ago

A quick comment on this criterium:

How many cells are required? --> subsampling

This might depend on the cell type annotation (e.g. would need more cells for heterogeneous clusters, fewer for homogeneous ones). Might be worth testing the extreme cases.

FFinotello commented 2 years ago

@kathireinisch, ine aspect we should not forget to evaluate is signature swapping between methods (wherever possible), to decouple the impact of signature building from deconvolution.

FFinotello commented 2 years ago

Note for me: some parameters settings could be also interesting to evaluate

kathireinisch commented 2 years ago

@kathireinisch, ine aspect we should not forget to evaluate is signature swapping between methods (wherever possible), to decouple the impact of signature building from deconvolution.

didn't we decide to let @constantin-zackl evaluate this? i remember we were talking about this when constantin started his practical phase and iirc we decided the signature swapping to also be a part of his bachelor thesis! but i'm also fine with including it in the pipeline if i could skip some other points instead :)

FFinotello commented 2 years ago

Constantin in focusing on the signatures themselves, not on the deconvolution results (which may change when swapping signatures). It is definitely a feature to be evaluated in the benchmarking, but maybe we could discuss together with @mlist about who will carry out this sub-task. This evaluation could be also performed at a later stage, but I prefer to have it noted down here so we do not forget ;)

FFinotello commented 2 years ago

Note for me: some parameters settings could be also interesting to evaluate

More structured comment on the parameter settings that it might be worth evaluating (tag: @kathireinisch @LorenzoMerotto )

Music: batch_ids, Center, Normalise
Scaden: Learning rate
AutogeneS: Model
CIBERSORTx: Batch correction
DWLS: Minimum fold change

omnideconv / deconvBench

Evaluation criteria #1

Performance

Other metrics

Other benchmarks

Summary figures