Closed: grst closed this issue 7 months ago
I indeed love the dynverse resources: wonderful model :)
As for the scenarios, I would add:
Could you please elaborate on the following aspects? Thanks!
Coarse vs. fine deconvolution
For instance, how well do we do when separating T cells from B cells, versus separating regulatory CD4+ T cells from non-regulatory CD4+ T cells?
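To make the coarse-vs-fine comparison concrete, the same estimates could be evaluated at two annotation granularities by aggregating fine labels into coarse lineages. A minimal sketch (the fine-to-coarse mapping below is illustrative, not a fixed ontology):

```python
# Illustrative fine-to-coarse mapping; a real benchmark would derive this
# from the dataset's annotation hierarchy.
FINE_TO_COARSE = {
    "B cell": "B cell",
    "CD4+ Treg": "T cell",
    "CD4+ Tconv": "T cell",
    "CD8+ T cell": "T cell",
}

def coarsen(fractions):
    """Aggregate fine-grained cell-type fractions into coarse lineages."""
    coarse = {}
    for cell_type, frac in fractions.items():
        lineage = FINE_TO_COARSE[cell_type]
        coarse[lineage] = coarse.get(lineage, 0.0) + frac
    return coarse

fine = {"B cell": 0.25, "CD4+ Treg": 0.25, "CD4+ Tconv": 0.25, "CD8+ T cell": 0.25}
print(coarsen(fine))  # {'B cell': 0.25, 'T cell': 0.75}
```

Scoring the same run on both `fine` and `coarsen(fine)` separates "can the method tell lineages apart" from "can it resolve closely related subtypes".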
Balanced vs. unbalanced data
Do we get similar results when we have the same number of cells for each cell-type in the single-cell dataset compared to when we leave the original proportions?
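This could be tested by subsampling the annotated single-cell dataset in two modes before building the reference: keeping the original proportions vs. drawing the same number of cells per type. A rough pandas sketch (the `cell_type` column name is an assumption):

```python
import pandas as pd

def subsample(cells, n_per_type=None, seed=0):
    """Subsample a single-cell annotation table.

    n_per_type=None draws a random subset, keeping the original cell-type
    proportions in expectation; an integer draws (up to) the same number
    of cells per type, yielding a balanced reference.
    """
    if n_per_type is None:
        return cells.sample(frac=0.5, random_state=seed)
    return (cells.groupby("cell_type", group_keys=False)
                 .apply(lambda g: g.sample(n=min(n_per_type, len(g)),
                                           random_state=seed)))

cells = pd.DataFrame({"cell_id": range(6),
                      "cell_type": ["T"] * 4 + ["B"] * 2})
balanced = subsample(cells, n_per_type=2)
print(balanced["cell_type"].value_counts().to_dict())  # two cells of each type
```

Comparing deconvolution results from both variants of the reference would answer the balanced-vs-unbalanced question directly.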
Perfect! Then there is nothing else to add... for now ;)
@kathireinisch, feeling daunted already? :stuck_out_tongue:
Yup haha! Good thing I have more than a month to mentally prepare for all this :D
> Do we get similar results when we have the same number of cells for each cell-type in the single-cell dataset compared to when we leave the original proportions?
Actually this (like the number of cells per cell type) can be a feature of both the (subsampled) single-cell dataset used to build the reference matrix and the simulated pseudobulk :)
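For the pseudobulk side, one minimal NumPy sketch of simulating a sample with known cell-type proportions (toy data; function and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_pseudobulk(counts, cell_types, proportions, n_cells=100):
    """Build one pseudobulk sample by summing counts of randomly drawn cells.

    counts:      (cells x genes) count matrix
    cell_types:  label per cell
    proportions: dict cell_type -> fraction of the n_cells to draw
    Returns the pseudobulk vector and the realised ground-truth fractions.
    """
    chosen = []
    for ct, frac in proportions.items():
        pool = np.flatnonzero(cell_types == ct)
        k = int(round(frac * n_cells))
        chosen.extend(rng.choice(pool, size=k, replace=True))
    chosen = np.asarray(chosen)
    bulk = counts[chosen].sum(axis=0)
    truth = {ct: float(np.mean(cell_types[chosen] == ct)) for ct in proportions}
    return bulk, truth

counts = rng.integers(0, 10, size=(50, 5))
cell_types = np.array(["T"] * 30 + ["B"] * 20)
bulk, truth = simulate_pseudobulk(counts, cell_types, {"T": 0.7, "B": 0.3}, n_cells=10)
print(truth)  # {'T': 0.7, 'B': 0.3}
```

Both the number of cells drawn per type and the per-type pool size in the reference can then be varied independently.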
Since I don't know whether I can make it to the meeting next week due to an exam, I ordered all thoughts/results/questions/... in a PowerPoint presentation; I thought collecting everything here would be easier than creating multiple issues: Intermediate_results.pptx. It's a lot, so I totally understand if you don't have the time to go through all of it in the next days, but I'd really appreciate feedback on at least a few of the issues 🙈 We could discuss the other things next week (if I can make it), but clarifying some things sooner than that would be really helpful! 😃
I am not sure I understand all the content correctly. I think it would be better to discuss this live when you are available :)
Hey @kathireinisch it looks like you have covered a lot of ground here! I agree with @FFinotello though, it will be best to discuss this with you in person. Could you highlight which are the most pressing questions you have? It is also fine for me if you focus on your exam first and we get back to this afterwards.
Okay, that works, thank you! I guess the most important question is whether or not I'm currently using the correct single-cell matrices (slide 4). In case I've been using the wrong ones, I could rerun the pipeline and we might already see some meaningful results. That would also settle the "effect of data type" question (slide 12): should I use Maynard X_length_scaled and Lambrechts raw counts for best comparability?
Thanks, we will discuss this with the group tomorrow and get back to you.
In general
I'm not sure if we can actually answer what's best. It likely depends on the method (some methods probably document what they expect) and could be another variable to be tested.
In this paper they tested different preprocessing strategies for first-generation methods.
I agree with Gregor that, even without going into deep evaluation, we should at least test the effect of having counts vs. TPM in the input single-cell and bulk RNA-seq data (count-count, TPM-TPM, count-TPM, etc).
For 10x data, you can't really compute TPM, only CPM, as Gregor wrote.
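As a reference point for the count-vs-TPM/CPM variants, a minimal NumPy sketch of the two normalisations (the samples-by-genes matrix orientation is an assumption):

```python
import numpy as np

def cpm(counts):
    """Counts per million: library-size scaling only (appropriate for 10x/UMI data)."""
    return counts / counts.sum(axis=1, keepdims=True) * 1e6

def tpm(counts, gene_lengths_kb):
    """Transcripts per million: divide by gene length first, then scale
    (meaningful for full-length protocols such as Smart-seq2)."""
    rpk = counts / gene_lengths_kb
    return rpk / rpk.sum(axis=1, keepdims=True) * 1e6

counts = np.array([[100.0, 300.0],
                   [ 50.0, 150.0]])
print(cpm(counts)[0])  # [250000. 750000.]
print(tpm(counts, np.array([1.0, 2.0]))[0])
```

Note that without length normalisation CPM and TPM differ gene-wise, so mixing them between the single-cell reference and the bulk input is itself a scenario worth testing.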
A quick comment on this criterion:
> How many cells are required? --> subsampling
This might depend on the cell type annotation (e.g. would need more cells for heterogeneous clusters, fewer for homogeneous ones). Might be worth testing the extreme cases.
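One way to probe this is a subsampling sweep per cell type: rebuild the mean-expression signature from n cells and check how quickly it converges to the all-cells signature. A sketch on toy data (names are hypothetical; heterogeneous clusters should converge more slowly):

```python
import numpy as np

rng = np.random.default_rng(1)

def signature_stability(counts, labels, cell_type, sizes):
    """Correlate the signature (mean expression) built from n subsampled
    cells with the signature built from all cells of that type."""
    pool = np.flatnonzero(labels == cell_type)
    full = counts[pool].mean(axis=0)
    result = {}
    for n in sizes:
        sub = rng.choice(pool, size=min(n, len(pool)), replace=False)
        result[n] = float(np.corrcoef(counts[sub].mean(axis=0), full)[0, 1])
    return result

counts = rng.integers(0, 10, size=(40, 20)).astype(float)
labels = np.array(["T"] * 40)
curve = signature_stability(counts, labels, "T", sizes=[5, 20, 40])
# correlation approaches 1 as n grows; at n=40 all cells are used
```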
@kathireinisch, one aspect we should not forget to evaluate is signature swapping between methods (wherever possible), to decouple the impact of signature building from deconvolution.
Note for me: some parameter settings could also be interesting to evaluate
> @kathireinisch, one aspect we should not forget to evaluate is signature swapping between methods (wherever possible), to decouple the impact of signature building from deconvolution.
Didn't we decide to let @constantin-zackl evaluate this? I remember we were talking about this when Constantin started his practical phase, and IIRC we decided that the signature swapping would also be part of his bachelor thesis! But I'm also fine with including it in the pipeline if I could skip some other points instead :)
Constantin is focusing on the signatures themselves, not on the deconvolution results (which may change when swapping signatures). It is definitely a feature to be evaluated in the benchmarking, but maybe we could discuss together with @mlist who will carry out this sub-task. This evaluation could also be performed at a later stage, but I prefer to have it noted down here so we do not forget ;)
> Note for me: some parameter settings could also be interesting to evaluate
More structured comment on the parameter settings that might be worth evaluating (tag: @kathireinisch @LorenzoMerotto):
- `batch_ids`
- Center
- Normalise
- Learning rate
- Model
- Batch correction
- Minimum fold change
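To run these systematically, the settings could be expanded into a grid of configurations. A small sketch with itertools (the parameter names and values here are placeholders, not the methods' actual option names):

```python
from itertools import product

# Hypothetical parameter grid; in the benchmark each method would get its
# own grid mirroring the options listed above.
grid = {
    "batch_correction": [True, False],
    "normalise": [True, False],
    "learning_rate": [1e-3, 1e-4],
}

# One dict per combination of settings, ready to pass to a benchmark run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 8
```

Keeping each run keyed by its config dict also makes it easy to attribute performance differences to individual settings afterwards.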
Here are some ideas for what I think we should evaluate; please add more if you have some!
- Performance
- Other metrics
- Other benchmarks (for inspiration: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02290-6)
- Summary figures (I really like how this is visualized in scib or the dynverse paper)
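The performance evaluation could start from the standard per-sample comparisons of estimated vs. true fractions; a minimal NumPy sketch (RMSE and Pearson correlation; names are illustrative):

```python
import numpy as np

def rmse(truth, estimate):
    """Root-mean-square error between true and estimated fractions."""
    truth, estimate = np.asarray(truth), np.asarray(estimate)
    return float(np.sqrt(np.mean((truth - estimate) ** 2)))

def pearson(truth, estimate):
    """Pearson correlation of true vs. estimated fractions."""
    return float(np.corrcoef(truth, estimate)[0, 1])

truth = np.array([0.1, 0.3, 0.6])
estimate = np.array([0.2, 0.3, 0.5])
print(round(rmse(truth, estimate), 3))  # 0.082
```

Computing these both per sample (across cell types) and per cell type (across samples) gives complementary views, since a method can rank samples well while being systematically biased for one cell type.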