Closed by LorenzoMerotto 9 months ago
@alex-d13 Let me know what you think considering my comments
I think splitting it up would be much nicer. So basically we have one process + script for each simulation setup, right? And the workflow would then consist of steps 1-7, where the outputs of steps 1-4 are concatenated into one long list of pseudo-bulks that we use as input for steps 5-7.
I think this way it is also easier to work on each simulation script independently, or to add new tests.
@alex-d13 I updated the part about the different resolutions; how does that sound to you?
I like it :)
Note for @LorenzoMerotto
Files that need to be added/Things to do:
Overview of the analyses we want to carry out on the simulated datasets:
[x] Spillover analysis: we simulate n datasets, where n is the number of cell types we have, and we deconvolute each of them independently using the signature matrix built on all cell types; any fraction predicted for a cell type other than the simulated one is spillover. We will do this on the LUNG data (see the first sketch below the list). Parameters required:
[x] Unknown cell content: the idea is that we take a few cell types (e.g. B cells, T cells and such) plus one cell type that will act as "unknown" content, i.e. one that is not part of the signature matrix. The samples will be simulated with an increasing fraction of unknown cell content. We will do this on the LUNG data (see the second sketch below the list). Parameters required:
[x] Impact of cell type resolution: in this case we could envision a multi-resolution deconvolution. We take the Lambrechts dataset and consider three levels of annotation.
We then simulate some datasets using the finest level, obtaining the samples plus the ground-truth fractions (the "facs"). The facs can then be combined to obtain the sample composition at the three different levels (see the third sketch below the list). Then, starting from the same single cells, we build the signature matrix using the three different levels of annotation -> we get three signature matrices to be used to deconvolve the same bulk. We can then compare each deconvolution result to the ground truth at the respective level of annotation. We could do this once for T cell subtypes and once for dendritic cell subtypes.
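A minimal sketch of the spillover setup in R, assuming a placeholder `simulate_bulk()` for the actual simulation routine and made-up cell type labels (not the real LUNG annotation):

```r
# One simulated dataset per cell type: every pseudo-bulk sample consists
# purely of that cell type, so any fraction predicted for another cell type
# is spillover. simulate_bulk() and the labels are placeholders.
cell_types <- c("B_cell", "T_cell", "NK_cell", "Monocyte")

spillover_datasets <- lapply(cell_types, function(ct) {
  # target composition: 100% of the current cell type, 0 for all others
  fractions <- setNames(as.numeric(cell_types == ct), cell_types)
  # simulate_bulk(sc_data, fractions, n_samples = 10)  # placeholder call
  fractions
})
names(spillover_datasets) <- cell_types
```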
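For the unknown-content setup, the target fractions could be generated like this (the known cell types, their ratios, and the 0-0.9 grid are assumptions):

```r
# Fixed known cell types plus an "unknown" type whose fraction increases
# across datasets; the known fractions are rescaled accordingly.
known <- c(B_cell = 0.5, T_cell = 0.5)   # relative fractions of the known types
unknown_levels <- seq(0, 0.9, by = 0.1)  # increasing unknown cell content

fractions <- t(sapply(unknown_levels, function(u) {
  c(known * (1 - u), unknown = u)
}))
rownames(fractions) <- paste0("unknown_", unknown_levels)
fractions  # one row of target fractions per simulated dataset
```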
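And for the resolution setup, combining the fine-level facs into coarser levels boils down to summing fractions over a label mapping; a minimal sketch (the mapping and numbers are made up, not the actual Lambrechts annotation):

```r
# facs_fine: samples x fine-level cell types, true fractions per sample
facs_fine <- matrix(
  c(0.2, 0.1, 0.3, 0.4,
    0.1, 0.2, 0.4, 0.3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample1", "sample2"),
                  c("T_CD4", "T_CD8", "B_cell", "Monocyte"))
)

# mapping from the fine to a coarser annotation level (made-up example)
fine_to_coarse <- c(T_CD4 = "T_cell", T_CD8 = "T_cell",
                    B_cell = "B_cell", Monocyte = "Monocyte")

# sum the fractions of all fine types that share a coarse label
facs_coarse <- t(rowsum(t(facs_fine), group = fine_to_coarse[colnames(facs_fine)]))
facs_coarse
#         B_cell Monocyte T_cell
# sample1    0.3      0.4    0.3
# sample2    0.4      0.3    0.3
```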
Now, what I have in mind is: we create an individual Nextflow process for each of these setups and run all of them sequentially in the simulation workflow. The problem is that some setups require more parameters than others, which would lead to many optional inputs for the `SimulateBulkNF.R` script. So we could instead create a dedicated R script for each simulation (e.g. `simulation_spillover.R`, `simulation_sensitivity.R`, etc.), which IMO would also be the cleaner solution overall. A sketch of such a script is below.
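To make the split concrete, here is a hypothetical skeleton of such a dedicated script; the option names and the `simulate_pure_bulk()` call are assumptions, not the actual interface:

```r
#!/usr/bin/env Rscript
# Hypothetical simulation_spillover.R: a setup-specific script only declares
# the parameters it actually needs, instead of piling optional inputs onto a
# single SimulateBulkNF.R. All option names here are assumptions.
suppressPackageStartupMessages(library(optparse))

opts <- parse_args(OptionParser(option_list = list(
  make_option("--sc_data",    type = "character", help = "single-cell dataset, e.g. LUNG"),
  make_option("--cell_types", type = "character", default = "B_cell,T_cell",
              help = "comma-separated cell types to simulate"),
  make_option("--n_samples",  type = "integer",   default = 50),
  make_option("--output_dir", type = "character", default = ".")
)))

cell_types <- strsplit(opts$cell_types, ",")[[1]]

# one pseudo-bulk dataset per cell type; each will be deconvolved independently
for (ct in cell_types) {
  message("Simulating spillover dataset for: ", ct)
  # simulate_pure_bulk() is a placeholder for the actual simulation routine:
  # saveRDS(simulate_pure_bulk(opts$sc_data, ct, opts$n_samples),
  #         file.path(opts$output_dir, paste0("spillover_", ct, ".rds")))
}
```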