omnideconv / SimBu

Simulate pseudo-bulk RNAseq samples from scRNAseq expression data
http://omnideconv.org/SimBu/
GNU General Public License v3.0
12 stars 1 forks source link

Package improvements #15

Closed alex-d13 closed 2 years ago

alex-d13 commented 2 years ago

Hi everyone,

Francesca and me had a discussion on some points we should/need to improve in the package. I will try to explain our issues and solutions here, please comment on anything unclear or if you would do it differently (@mlist):

OK, I hope i did not forget anything, I am looking forward to discuss these changes with you :) Alex

FFinotello commented 2 years ago

Hi @alex-d13

Thanks for the nice summary! Might be a good idea to update the manual and vignette accordingly, especially specifying the mandatory and optional arguments, as well as their default value (e.g. for total_read_counts I would consider something like "none" or NULL).

Cheers, Francesca

mlist commented 2 years ago

Hi @FFinotello, Alex and I just discussed these points. I agree with almost everything. For the first point I suggested that we use the Bioconductor ExpressionSet class which can handle multiple matrices by default (e.g., counts, TPM and others). Also we should stick to Bioconductor classes since I'd like us to submit this as a Bioconductor package.

I disagree with renaming census to expressed genes. Here I think we should stick to the original since the benchmark of the census and other methods is also part of the paper.

FFinotello commented 2 years ago

Hello!

Another current limitation of our approach could be this: we add a cell type-specific mRNA bias to counts and TPM... but we also have an approach to estimate mRNA scaling factors from counts, which implies that (at least) counts are already affected by mRNA bias.

If so, I would add to SimBu a parameter (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the from single-cell intrinsic bias measured with the "Census" approach from counts, not TPM, and then add to both counts and TPM the user-specified bias. Does this make sense?

As this is a delicate but important step, I would probably assess its impact thoroughly with the usual scatterplots of cell fractions where we can see the impact of of mRNA bias and its correction. We could subject to this analysis both TPM and counts, considering the following scenarios:

Does this make sense to you guys? @mlist @alex-d13 @grst

mlist commented 2 years ago

I'm not sure I get it. Could you elaborate on this in our next meeting?

alex-d13 commented 2 years ago

If so, I would add to SimBu a parameter (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the from single-cell intrinsic bias measured with the "Census" approach from counts, not TPM, and then add to both counts and TPM the user-specified bias. Does this make sense?

This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.

Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.

We could subject to this analysis both TPM and counts, considering the following scenarios:

I am also a little bit confused with this section, as Markus said maybe we can discuss this in the next meeting :)

FFinotello commented 2 years ago

This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.

It is a bit more complicated than this because:

Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.

This is a good idea. Looking at the data shared by Lorenzo, I am thinking that we could have in addition to Census also the following normalization scores (the user can then pick one):

And, sure we can discuss the scenarios! Also to check together my reasoning make sense :D

FFinotello commented 2 years ago

Hello! @alex-d13 and I have decided to go for a first assessment of the intrinsic mRNA bias... if any ;)

Meanwhile, I will report here a few suggestions I noted down when reading the first vignette:

alex-d13 commented 2 years ago

I find a bit confusing the nomenclature of the uniform and random distribution. Shall we rename uniform to even and random to `uniform (as this is the one actually sampled from a uniform distribution)?

I agree with the first change (uniform to even), but I would leave the random scenario name. After all i think the specific feature of it is that the cell type fractions are random. Maybe if we had included different distributions from which to sample for the random scenario, renaming it to the distributions name would be useful. For me the name uniform implies more that the cell types are spread out uniformly and not that it is sampled from the uniform distribution.

Is total_read_counts a number or can it be a vector of length == num. samples? Is unique_cell_type a single string or can be a vector of strings of length == num. samples?

Both of them are currently only a single value. If you want to generate simulations with different sequencing depth, you will have to run the function multiple times and then merge the results (I did add a method to merge simulations :))

What is simulation_vector? Could we give it a more meaningful name?

Thats where you specify the cell type fractions you want in the simulation. What could be a different name for this?

FFinotello commented 2 years ago

Thats where you specify the cell type fractions you want in the simulation. What could be a different name for this?

cell_type_fractions?! :)