Closed alex-d13 closed 2 years ago
Hi @alex-d13
Thanks for the nice summary! Might be a good idea to update the manual and vignette accordingly, especially specifying the mandatory and optional arguments, as well as their default value (e.g. for total_read_counts
I would consider something like "none" or NULL).
Cheers, Francesca
Hi @FFinotello, Alex and I just discussed these points. I agree with almost everything. For the first point I suggested that we use the Bioconductor ExpressionSet class which can handle multiple matrices by default (e.g., counts, TPM and others). Also we should stick to Bioconductor classes since I'd like us to submit this as a Bioconductor package.
I disagree with renaming census to expressed genes. Here I think we should stick to the original since the benchmark of the census and other methods is also part of the paper.
Hello!
Another current limitation of our approach could be this: we add a cell type-specific mRNA bias to counts and TPM... but we also have an approach to estimate mRNA scaling factors from counts, which implies that (at least) counts are already affected by mRNA bias.
If so, I would add to SimBu a parameter (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the from single-cell intrinsic bias measured with the "Census" approach from counts, not TPM, and then add to both counts and TPM the user-specified bias. Does this make sense?
As this is a delicate but important step, I would probably assess its impact thoroughly with the usual scatterplots of cell fractions where we can see the impact of of mRNA bias and its correction. We could subject to this analysis both TPM and counts, considering the following scenarios:
Does this make sense to you guys? @mlist @alex-d13 @grst
I'm not sure I get it. Could you elaborate on this in our next meeting?
If so, I would add to SimBu a parameter (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the from single-cell intrinsic bias measured with the "Census" approach from counts, not TPM, and then add to both counts and TPM the user-specified bias. Does this make sense?
This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.
Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.
We could subject to this analysis both TPM and counts, considering the following scenarios:
I am also a little bit confused with this section, as Markus said maybe we can discuss this in the next meeting :)
This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.
It is a bit more complicated than this because:
Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.
This is a good idea. Looking at the data shared by Lorenzo, I am thinking that we could have in addition to Census also the following normalization scores (the user can then pick one):
And, sure we can discuss the scenarios! Also to check together my reasoning make sense :D
Hello! @alex-d13 and I have decided to go for a first assessment of the intrinsic mRNA bias... if any ;)
Meanwhile, I will report here a few suggestions I noted down when reading the first vignette:
uniform
and random
distribution. Shall we rename uniform
to even
and random
to `uniform (as this is the one actually sampled from a uniform distribution)?spike_in
(cell fraction) simulated scenario could be maybe called controlled
? Also related arguments (e.g. spike_in_cell_type
and spike_in_amount
) should be updated accordingly.total_read_counts
a number or can it be a vector of length == num. samples?unique_cell_type
a single string or can be a vector of strings of length == num. samples?simulation_vector
? Could we give it a more meaningful name?I find a bit confusing the nomenclature of the uniform and random distribution. Shall we rename uniform to even and random to `uniform (as this is the one actually sampled from a uniform distribution)?
I agree with the first change (uniform to even), but I would leave the random scenario name. After all i think the specific feature of it is that the cell type fractions are random. Maybe if we had included different distributions from which to sample for the random scenario, renaming it to the distributions name would be useful. For me the name uniform implies more that the cell types are spread out uniformly and not that it is sampled from the uniform distribution.
Is total_read_counts a number or can it be a vector of length == num. samples? Is unique_cell_type a single string or can be a vector of strings of length == num. samples?
Both of them are currently only a single value. If you want to generate simulations with different sequencing depth, you will have to run the function multiple times and then merge the results (I did add a method to merge simulations :))
What is simulation_vector? Could we give it a more meaningful name?
Thats where you specify the cell type fractions you want in the simulation. What could be a different name for this?
Thats where you specify the cell type fractions you want in the simulation. What could be a different name for this?
cell_type_fractions
?! :)
Hi everyone,
Francesca and me had a discussion on some points we should/need to improve in the package. I will try to explain our issues and solutions here, please comment on anything unclear or if you would do it differently (@mlist):
dataset
currently thedataset
can only hold a single count matrix, this can either be TPM, CPM, raw or what ever else. We thought it would be useful to have two slots in thedataset
:count_matrix
andtpm_matrix
. For thetpm_matrix
we would expect TPM data, so we could add a small check function, to see if the counts of each cell could be TPM data. As a simple idea could be to check ifX <= tpm_counts <= 1e6
, whereX
is a new parameter set to7*1e5
by default.ncells
) and the sum of the counts of the already sampled cells (total_read_counts
). We found out that the current implementation of thetotal_read_counts
might remove the mRNA bias in the pseudo-bulk sample, so we came up with something new: We still have both arguments as above, but now you can either use onlyncells
orncells
withtotal_read_counts
combined: lets say we have a simulation vector with (B cells = 0.6, T cells = 0.4),ncells
=1000 andtotal_read_counts
=1e7. We will then sample 0.6 1000 B cells and 0.4 1000 T cells with replacement and create a matrixm
of all genes and 1000 cells. This is the standard approach when only usingncells
. Summing up the counts inm
per gene gives on pseudo-bulk sample (Nothing new until now). With the addition oftotal_read_counts
, we will now have two options:sum(pseudo_bulk_sample) > total_read_counts
: we rarefy the counts in the sample to the wanted sequencing depthsum(pseudo_bulk_sample) < total_read_counts
: we multiply the pseudo-bulk sample withk
to get to our wanted sequencing depthm
and create an additional matrixm_tpm
with the same cells, but using the TPM matrix from the dataset. Therefore we will have two pseudo-bulk samples, one based on counts and one based on TPM; but both with the same cells included.OK, I hope i did not forget anything, I am looking forward to discuss these changes with you :) Alex