Package improvements - GitHub issues

alex-d13 commented 2 years ago

Hi everyone,

Francesca and I had a discussion about some points we should or need to improve in the package. I will try to explain our issues and solutions here; please comment on anything that is unclear or that you would do differently (@mlist):

Store counts and TPMs in parallel in the dataset: Currently the dataset can only hold a single count matrix, which can be TPM, CPM, raw counts or whatever else. We thought it would be useful to have two slots in the dataset: count_matrix and tpm_matrix. For tpm_matrix we would expect TPM data, so we could add a small check function to see whether the counts of each cell could be TPM data. A simple idea would be to check if X <= tpm_counts <= 1e6, where X is a new parameter set to 7*1e5 by default.
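A minimal sketch of such a check, written in Python for illustration and not as SimBu code. It assumes that tpm_counts means the per-cell (column) sum of tpm_matrix; the names `column_sums` and `looks_like_tpm` and the small tolerance are hypothetical.

```python
def column_sums(matrix):
    """Sum a genes x cells matrix (list of gene rows) per cell."""
    return [sum(col) for col in zip(*matrix)]


def looks_like_tpm(cell_totals, lower=7e5, upper=1e6, tol=1e-6):
    """Return True if every per-cell total lies in [lower, upper].

    `lower` plays the role of X from the proposal (default 7*1e5).
    `tol` is an assumed relative slack for floating-point rounding.
    """
    return all(lower <= total <= upper * (1 + tol) for total in cell_totals)


# Two cells whose values sum to 1e6 each: plausible TPM data.
tpm_like = [[600000.0, 250000.0],
            [400000.0, 750000.0]]
# Two cells with small integer totals: clearly not TPM data.
raw_counts = [[12, 3],
              [7, 0]]

tpm_ok = looks_like_tpm(column_sums(tpm_like))
raw_ok = looks_like_tpm(column_sums(raw_counts))
```

Here `tpm_ok` is True and `raw_ok` is False. Whether a failed check should stop with an error or only warn is left open.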
Change sequencing depth calculation: We currently have two "stopping criteria" when cells are sampled for a pseudo-bulk sample: the number of already sampled cells (ncells) and the sum of the counts of the already sampled cells (total_read_counts). We found that the current implementation of total_read_counts might remove the mRNA bias in the pseudo-bulk sample, so we came up with something new. We still have both arguments as above, but now you can use either ncells alone or ncells combined with total_read_counts. Let's say we have a simulation vector with (B cells = 0.6, T cells = 0.4), ncells=1000 and total_read_counts=1e7. We will then sample 0.6 * 1000 B cells and 0.4 * 1000 T cells with replacement and create a matrix m of all genes and 1000 cells. This is the standard approach when only using ncells. Summing up the counts in m per gene gives one pseudo-bulk sample (nothing new so far). With the addition of total_read_counts, there are now three cases:
- sum(pseudo_bulk_sample) > total_read_counts: we rarefy the counts in the sample to the wanted sequencing depth
- sum(pseudo_bulk_sample) < total_read_counts: we multiply the pseudo-bulk sample by k = total_read_counts / sum(pseudo_bulk_sample) to get to our wanted sequencing depth
- else: do nothing
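The three cases could be sketched as follows. This is an illustration in Python, not the package implementation: `rarefy` and `adjust_depth` are hypothetical names, rarefaction is done here by drawing reads without replacement (one possible reading of "rarefy"), and expanding every read into a list is only practical for small examples.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample integer counts without replacement down to `depth` reads."""
    # One list entry per read, labelled with the index of its gene.
    reads = [gene for gene, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, int(depth))
    out = [0] * len(counts)
    for gene in kept:
        out[gene] += 1
    return out


def adjust_depth(pseudo_bulk, total_read_counts, rng=None):
    """Bring a pseudo-bulk sample to the requested sequencing depth."""
    rng = rng or random.Random(0)
    total = sum(pseudo_bulk)
    if total > total_read_counts:
        # Too deep: rarefy down to the wanted depth.
        return rarefy(pseudo_bulk, total_read_counts, rng)
    if 0 < total < total_read_counts:
        # Too shallow: multiply every gene by the same factor k.
        k = total_read_counts / total
        return [c * k for c in pseudo_bulk]
    # Already at the wanted depth (or empty): leave unchanged.
    return list(pseudo_bulk)


down = adjust_depth([50, 30, 20], 10)    # rarefied to 10 reads
up = adjust_depth([5, 3, 2], 100)        # scaled by k = 10
same = adjust_depth([5, 3, 2], 10)       # unchanged
```

Scaling up keeps the relative gene proportions exactly, while rarefying only preserves them in expectation.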
How to use TPMs and counts together: Let's say we have a dataset with both count and TPM data. When simulating a sample, we would take the approach explained above, based on the count data. We can store the cell IDs that make up the final matrix m and create an additional matrix m_tpm with the same cells, but using the TPM matrix from the dataset. We will therefore have two pseudo-bulk samples, one based on counts and one based on TPM, both containing the same cells.
Scaling of the pseudo-bulk sample: We should scale only the sample generated from TPMs to 1e6 and leave the count-based sample with the generated counts.
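A small Python sketch of the two points above, under stated assumptions: matrices are plain genes x cells lists, and the function names (`sample_cell_ids`, `pseudo_bulk`, `scale_to_million`) are hypothetical, not part of the package. The same sampled cell IDs are used for both matrices, and only the TPM-based sample is rescaled to 1e6.

```python
import random


def sample_cell_ids(cell_types, fractions, ncells, rng):
    """Sample cell indices with replacement according to cell type fractions."""
    ids = []
    for cell_type, fraction in fractions.items():
        pool = [i for i, t in enumerate(cell_types) if t == cell_type]
        ids.extend(rng.choices(pool, k=round(fraction * ncells)))
    return ids


def pseudo_bulk(matrix, cell_ids):
    """Sum a genes x cells matrix over the chosen cells, per gene."""
    return [sum(row[i] for i in cell_ids) for row in matrix]


def scale_to_million(sample):
    """Rescale a sample so that its values sum to 1e6."""
    total = sum(sample)
    return [v * 1e6 / total for v in sample]


rng = random.Random(42)
cell_types = ["B cell", "B cell", "T cell"]
counts = [[10, 20, 5], [0, 4, 5]]             # genes x cells, raw counts
tpm = [[1e6, 8e5, 5e5], [0.0, 2e5, 5e5]]      # same cells, TPM values

ids = sample_cell_ids(cell_types, {"B cell": 0.6, "T cell": 0.4},
                      ncells=10, rng=rng)
bulk_counts = pseudo_bulk(counts, ids)                # left as generated
bulk_tpm = scale_to_million(pseudo_bulk(tpm, ids))    # rescaled to 1e6
```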
Change Census to expressed_genes: Straightforward; do you think we should do more checks here?
Cell- and cell-type-specific scaling factors: The scaling factor calculations based on count data (number of reads and number of expressed genes) produce a value for each cell individually, which means we can rescale each cell. In contrast, approaches like EPIC give values per cell type. We could add an option for the count-based approaches to also calculate a value per cell type by taking the median of all values of that cell type. Not a big change, but it would be easy to implement and may be useful.
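Taking the median per cell type could look like this in a few lines of Python (an illustrative sketch; `cell_type_scaling_factors` and the example values are made up):

```python
from collections import defaultdict
from statistics import median


def cell_type_scaling_factors(cell_factors, cell_types):
    """Collapse per-cell scaling factors to one median value per cell type."""
    grouped = defaultdict(list)
    for factor, cell_type in zip(cell_factors, cell_types):
        grouped[cell_type].append(factor)
    return {cell_type: median(values) for cell_type, values in grouped.items()}


# Hypothetical per-cell factors, e.g. total reads per cell.
factors = [1200, 1500, 900, 4000, 4400]
types = ["B cell", "B cell", "B cell", "T cell", "T cell"]

per_type = cell_type_scaling_factors(factors, types)
```

With these values, `per_type` maps "B cell" to 1200 and "T cell" to 4200.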

OK, I hope I did not forget anything. I am looking forward to discussing these changes with you :) Alex

FFinotello commented 2 years ago

Hi @alex-d13

Thanks for the nice summary! It might be a good idea to update the manual and vignette accordingly, especially to specify the mandatory and optional arguments as well as their default values (e.g. for total_read_counts I would consider something like "none" or NULL).

Cheers, Francesca

mlist commented 2 years ago

Hi @FFinotello, Alex and I just discussed these points. I agree with almost everything. For the first point I suggested that we use the Bioconductor ExpressionSet class which can handle multiple matrices by default (e.g., counts, TPM and others). Also we should stick to Bioconductor classes since I'd like us to submit this as a Bioconductor package.

I disagree with renaming Census to expressed genes. Here I think we should stick to the original name, since the benchmark of Census and other methods is also part of the paper.

FFinotello commented 2 years ago

Hello!

Another current limitation of our approach could be this: we add a cell type-specific mRNA bias to counts and TPM... but we also have an approach to estimate mRNA scaling factors from counts, which implies that (at least) counts are already affected by mRNA bias.

If so, I would add a parameter to SimBu (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the intrinsic single-cell bias measured with the "Census" approach from counts, not TPM, and then add the user-specified bias to both counts and TPM. Does this make sense?

As this is a delicate but important step, I would probably assess its impact thoroughly with the usual scatterplots of cell fractions, where we can see the impact of mRNA bias and its correction. We could subject both TPM and counts to this analysis, considering the following scenarios:

- Simulation of bulk counts and TPM without adding mRNA bias (counts should show some "intrinsic" bias, while TPM should not);
- ... bulk counts and TPM with added Census bias (right bias for TPM and too much for counts?);
- ... bulk counts with removal of "intrinsic" bias, TPM as they are (right bias for both?).

Does this make sense to you guys? @mlist @alex-d13 @grst

mlist commented 2 years ago

I'm not sure I get it. Could you elaborate on this in our next meeting?

alex-d13 commented 2 years ago

> If so, I would add a parameter to SimBu (e.g. remove_count_bias, set to TRUE by default) to first remove (not add!) the intrinsic single-cell bias measured with the "Census" approach from counts, not TPM, and then add the user-specified bias to both counts and TPM. Does this make sense?

This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.

Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.

> We could subject both TPM and counts to this analysis, considering the following scenarios:

I am also a little bit confused by this section; as Markus said, maybe we can discuss this in the next meeting :)

FFinotello commented 2 years ago

> This would mean we will not offer Census as an option to add a bias, right? Otherwise we would first remove a bias with Census and then add the same again later.

It is a bit more complicated than this because:

- For TPM, we do not remove the bias but only add it;
- For counts, bias is always removed (when requested) from single cells, whereas the addition of Census bias can be done per cell or per cell type;
- The user can request a different way of removing count bias;
- ... we have to test the options above and see whether my thoughts make sense ;)

> Also, I would maybe calculate Census on the TPMs (if the user uploaded them), since Census is designed to work with TPM data.

This is a good idea. Looking at the data shared by Lorenzo, I am thinking that in addition to Census we could also offer the following normalization scores (the user can then pick one):

- total counts
- total number of expressed genes (i.e. counts > 0)

Both can be calculated per cell or per cell type, like the Census scores.

And sure, we can discuss the scenarios! Also to check together whether my reasoning makes sense :D
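For illustration, both scores could be computed per cell as below (a Python sketch with hypothetical function names, operating on a plain genes x cells list, not SimBu code):

```python
def total_counts_per_cell(matrix):
    """Total counts per cell for a genes x cells matrix."""
    return [sum(col) for col in zip(*matrix)]


def expressed_genes_per_cell(matrix):
    """Number of genes with counts > 0 per cell."""
    return [sum(1 for value in col if value > 0) for col in zip(*matrix)]


# Three genes (rows) measured in three cells (columns).
counts = [[5, 0, 2],
          [0, 0, 7],
          [3, 1, 0]]

totals = total_counts_per_cell(counts)        # [8, 1, 9]
expressed = expressed_genes_per_cell(counts)  # [2, 1, 2]
```

Per-cell-type values could then be derived by aggregating these per-cell scores within each cell type.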

FFinotello commented 2 years ago

Hello! @alex-d13 and I have decided to go for a first assessment of the intrinsic mRNA bias... if any ;)

Meanwhile, I will report here a few suggestions I noted down when reading the first vignette:

[x] I find the nomenclature of the uniform and random distributions a bit confusing. Shall we rename uniform to even and random to uniform (as this is the one actually sampled from a uniform distribution)?
[x] It may not be easy to distinguish between the "spike-in" approaches used to quantify mRNA bias and those used to simulate cell fractions. The simulated spike_in (cell fraction) scenario could maybe be called controlled? Related arguments (e.g. spike_in_cell_type and spike_in_amount) should also be updated accordingly.
[x] We could try to use a consistent nomenclature in the parameters, e.g. "spike-in" vs. "spike_in", uppercase vs. lowercase arguments (Nitpicky-mode:ON :) ).
[x] We should specify which arguments are mandatory and what the default of each parameter is.
[ ] A summary of input and output files would be helpful.
[ ] Is total_read_counts a number or can it be a vector of length == num. samples?
[ ] Is unique_cell_type a single string or can it be a vector of strings of length == num. samples?
[ ] What is simulation_vector? Could we give it a more meaningful name?

alex-d13 commented 2 years ago

> I find the nomenclature of the uniform and random distributions a bit confusing. Shall we rename uniform to even and random to uniform (as this is the one actually sampled from a uniform distribution)?

I agree with the first change (uniform to even), but I would keep the name of the random scenario. After all, I think its specific feature is that the cell type fractions are random. If we had included different distributions to sample from in the random scenario, renaming it to the distribution's name might be useful. To me, the name uniform implies that the cell types are spread out uniformly rather than that the fractions are sampled from the uniform distribution.
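To make the difference between the two scenarios concrete, here is a small Python sketch. The function names are hypothetical, and normalizing the random draws so that they sum to 1 is an assumption about how the fractions are formed.

```python
import random


def even_fractions(cell_types):
    """'even' scenario: every cell type gets the same fraction."""
    share = 1 / len(cell_types)
    return {cell_type: share for cell_type in cell_types}


def random_fractions(cell_types, rng):
    """'random' scenario: draw from a uniform distribution, then normalize."""
    draws = [rng.random() for _ in cell_types]
    total = sum(draws)
    return {cell_type: d / total for cell_type, d in zip(cell_types, draws)}


cell_types = ["B cell", "T cell", "NK cell"]
even = even_fractions(cell_types)
rand = random_fractions(cell_types, random.Random(1))
```

In the even case all three fractions are 1/3; in the random case they differ from sample to sample but still sum to 1.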

> Is total_read_counts a number or can it be a vector of length == num. samples?
> Is unique_cell_type a single string or can it be a vector of strings of length == num. samples?

Both of them currently accept only a single value. If you want to generate simulations with different sequencing depths, you will have to run the function multiple times and then merge the results (I did add a method to merge simulations :))

> What is simulation_vector? Could we give it a more meaningful name?

That's where you specify the cell type fractions you want in the simulation. What could be a different name for this?

FFinotello commented 2 years ago

> That's where you specify the cell type fractions you want in the simulation. What could be a different name for this?

cell_type_fractions?! :)

omnideconv / SimBu

Package improvements #15