Vignette - Githubissues

FFinotello commented 2 years ago

Hi @alex-d13 the vignette and the tool look very good!

I have a few suggestions/questions:

[x] Sfaira: you could briefly explain what it is when you mention it, so that users who do not know it can understand whether they need it or not;
[x] In Sfaira you say there are >200 datasets: where can users find some info on which datasets, tissues, cell types, etc. are available?
[x] How can they be accessed (i.e., how to set the organisms and tissues parameters)?
[x] What does the name parameter do in dataset_sfaira_multiple?
[x] Why are 2 datasets downloaded in your example?
[ ] "You can also control the number of cells per sample by using the total_read_counts parameter [...]" -> It should be reads.
[ ] The reasoning behind the total_read_counts parameter for TPM is unclear to me. TPM, by definition, sum up to 1e6 (can be <1e6 in case some genes were filtered out). While for counts it is clear that they are directly proportional to the number of sequencing reads, TPM are not (although the presence of 0's is inversely proportional to the sequencing depth).
[x] The random simulation does not look random to me but "learnt" from the input single-cell data. Is it like this?
[x] We should have an option for the simulation to upload a tab-delimited file with the wanted cell fractions (cell-type by sample matrix with proportions, summing up to 100% in each sample).
[x] The spillover simulation framework is very useful for spillover analysis, but I would see the "spillover" as a downstream effect (i.e., a bias in deconvolution methods) rather than a characteristic of the simulated data. Maybe a name like unique or pure could work better? Other ideas?
[ ] In the spike-in simulation the fraction for one cell type is fixed, but how are the other cell type fractions modeled?
[ ] Does the generated pseudo_bulk data contain TPM or counts? And what about the ExpressionSet object?
[x] The whitelist option is very cool! Similarly, we could have a blacklist option to exclude some cell types (e.g., "other" or "ambiguous" cells) without specifying a long whitelist.
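A minimal sketch of how such a fractions file could be read and validated, assuming a tab-delimited layout with cell types as rows, samples as columns, and percentages that sum to 100 per sample. The function name `read_fraction_table` and the tolerance are hypothetical and not part of SimBu:

```python
import csv
import io


def read_fraction_table(handle, tolerance=1e-6):
    """Parse a tab-delimited cell-type-by-sample table of percentages.

    Hypothetical layout: the header row holds sample names, the first
    column holds cell-type names, and each sample column must sum to 100.
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    fractions = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cell_type, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"row {cell_type!r} has {len(values)} values, expected {len(samples)}"
            )
        for sample, value in zip(samples, values):
            fractions[sample][cell_type] = float(value)
    # Check that every sample adds up to 100 percent.
    for sample, per_type in fractions.items():
        total = sum(per_type.values())
        if abs(total - 100.0) > tolerance:
            raise ValueError(f"sample {sample!r} sums to {total}, expected 100")
    return fractions


example = (
    "cell_type\tsample1\tsample2\n"
    "T cell\t60\t25\n"
    "B cell\t30\t25\n"
    "NK cell\t10\t50\n"
)
table = read_fraction_table(io.StringIO(example))
print(table["sample1"])
```

Rejecting a file whose columns do not sum to 100 up front is one possible design choice; rescaling the columns silently would be another.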

Very cool job! Looking forward to discussing with you and the team possible ideas :)

Cheers, Francesca

alex-d13 commented 2 years ago

Thanks for the great feedback, Francesca :)

Some points I can already comment on:

> The reasoning behind the total_read_counts parameter for TPM is unclear to me. TPM, by definition, sum up to 1e6 (can be <1e6 in case some genes were filtered out). While for counts it is clear that they are directly proportional to the number of sequencing reads, TPM are not (although the presence of 0's is inversely proportional to the sequencing depth).

So the way I use this parameter is this: if you set it to, let's say, 1e7, I sample cells from the dataset and sum up their total read counts (meaning the sum of the expression values in the matrix, i.e. the column sums) until that limit is reached. With TPMs I guess you would get smaller values in the matrix, which means you can sample more cells before the limit of 1e7 is reached than with raw counts, which have higher values and therefore allow fewer cells to be sampled. Does this make sense? Do you feel like I should change the implementation in some way?

--> This is also why I say you can "control the number of cells" with this parameter.
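The sampling rule described here can be sketched roughly as follows. This is not SimBu's code: the function name, sampling with replacement, and the choice to keep the cell that crosses the limit are all assumptions made for illustration.

```python
import random


def sample_cells_until_budget(cell_totals, total_read_counts, seed=0):
    """Draw cells at random until their summed totals reach the budget.

    cell_totals: per-cell column sums of the expression matrix.
    Assumptions: cells are drawn with replacement, and the cell that
    crosses the budget is still included in the sample.
    """
    if max(cell_totals) <= 0:
        raise ValueError("at least one cell must have a positive total")
    rng = random.Random(seed)
    picked, running_total = [], 0.0
    while running_total < total_read_counts:
        index = rng.randrange(len(cell_totals))
        picked.append(index)
        running_total += cell_totals[index]
    return picked, running_total


# Cells whose values sum to 1e6 each (as TPM-normalized cells would).
tpm_picked, tpm_total = sample_cells_until_budget([1e6] * 50, 1e7)
# Cells with a made-up raw depth of 5000 counts each.
count_picked, count_total = sample_cells_until_budget([5000] * 50, 1e7)
print(len(tpm_picked), len(count_picked))
```

Under these assumptions, the number of sampled cells is fixed by the budget alone when every cell sums to 1e6, whereas with raw counts it depends on the depth of the individual cells.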

> The random simulation does not look random to me but "learnt" from the input single-cell data. Is it like this?

Correct. But if I just performed random sampling from all existing cell-types, I would get almost identical results to the uniform scenario, right?

> In the spike-in simulation the fraction for one cell type is fixed, but how are the other cell type fractions modeled?

I am using the "random" scenario (or 'learnt from database' as you proposed) for the other cell-types.
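As a rough illustration of the idea, one cell type can be pinned to a fixed fraction while the remainder is spread over the other cell types in proportion to some weights. How the "random" scenario actually derives those weights is not described in this thread, so `background_weights` and the function name below are assumptions:

```python
def spike_in_fractions(spike_type, spike_fraction, background_weights):
    """Fix one cell type's fraction and spread the remainder over the others.

    background_weights: hypothetical non-negative weights for the cell types,
    for example their cell counts in the single-cell dataset. Any weight
    given for spike_type itself is ignored.
    """
    if not 0.0 <= spike_fraction <= 1.0:
        raise ValueError("spike_fraction must lie between 0 and 1")
    others = {
        cell_type: weight
        for cell_type, weight in background_weights.items()
        if cell_type != spike_type
    }
    weight_sum = sum(others.values())
    if weight_sum <= 0:
        raise ValueError("the other cell types need a positive total weight")
    remainder = 1.0 - spike_fraction
    fractions = {
        cell_type: remainder * weight / weight_sum
        for cell_type, weight in others.items()
    }
    fractions[spike_type] = spike_fraction
    return fractions


fractions = spike_in_fractions(
    "B cell", 0.4, {"T cell": 300, "NK cell": 100, "B cell": 50}
)
print(fractions)
```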

> Does the generated pseudo_bulk data contain TPM or counts? And what about the ExpressionSet object?

I currently only output TPMs, which I normalize myself, both in the pseudo_bulk and the ExpressionSet.
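A minimal sketch of what such a normalization step could look like, assuming the pseudo-bulk sample is the per-gene sum over the sampled cells, rescaled so that the sample sums to 1e6. The function name and the gene values are made up, and no gene-length correction is applied here, so this is per-million scaling only:

```python
def pseudo_bulk_per_million(cells):
    """Sum per-gene values over the sampled cells and rescale to 1e6.

    cells: list of dicts mapping gene name to the expression value of one cell.
    Note: only per-million scaling is done; gene lengths are not used.
    """
    summed = {}
    for cell in cells:
        for gene, value in cell.items():
            summed[gene] = summed.get(gene, 0.0) + value
    total = sum(summed.values())
    if total == 0:
        raise ValueError("the sampled cells contain no expression at all")
    return {gene: value / total * 1e6 for gene, value in summed.items()}


# Two toy cells with made-up values.
cells = [
    {"CD3E": 3, "CD19": 0, "ACTB": 7},
    {"CD3E": 1, "CD19": 4, "ACTB": 5},
]
pseudo_bulk = pseudo_bulk_per_million(cells)
print(pseudo_bulk)
```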

FFinotello commented 2 years ago

Hi @alex-d13 thanks a lot for your explanations.

I am a bit underwater these days, but I will get back to you soon.

One important point where we should plan our strategy carefully is the types of input and output data.

Input scRNA-seq. As we accept both 10x and Smart-Seq2 data, where TPM makes sense for the latter but not the former and different normalization approaches are available, I would set that the users have to provide raw count data. This would also solve the issue with the total_read_counts above.

Output pseudo-bulk RNA-seq. We should output data normalized depending on the type of data we expect to be accepted by deconvolution methods (e.g., TPM, counts, FPKM/RPKM). Bulk data are not inputs of the simulator (sorry for my unclear explanation before), but the output pseudo-bulk should mimic their possible formats. We can explore additional normalizations, which could be controlled by a specific parameter.

One question would be whether we can robustly generate all possible data types starting from the single-cell counts, as well as generate some internal normalized scRNA-seq data to be used in specific settings (e.g., to apply Census effectively).
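To make the question concrete, here is a sketch of how counts, CPM, RPKM, and TPM could be derived from the same raw counts behind a single parameter. The function name, the `method` parameter, and the toy gene lengths are hypothetical; the length-based variants assume a full-length protocol such as Smart-Seq2, in line with the remark above that TPM does not make sense for 10x data:

```python
def normalize_counts(counts, gene_lengths=None, method="tpm"):
    """Convert raw counts of one sample into the requested unit.

    counts: gene name -> raw count.
    gene_lengths: gene name -> length in base pairs (needed for rpkm and tpm).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    if method == "counts":
        return dict(counts)
    if method == "cpm":
        return {gene: count / total * 1e6 for gene, count in counts.items()}
    if gene_lengths is None:
        raise ValueError(f"method {method!r} requires gene_lengths")
    if method == "rpkm":
        # Reads per kilobase of gene per million reads in the sample.
        return {
            gene: count / (gene_lengths[gene] / 1e3) / (total / 1e6)
            for gene, count in counts.items()
        }
    if method == "tpm":
        # Length-normalize first, then rescale so the sample sums to 1e6.
        rates = {gene: count / gene_lengths[gene] for gene, count in counts.items()}
        rate_sum = sum(rates.values())
        return {gene: rate / rate_sum * 1e6 for gene, rate in rates.items()}
    raise ValueError(f"unknown method {method!r}")


counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
cpm = normalize_counts(counts, method="cpm")
rpkm = normalize_counts(counts, lengths, method="rpkm")
tpm = normalize_counts(counts, lengths, method="tpm")
print(cpm, rpkm, tpm)
```

In this toy case the longer gene has three times the counts, so the two genes end up with equal TPM and RPKM values while their CPM values differ.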

Curious what you, @mlist, and @grst think about it!

Cheers, Francesca

alex-d13 commented 2 years ago

Hi Francesca,

a few comments on my side:

> I would set that the users have to provide raw count data.

This would definitely make some things easier, also with census. Census is "meant" to be used on TPM/RPKM-normalized data, but I can already run a few tests to see how its results change between raw and TPM input data. I would also have to filter sfaira beforehand to only include raw expression matrices.

Input bulk RNA-seq.

So for now I have not really implemented a way of using bulk data, since I require some kind of cell annotation. Would you say that we should then request a different type of annotation for the different bulk samples? Or just use the data without annotation, in which case I cannot run any of the simulation functions? I guess the idea was to upload bulk data to be able to compare it directly with simulated data coming from scRNA-seq, is that right?

Internal normalized scRNA-seq and output signature matrix

I am not sure if I understand that point correctly: by internal normalized scRNA-seq do you mean the way we normalize a dataset before using it for the simulation?

Best, Alex

grst commented 2 years ago

@FFinotello, what is your idea with the bulk data? I don't quite see why the simulator would need it.

FFinotello commented 2 years ago

@grst and @alex-d13 are right - I did not express myself well and I will now amend my text above.

FFinotello commented 2 years ago

Hi @alex-d13!

Do we have an updated version of the manual describing the main inputs, outputs, and arguments?

It would be extremely important to plan the benchmarking studies in human (@kathireinisch) and mouse (@LorenzoMerotto) and decide whether there are points of possible improvement.

Thanks, Francesca

omnideconv / SimBu

Vignette #9