Closed by lazappi 5 years ago
Hey Luke!
first of all, congrats on defending and thanks for setting up https://www.scrna-tools.org - it was a godsend! Thank you for the kind words - PROSSTT can learn hyperparameters from a real dataset using learn_data_summary. Pasting the documentation:
def learn_data_summary(cell_stats, gene_stats, relative_means):
    """
    Learns hyperparameters from gene and cell summaries of a real dataset. The
    simulated dataset with these hyperparameters will have similar summary
    statistics with the input dataset.

    Parameters
    ----------
    cell_stats: Series
        Each column is a cell. Contains at minimum the rows:
        - "total": sum of all UMIs in each cell.
        - "zeros": count of genes with 0 reported UMIs in each cell.
    gene_stats: Series
        Each column is a gene. Contains at minimum the rows:
        - "means": average expression of each gene over the dataset
        - "var": variance of each gene over the dataset
        - "zeros": count of cells with 0 reported UMIs for each gene
    relative_means: Series
        Relative mean expression for all genes on every lineage tree branch

    Returns
    -------
    scale parameters
        Mean and variance of the library size distribution
    average alpha
        The average alpha hyperparameter for gene variance
    average beta
        The average beta hyperparameter for gene variance
    proposed mean expression
        The proposed base expression for each gene
    """
you can see the function being used in the examples/ directory (all notebooks that start with "comparing_"). I calculated cell_stats and gene_stats with a small awk script I hacked, but of course you can calculate them (much) more conveniently with Python.
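For anyone wanting the Python route, here is a minimal pandas sketch of how the two summaries could be built. It is not taken from the PROSSTT notebooks: the helper name `summarize_counts` and the toy `counts` matrix are hypothetical, the input is assumed to be a cells x genes UMI count table, and the variance is assumed to be the sample variance (pandas default, `ddof=1`), which may differ from what the awk script computed. The output layout follows the docstring above (one row per statistic, one column per cell or gene). `relative_means` is not covered here.

```python
import pandas as pd


def summarize_counts(counts: pd.DataFrame):
    """Build cell and gene summaries from a cells x genes count table.

    Hypothetical helper: row/column layout follows the learn_data_summary
    docstring (statistics as rows, cells or genes as columns).
    """
    # Per-cell statistics: total UMIs and number of genes with zero counts.
    cell_stats = pd.DataFrame({
        "total": counts.sum(axis=1),
        "zeros": (counts == 0).sum(axis=1),
    }).T

    # Per-gene statistics: mean, variance (assumed ddof=1) and number of
    # cells with zero counts.
    gene_stats = pd.DataFrame({
        "means": counts.mean(axis=0),
        "var": counts.var(axis=0),
        "zeros": (counts == 0).sum(axis=0),
    }).T

    return cell_stats, gene_stats


# Toy example: 3 cells (rows) x 2 genes (columns).
counts = pd.DataFrame(
    [[0, 2], [1, 0], [3, 4]],
    index=["c1", "c2", "c3"],
    columns=["g1", "g2"],
)
cell_stats, gene_stats = summarize_counts(counts)
```

If the count table is stored genes x cells instead, transposing it first (`counts.T`) should give the same result.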
hope this helps, Niko
Hi Niko
Thanks! That sounds like what I was looking for, should have looked a bit harder. 😸
The learn_data_summary function doesn't seem to be in the latest release version (unless I am missing something?). Do you recommend cloning the repository and installing from that?
yeah, it is better to just clone the repo and go from there. Last release was before submission 😅
Hi
I have just been looking at PROSSTT which I think is a really cool model. Do you have any procedure for estimating the parameters it uses from a real dataset? I would like to see how the simulations PROSSTT produces match various kinds of data.
Thanks