soedinglab / prosstt

PRObabilistic Simulations of ScRNA-seq Tree-like Topologies
http://dx.doi.org/10.1093/bioinformatics/btz078
GNU General Public License v3.0

Estimating PROSSTT parameters #12

Closed lazappi closed 5 years ago

lazappi commented 5 years ago

Hi

I have just been looking at PROSSTT which I think is a really cool model. Do you have any procedure for estimating the parameters it uses from a real dataset? I would like to see how the simulations PROSSTT produces match various kinds of data.

Thanks

galicae commented 5 years ago

Hey Luke!

First of all, congrats on defending, and thanks for setting up https://www.scrna-tools.org - it was a godsend! Thank you for the kind words. PROSSTT can learn hyperparameters from a real dataset using learn_data_summary. Pasting the documentation:

def learn_data_summary(cell_stats, gene_stats, relative_means):
    """
    Learns hyperparameters from gene and cell summaries of a real dataset. The
    simulated dataset with these hyperparameters will have similar summary
    statistics with the input dataset.

    Parameters
    ----------
    cell_stats: Series
        Each column is a cell. Contains at minimum the rows:
        - "total": sum of all UMIs in each cell.
        - "zeros": count of genes with 0 reported UMIs in each cell.
    gene_stats: Series
        Each column is a gene. Contains at minimum the rows:
        - "means": average expression of each gene over the dataset
        - "var": variance of each gene over the dataset
        - "zeros": count of cells with 0 reported UMIs for each gene
    relative_means: Series
        Relative mean expression for all genes on every lineage tree branch

    Returns
    -------
    scale parameters
        Mean and variance of the library size distribution
    average alpha
        The average alpha hyperparameter for gene variance
    average beta
        The average beta hyperparameter for gene variance
    proposed mean expression
        The proposed base expression for each gene
    """

You can see the function being used in the examples/ directory (all notebooks that start with "comparing_"). I calculated cell_stats and gene_stats with a small awk script I hacked together, but of course you can calculate them (much) more conveniently with Python.
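For illustration, here is a minimal sketch of how the two summaries could be computed with pandas. It assumes a UMI count matrix with cells as rows and genes as columns. The helper name summarize_counts and the toy matrix are hypothetical and not part of PROSSTT, and the exact container layout that learn_data_summary expects should be checked against the "comparing_" notebooks.

```python
import pandas as pd


def summarize_counts(counts):
    """Compute per-cell and per-gene summaries from a UMI count matrix.

    counts: DataFrame with cells as rows and genes as columns (assumed layout).
    Returns (cell_stats, gene_stats), using the field names from the
    learn_data_summary docstring.
    """
    is_zero = counts == 0
    # One entry per cell: library size and number of undetected genes.
    cell_stats = pd.DataFrame({
        "total": counts.sum(axis=1),
        "zeros": is_zero.sum(axis=1),
    })
    # One entry per gene: mean, variance, and number of cells with no UMIs.
    # Note: pandas computes the sample variance (ddof=1) by default.
    gene_stats = pd.DataFrame({
        "means": counts.mean(axis=0),
        "var": counts.var(axis=0),
        "zeros": is_zero.sum(axis=0),
    })
    return cell_stats, gene_stats


# Hypothetical toy data: 3 cells x 3 genes.
toy_counts = pd.DataFrame(
    [[0, 3, 1],
     [2, 0, 0],
     [4, 1, 0]],
    index=["c1", "c2", "c3"],
    columns=["g1", "g2", "g3"],
)
cell_stats, gene_stats = summarize_counts(toy_counts)
```

Each summary is indexed by cell or gene name, so a single field such as cell_stats["total"] is a pandas Series. The relative_means argument is not covered here, since it depends on how cells are assigned to lineage tree branches.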

hope this helps, Niko

lazappi commented 5 years ago

Hi Niko

Thanks! That sounds like what I was looking for. I should have looked a bit harder. 😸

lazappi commented 5 years ago

The learn_data_summary function doesn't seem to be in the latest release version (unless I am missing something?). Do you recommend cloning the repository and installing from that?

galicae commented 5 years ago

Yeah, it is better to just clone the repo and go from there. The last release was before submission :sweat_smile: