settylab / Mellon

Non-parametric density inference for single-cell analysis.
https://mellon.readthedocs.io
GNU General Public License v3.0

Uncertainty #7

Closed: katosh closed this 9 months ago

katosh commented 11 months ago

The core objective of this PR is to introduce uncertainty estimation into Mellon's primary results.

New Features

with_uncertainty Parameter

Integrates a boolean parameter with_uncertainty across all estimators: DensityEstimator, TimeSensitiveDensityEstimator, FunctionEstimator, and DimensionalityEstimator. It modifies the fitted predictor, accessible via the .predict property, to include the following methods:

gp_type Parameter

Introduces the gp_type parameter to all relevant estimators to explicitly specify the Gaussian Process (GP) sparsification strategy, replacing the previously used method argument (with options auto, fixed, and percent) that implicitly controlled sparsification. The available options for gp_type include:

This new parameter comes with additional validation that ensures no contradictory parameters are specified. If inconsistencies are detected, an explanatory message guides the user on how to fix the issue. The value can be either a string matching one of the options above or an instance of the mellon.parameters.GaussianProcessType Enum. A partial match logs a warning, and the closest option is used. Defaults to 'sparse_cholesky'.

Note: Nyström strategies are not applicable to the FunctionEstimator.
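The string handling described for gp_type could be sketched roughly as follows. This is an illustration under assumptions, not Mellon's code: the enum `GPType` and the helper `resolve_gp_type` are hypothetical stand-ins for `mellon.parameters.GaussianProcessType` and its validation, and using `difflib` with a 0.6 cutoff to find the "closest match" is a guess at the matching rule.

```python
import logging
from difflib import get_close_matches
from enum import Enum

logger = logging.getLogger("gp_type_sketch")


class GPType(Enum):
    """Hypothetical stand-in for mellon.parameters.GaussianProcessType."""

    FULL = "full"
    FULL_NYSTROEM = "full_nystroem"
    SPARSE_CHOLESKY = "sparse_cholesky"
    SPARSE_NYSTROEM = "sparse_nystroem"


def resolve_gp_type(value="sparse_cholesky"):
    """Map a string or enum member to a GPType (assumed matching rule)."""
    if isinstance(value, GPType):
        return value
    options = [member.value for member in GPType]
    key = str(value).lower()
    if key in options:
        return GPType(key)
    # Partial match: fall back to the closest option and log a warning.
    close = get_close_matches(key, options, n=1, cutoff=0.6)
    if not close:
        raise ValueError(
            f"Unknown gp_type {value!r}. Choose one of: {', '.join(options)}."
        )
    logger.warning("gp_type %r is not exact; using %r.", value, close[0])
    return GPType(close[0])
```

With this sketch, `resolve_gp_type("sparse_cholesk")` logs a warning and resolves to the `sparse_cholesky` member, while a string that resembles none of the options raises a `ValueError` listing the valid choices.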

y_is_mean Parameter

Adds a boolean parameter y_is_mean to FunctionEstimator, affecting how y values are interpreted:

This change benefits DensityEstimator, TimeSensitiveDensityEstimator, and DimensionalityEstimator where function values are predicted for out-of-sample locations after mean GP computation.

check_rank Parameter

Introduces the check_rank parameter to all relevant estimators. This boolean parameter explicitly controls whether the rank check is performed, specifically in the gp_type="sparse_cholesky" case. The rank check assesses whether the chosen landmarks provide adequate complexity by examining the approximate rank of the covariance matrix, and issues a warning if it is insufficient. Allowed values are:

The default setting aims to skip unnecessary computation when the landmarks are so numerous that insufficient complexity becomes improbable.
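To make the idea of an approximate-rank check concrete, here is a minimal sketch that requires NumPy. It is not Mellon's implementation: the helper names, the RBF kernel, the relative eigenvalue tolerance, and the rule "warn when the landmark covariance has full approximate rank" are all assumptions chosen for illustration.

```python
import warnings

import numpy as np


def rbf_kernel(points, length_scale):
    """Squared-exponential covariance between all pairs of points."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs**2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale**2))


def approximate_rank(cov, tol=1e-6):
    """Count eigenvalues above tol times the largest eigenvalue."""
    eigenvalues = np.linalg.eigvalsh(cov)
    return int(np.sum(eigenvalues > tol * eigenvalues.max()))


def check_landmark_rank(landmarks, length_scale, tol=1e-6):
    """Warn if the landmark covariance has full approximate rank.

    Assumed criterion: full rank suggests that the landmarks do not yet
    capture the complexity of the function, so more may be needed.
    """
    cov = rbf_kernel(landmarks, length_scale)
    rank = approximate_rank(cov, tol)
    if rank == len(landmarks):
        warnings.warn(
            f"Approximate rank {rank} equals the number of landmarks; "
            "consider increasing the number of landmarks."
        )
    return rank
```

Densely placed landmarks with a long length scale yield a covariance of low approximate rank and pass silently, whereas widely separated landmarks with a short length scale yield a nearly diagonal covariance of full rank and trigger the warning.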

normalize Parameter

The normalize parameter applies to both the .mean method and the .__call__ method of the mellon.Predictor class. When set to True, these methods subtract log(number of observations) from the returned value. This is particularly useful with the DensityEstimator, where normalization adjusts for the number of cells in the training sample so that densities can be compared accurately between datasets of different sizes. By default, the parameter is set to False, meaning that density differences due to variations in total cell number remain uncorrected.
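The arithmetic behind this normalization can be sketched in a few lines. The helper `normalize_log_density` is hypothetical and only mirrors the subtraction of log(number of observations) described above; it does not call Mellon.

```python
import math


def normalize_log_density(log_density, n_obs):
    """Subtract log(n_obs) from each log-density value."""
    offset = math.log(n_obs)
    return [value - offset for value in log_density]


# Two datasets in which the same cell state makes up half of the cells,
# but one dataset has four times as many cells overall.
small = normalize_log_density([math.log(1000 * 0.5)], n_obs=1000)
large = normalize_log_density([math.log(4000 * 0.5)], n_obs=4000)
```

Before normalization the two log densities differ by log(4) purely because of dataset size; after normalization both equal log(0.5) up to floating-point error, so they can be compared directly.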

normalize_per_time_point Parameter

This parameter fine-tunes the TimeSensitiveDensityEstimator to handle variations in sampling bias across different time points, ensuring both continuity and differentiability in the resulting density estimation. Notably, it also allows the estimate to reflect the growth of a population even if the same number of cells was sampled at each time point.

The normalization is realized by adjusting the nearest-neighbor distances (nn_distances) to reflect the deviation from an expected cell count.

Options:

Notes:

Enhancements

Changes

codecov[bot] commented 11 months ago

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (c274b7f) 92.65% compared to head (924aa5a) 97.47%. Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main       #7      +/-   ##
==========================================
+ Coverage   92.65%   97.47%   +4.81%
==========================================
  Files          34       35       +1
  Lines        2547     3518     +971
==========================================
+ Hits         2360     3429    +1069
+ Misses        187       89      -98
```

| [Files](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab) | Coverage Δ | |
|---|---|---|
| [mellon/\_\_init\_\_.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL19faW5pdF9fLnB5) | `100.00% <100.00%> (ø)` | |
| [mellon/\_conditional.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL19jb25kaXRpb25hbC5weQ==) | `100.00% <ø> (ø)` | |
| [mellon/\_inference.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL19pbmZlcmVuY2UucHk=) | `100.00% <ø> (ø)` | |
| [mellon/\_parameters.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL19wYXJhbWV0ZXJzLnB5) | `100.00% <ø> (ø)` | |
| [mellon/\_util.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL191dGlsLnB5) | `100.00% <ø> (ø)` | |
| [mellon/base\_cov.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL2Jhc2VfY292LnB5) | `94.40% <100.00%> (+0.38%)` | :arrow_up: |
| [mellon/base\_model.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL2Jhc2VfbW9kZWwucHk=) | `93.06% <100.00%> (+8.55%)` | :arrow_up: |
| [mellon/compute\_ls\_time.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL2NvbXB1dGVfbHNfdGltZS5weQ==) | `100.00% <100.00%> (ø)` | |
| [mellon/cov.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL2Nvdi5weQ==) | `100.00% <100.00%> (ø)` | |
| [mellon/decomposition.py](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab#diff-bWVsbG9uL2RlY29tcG9zaXRpb24ucHk=) | `91.93% <100.00%> (+8.15%)` | :arrow_up: |
| ... and [18 more](https://app.codecov.io/gh/settylab/Mellon/pull/7?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=settylab) | | |


katosh commented 11 months ago

@ManuSetty The refactoring is done. I made the changes I mentioned and removed the argument method (with options auto, fixed, and percent) in favor of gp_type, which lets the user choose the sparsification method explicitly:

    gp_type : str or GaussianProcessType
        The type of sparsification used for the Gaussian Process
         - 'full' Non-sparse Gaussian Process
         - 'full_nystroem' Sparse GP with Nyström rank reduction without landmarks,
            which lowers the computational complexity.
         - 'sparse_cholesky' Sparse GP using landmarks/inducing points,
            typically employed to enable scalable GP models.
         - 'sparse_nystroem' Sparse GP using landmarks or inducing points,
            along with an improved Nyström rank reduction method that balances
            accuracy with efficiency.

        The value can be either a string matching one of the above options or an instance of
        the `mellon.parameters.GaussianProcessType` Enum. If a partial match is found with the
        Enum, a warning will be logged, and the closest match will be used.
        Defaults to 'sparse_cholesky'.

This comes with additional parameter validation that makes sure no contradictory parameters are specified.
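As a rough illustration of what such a consistency check might look like, the sketch below rejects landmark settings that contradict the chosen gp_type. The function `validate_gp_params` and its specific rules are assumptions derived from the option descriptions quoted above (the 'full' variants work without landmarks, the 'sparse' variants with them); Mellon's actual checks may differ.

```python
FULL_TYPES = {"full", "full_nystroem"}
SPARSE_TYPES = {"sparse_cholesky", "sparse_nystroem"}


def validate_gp_params(gp_type, n_landmarks, n_samples):
    """Raise ValueError with a suggested fix if settings contradict."""
    if gp_type not in FULL_TYPES | SPARSE_TYPES:
        raise ValueError(f"Unknown gp_type {gp_type!r}.")
    # Assumption: landmarks only take effect if there are fewer of them
    # than samples, and zero landmarks means "no landmarks".
    uses_landmarks = 0 < n_landmarks < n_samples
    if gp_type in FULL_TYPES and uses_landmarks:
        raise ValueError(
            f"gp_type={gp_type!r} does not use landmarks, but "
            f"n_landmarks={n_landmarks} was given. Set n_landmarks=0 "
            "or choose 'sparse_cholesky' or 'sparse_nystroem'."
        )
    if gp_type in SPARSE_TYPES and not uses_landmarks:
        raise ValueError(
            f"gp_type={gp_type!r} requires 0 < n_landmarks < n_samples, "
            f"but n_landmarks={n_landmarks} and n_samples={n_samples}. "
            "Adjust n_landmarks or choose 'full' or 'full_nystroem'."
        )
    return True
```

The point of the design is that each error names both the conflicting values and at least one way to resolve the conflict, rather than silently picking a strategy as the removed method argument did.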

katosh commented 10 months ago

Commit 27b7d6386835cbbb51719a2a357b41e2b249247f resolves a major ambiguity by harmonizing the new uncertainty computation for the DensityEstimator with the noise input handling of the FunctionEstimator.