Open moonman925 opened 4 years ago
Hi. For mixture distributions there generally isn't a closed form for the quantile function. You could use something like Newton's method on the CDF to approximate a quantile function: given a mixture of distributions with known CDFs, the mixture CDF is a weighted sum of the component CDFs (which I believe the mixture distributions already implement).
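A minimal sketch of that idea, with made-up component parameters (`mixture_quantile` here is just an illustrative helper, not a TFP API): the derivative of the mixture CDF is the mixture PDF, which gives the Newton step.

```python
import numpy as np
from tensorflow_probability import distributions as tfd

# Example mixture: weights and component parameters are assumptions for illustration.
weights = np.array([0.3, 0.7], dtype=np.float32)
components = tfd.Normal(loc=[0., 4.], scale=[1., 2.])

def mixture_cdf(x):
    # Mixture CDF = weighted sum of component CDFs.
    return np.sum(weights * components.cdf(x).numpy())

def mixture_pdf(x):
    # Its derivative, the mixture PDF, is the Newton step's denominator.
    return np.sum(weights * components.prob(x).numpy())

def mixture_quantile(p, x0=0.0, tol=1e-6, max_iter=100):
    """Solve mixture_cdf(x) = p via Newton iterations x <- x - (F(x) - p) / f(x)."""
    x = x0
    for _ in range(max_iter):
        step = (mixture_cdf(x) - p) / mixture_pdf(x)
        x -= step
        if abs(step) < tol:
            break
    return x

q = mixture_quantile(0.9)
print(q, mixture_cdf(q))  # mixture_cdf(q) should be close to 0.9
```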
Feel free to reopen this if you want to discuss more.
Actually MixtureSameFamily doesn't implement quantile. We could do that the way you suggest. PRs welcome.
@brianwa84 Yep that's what I was mentioning :P. Will change title of this issue to reflect that.
Oh never mind, I was thinking backward. The CDF you can compute as a weighted sum, and it is already implemented. Not the quantile.
For the fellow strugglers -- just use `pynverse`.
I have a model which predicts a mixture of Gaussians with 4 components. Here is how I get quantiles for one observation:
```python
from pynverse import inversefunc
from tensorflow_probability import distributions as tfd

mixture_distribution = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=alpha_pred[0, :]),
    components_distribution=tfd.Normal(
        loc=mu_pred[0, :],
        scale=sigma_pred[0, :]))

cdf_fit = lambda x: mixture_distribution.cdf(x).numpy()

%timeit quantile_func_fit = inversefunc(cdf_fit, domain=[y_train.min(), y_train.max()], accuracy=6)
# 7.79 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

quantile_func_fit([0.5, 0.9, 0.95, 0.99])
# >>> array([15.26007136, 37.84270684, 39.90694571, 42.79034905])

cdf_fit([15.26007136, 37.84270684, 39.90694571, 42.79034905])
# >>> array([0.5, 0.90000004, 0.95, 0.99], dtype=float32)
```
Good enough for any practical application.
This is a one-core computation, using bare-bones NumPy. I'm sure a proper TensorFlow implementation would be even faster. @brianwa84 why is this issue closed?
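For the curious, a rough sketch of what a vectorised TF-side inversion could look like: plain bisection on the mixture CDF using ordinary TF ops, so the whole batch is handled at once. The bracket `[lo, hi]` and the iteration count below are assumptions you would tune to your data; `mixture_quantile_bisect` is a hypothetical helper, not a TFP API.

```python
import tensorflow as tf

def mixture_quantile_bisect(dist, probs, lo, hi, num_iter=60):
    """Hypothetical helper: batched bisection on dist.cdf to approximate quantiles.

    `probs` is a tensor broadcast against the distribution's batch shape, and
    [lo, hi] must bracket all the quantiles you ask for."""
    lo = tf.broadcast_to(tf.cast(lo, tf.float32), tf.shape(probs))
    hi = tf.broadcast_to(tf.cast(hi, tf.float32), tf.shape(probs))
    for _ in range(num_iter):
        mid = (lo + hi) / 2.
        too_low = dist.cdf(mid) < probs   # whole batch evaluated in one call
        lo = tf.where(too_low, mid, lo)
        hi = tf.where(too_low, hi, mid)
    return (lo + hi) / 2.

# e.g. medians for every observation in the batch:
# medians = mixture_quantile_bisect(mixture_distribution, tf.fill([batch_size], 0.5),
#                                   lo=y_train.min(), hi=y_train.max())
```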
In the meantime I hacked together a rather efficient quantile approximation. USE WITH CAUTION! READ THE CODE
```python
import numpy as np


def mixture_distribution_quantiles(mixture_distribution, probs, N_grid_points=int(1e3)):
    # base_grid = get_base_grid(mixture_distribution)
    # TODO: need a smarter grid (or you can customize it to your dataset)
    base_grid = np.linspace(-5, 20, num=N_grid_points)  # hacked in order to fit the simulation below
    full_grid = np.transpose(np.tile(base_grid, (mixture_distribution.batch_shape[0], 1)))
    cdf_grid = mixture_distribution.cdf(full_grid).numpy()  # this is fully parallelized and even uses GPU
    grid_check = (cdf_grid.min(axis=0).max() <= min(probs)) & (max(probs) <= cdf_grid.max(axis=0).min())
    if not grid_check:
        raise RuntimeError('Grid does not span full CDF range needed for interpolation!')
    probs_row_grid = np.transpose(np.tile(probs, (cdf_grid.shape[0], 1)))

    def get_quantiles_for_one_observation(cdf_grid_one_obs):
        return base_grid[np.argmax(np.greater(cdf_grid_one_obs, probs_row_grid), axis=1)]

    # TODO: this is the main performance bottleneck. uses only one CPU core
    # (a vectorised alternative is sketched after the timing output below)
    quantiles_grid = np.apply_along_axis(
        func1d=get_quantiles_for_one_observation,
        axis=0,
        arr=cdf_grid,
    )
    return quantiles_grid


if __name__ == "__main__":
    from tensorflow_probability import distributions as tfd

    quantiles_to_compute = np.array([0.005, 0.1, 0.25, 0.5, 0.75, 0.9, 0.995])
    N_predictions = int(4 * 1e3)
    N_components = 5
    N_grid_points = int(1e4)

    pi_sim = np.random.uniform(size=(N_predictions, N_components)).astype(np.float32)
    pi_sim = pi_sim / np.transpose(np.tile(pi_sim.sum(axis=1), (N_components, 1)))
    mean_sim = np.random.uniform(low=5, high=10, size=(N_predictions, N_components)).astype(np.float32)
    sd_sim = np.random.uniform(low=0.1, high=1, size=(N_predictions, N_components)).astype(np.float32)

    mixture_distribution = tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(probs=pi_sim),
        components_distribution=tfd.Normal(
            loc=mean_sim,
            scale=sd_sim,
            validate_args=True,
            allow_nan_stats=False,
        )
    )

    quantiles_for_each_prediction = mixture_distribution_quantiles(
        mixture_distribution=mixture_distribution,
        probs=quantiles_to_compute,
        N_grid_points=N_grid_points,
    )
    print(quantiles_for_each_prediction.shape)

    # test
    one_obs = 100
    quantiles_one_obs = quantiles_for_each_prediction[:, one_obs]
    probs_validation = [round(mixture_distribution.cdf(quant_val).numpy()[one_obs], 3) for quant_val in quantiles_one_obs]
    print(f'''
    probs input: {quantiles_to_compute}
    probs approx: {probs_validation}
    ''')
```
Result:

```
probs input: [0.005 0.1 0.25 0.5 0.75 0.9 0.995]
probs approx: [0.005, 0.1, 0.251, 0.501, 0.751, 0.9, 0.995]
```

good enough!
Setup: 16 Xeon cores, lots of RAM, no GPU. This simulation eats roughly 10GB (peak).
```python
%%timeit
quantiles_for_each_prediction = mixture_distribution_quantiles(
    mixture_distribution=mixture_distribution,
    probs=quantiles_to_compute,
    N_grid_points=N_grid_points,
)
# 11.7 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
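As a hedged aside on that TODO: the `np.apply_along_axis` step could in principle be replaced by one broadcasted comparison, trading memory for speed. This is not part of the original code, just a sketch; it allocates a `(n_probs, n_grid, n_obs)` boolean array, so chunk over observations if that gets too big.

```python
# Drop-in replacement for the np.apply_along_axis step inside
# mixture_distribution_quantiles: one broadcasted comparison finds, for every
# prob and every observation, the first grid point whose CDF exceeds that prob.
probs = np.asarray(probs)
mask = cdf_grid[None, :, :] > probs[:, None, None]  # (n_probs, n_grid, n_obs) boolean
first_exceed = mask.argmax(axis=1)                   # index of first True along the grid axis
quantiles_grid = base_grid[first_exceed]             # same (n_probs, n_obs) layout as before
```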
@brianwa84 please re-open the issue!
Reopening on request.
I am using it to invert a logistic mixture:

```python
mixture_distribution = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=pi_sim),
    components_distribution=tfd.Logistic(
        loc=mean_sim,
        scale=sd_sim,
    )
)
```

It seems that observations in `quantiles_for_each_prediction` are very different: the mean error (`probs_validation - quantiles_to_compute`) is bigger than 1e-5. Any suggestions to optimize?
Increase `N_grid_points` until you reach sufficient precision. If you run out of memory, you can just batch the grid. If performance is not an issue, use `pynverse` as in my other example above.

Also: the logistic distribution is similar to the normal, so the example above works, but generally you want an appropriate grid for the linear interpolation. It should cover the full domain of your mixture, but at the same time not be too broad, so that you have enough points. `pynverse` solves this problem in a very general fashion, however it's not vectorised, so you will quickly end up having scalability issues.
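For what it's worth, one possible shape for the commented-out `get_base_grid` above would be to derive the range from the mixture's own moments (`MixtureSameFamily` implements `mean()` and `variance()`). This is only a sketch; the padding factor `n_sigmas` is an assumption to tune, and heavy-tailed components may need a wider span.

```python
import numpy as np

def get_base_grid(mixture_distribution, N_grid_points=int(1e3), n_sigmas=6.0):
    """Hypothetical grid builder: span the batch's mean +/- n_sigmas * stddev."""
    means = mixture_distribution.mean().numpy()
    stds = mixture_distribution.stddev().numpy()
    lo = (means - n_sigmas * stds).min()
    hi = (means + n_sigmas * stds).max()
    return np.linspace(lo, hi, num=N_grid_points)
```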
Are there any plans to implement an approximation to the quantile function of `MixtureSameFamily`, perhaps using information from the `components_distribution` for efficiency?
Any updates on this? I was surprised to find out that even the Student's t distribution does not have `quantile` in TFP, while SciPy does.
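For reference, the SciPy counterpart being compared against is the `ppf` method:

```python
from scipy import stats

stats.t.ppf(0.95, df=3)  # Student's t quantile in SciPy, approximately 2.353
```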
PRs are welcome, but we're not pursuing a quantile function right now for the Student T. We could use a solver along with the CDF function as was discussed, but that might have surprising runtime characteristics.
Any plans to implement quantile?
When will the quantile (inverse CDF) of these kinds of distributions be implemented? I have the nightly version installed but found that `quantile` is not implemented.