sdv-dev / Copulas

A library to model multivariate data using copulas.
https://sdv.dev/Copulas/
Other
555 stars 106 forks source link

not getting the correct pdf? #254

Open huazuhao opened 3 years ago

huazuhao commented 3 years ago

Copulas version: 0.5.0 Python version: 3.7.4

I hope to customize a Gaussian copula by using the from_dict() function.

To test out this feature, I am trying to generate a Gaussian copula with a covariance matrix that is a 3*3 identity matrix. I specify the marginals as three beta distributions with some parameters.

Then, I know that because the covariance matrix is the identity matrix, I can also compute the joint distribution of those three random variables by simply multiplying those random variables.

However, the problem is that the result from those two methods don't match.

Here is the code:

import numpy as np
from copulas.multivariate import GaussianMultivariate
from copulas.multivariate import Multivariate
import pandas as pd

cov_matrix = np.eye(3)

#parameters for the beta distributions
a1 = 0.5
b1 = 0.09

a2 = 0.56
b2 = 0.11

a3 = 6
b3 = 0.3

univerates = [{'loc': 0,
  'scale': 1,
  'a': a1,
  'b': b1,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': 0,
  'scale': 1,
  'a': a2,
  'b': b2,
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': 0,
  'scale': 1,
  'a': a3,
  'b': b3,
  'type': 'copulas.univariate.beta.BetaUnivariate'}]

parameters = {}
parameters['covariance'] = cov_matrix
parameters['univariates'] = univerates
parameters['type'] = 'copulas.multivariate.gaussian.GaussianMultivariate'
parameters['columns'] = ['x', 'y', 'z']

new_dist = Multivariate.from_dict(parameters)

data_points = pd.DataFrame()
data_points['x'] = [0.9]
data_points['y'] = [0.8]
data_points['z'] = [0.9]

new_dist.pdf(data_points)

from the above code, the pdf I get at (0.9,0.8,0.9) is 0.024958292096167078 based on the copula package.

However, by directly computing the pdf from three independent beta distributions, I get 0.521737492922503 as the pdf at (0.9,0.8,0.9).

Here is the code for directly computing the pdf

from scipy.stats import beta
beta.pdf(0.9,a1,b1)*beta.pdf(0.8,a2,b2)*beta.pdf(0.9,a3,b3)

Thank you.

npatki commented 3 years ago

Very strange! Using the same setup, I see that the methods match when I use cdf but notpdf.

More interestingly, under-the-hood, both are calling either stats.multivariate_normal.cdf or stats.multivariate_normal.pdf with the same variables (see cdf and pdf definitions).

There appears to be a bug somewhere but I'm not sure where. I'm leaving this open while we continue to investigate. Please update if you figure anything out!

huazuhao commented 3 years ago

Hi, You are correct that the cdf is correct. I use the R copula package and doubled checked with this python package and the cdf from the two sides match. Yeah, kind of strange as to what is going on. Indeed, under-the-hood, everything looks correct from first glance. But I guess there must be a problem somewhere, otherwise the pdf would also be correct.