sdv-dev / Copulas

A library to model multivariate data using copulas.
https://sdv.dev/Copulas/

The method 'select_copula' of Bivariate returns the wrong CopulaType #101

Closed gaboolic closed 5 years ago

gaboolic commented 5 years ago

Description

In copulas/bivariate/base.py, the CopulaTypes enum is defined as:

    class CopulaTypes(Enum):
        CLAYTON = 0
        FRANK = 1
        GUMBEL = 2
        INDEPENDENCE = 3

But the method 'select_copula' builds its candidate list in a different order:

    copula_candidates = [frank]
    copula_candidates.append(clayton)
    copula_candidates.append(gumbel)

The two orders don't match. When frank.tau > 0 and frank is the best candidate, select_copula returns CLAYTON. It should return FRANK.
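The mismatch can be illustrated with a minimal sketch (the names mirror the issue description; this is not the library's actual code):

```python
from enum import Enum

class CopulaTypes(Enum):
    CLAYTON = 0
    FRANK = 1
    GUMBEL = 2
    INDEPENDENCE = 3

# select_copula builds its candidate list as [frank, clayton, gumbel],
# so list index 0 corresponds to FRANK, not CLAYTON.
candidate_types = [CopulaTypes.FRANK, CopulaTypes.CLAYTON, CopulaTypes.GUMBEL]

selected = 0  # suppose the scoring picks index 0, i.e. frank

# Buggy mapping: treats the list index as an enum value.
wrong = CopulaTypes(selected)      # CopulaTypes.CLAYTON
# Correct mapping: read the type off the selected candidate itself.
right = candidate_types[selected]  # CopulaTypes.FRANK
```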

csala commented 5 years ago

Thanks for reporting this @gaboolic !

It definitely looks like this should be fixed.

Would you mind making a PR fixing it? :-)

My suggested approach would be to return copula_candidates[selected_copula].copula_type instead of CopulaTypes(selected_copula):

    selected_copula = np.argmax(cost_LR)
    selected_theta = theta_candidates[selected_copula]
    return copula_candidates[selected_copula].copula_type, selected_theta
gaboolic commented 5 years ago

OK, thanks for your prompt reply!

csala commented 5 years ago

This issue seems to go much deeper than expected.

I reproduce here a short conversation that happened on a private channel between @aliciasun and me.

In the base bivariate copula class, we have the select_copula method that is supposed to select the best copula (Frank, Gumbel or Clayton) by doing the following:

Given our candidate copulas, the procedure proposed for selecting the one that best fits a dataset of pairs :math:\{(u_j, v_j)\}, j=1,2,...,n is as follows:

  1. Estimate the most likely parameter :math:\theta of each copula candidate for the given dataset.
  2. Construct :math:R(z|\theta). Calculate the area under the tail for each of the copula candidates.
  3. Compare the areas: :math:a_u achieved using the empirical copula against the ones achieved for the copula candidates. Score the outcome of the comparison from 3 (best) down to 1 (worst).
  4. Proceed as in steps 2-3 with the lower tail and function :math:L.
  5. Finally the sum of empirical upper and lower tail functions is compared against :math:R + L. Scores of the three comparisons are summed and the candidate with the highest value is selected.

NOTE: These steps are taken directly from the "Copula Graphical Models for Wind Resource Estimation" paper: https://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uploads/Main/IJCAI_sub_KV_AC_UM.pdf

However, when looking at the actual code I see:

        z_left, L, z_right, R = cls.compute_empirical(X)
        left_dependence, right_dependence = cls.get_dependencies(
            copula_candidates, z_left, z_right)

        # compute L2 distance from empirical distribution
        cost_L = [np.sum((L - l) ** 2) for l in left_dependence]
        cost_R = [np.sum((R - r) ** 2) for r in right_dependence]
        cost_LR = np.add(cost_L, cost_R)

        selected_copula = np.argmax(cost_LR)

If I understand this correctly, the current code is selecting the copula based on a COST instead of a SCORE, which means that the copula that ends up being selected is not the one closest to the empirical distribution, but rather the one furthest away from it.
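A toy example (with made-up cost values) shows the effect: argmax over an L2 cost picks the worst-fitting candidate, while argmin would pick the closest one:

```python
import numpy as np

# Hypothetical L2 costs for [frank, clayton, gumbel]; lower = closer fit.
cost_LR = np.array([0.02, 0.50, 0.13])

worst = np.argmax(cost_LR)  # index 1: the candidate FURTHEST from the empirical copula
best = np.argmin(cost_LR)   # index 0: the candidate closest to it
```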

The conclusion reached was:

Yeah, I think it should be reversed. Also, to follow the exact description in the paper, each of cost_L/cost_R/cost_LR should be sorted and assigned a rank first (where the smallest element has a rank of 1). Each of cost_L/cost_R/cost_LR should then be compared; right now the final selection is based only on cost_LR.

And

The problem is that cost_LR should not be cost_L + cost_R as it is now, but rather the score computed when comparing the sum of the empirical upper and lower tail functions against the candidate equivalent. So we should really compute rank_L, rank_R and rank_LR independently, and then pick the candidate with the highest sum, as you suggested.
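One possible sketch of that rank-based selection, following the paper's scheme of scoring the best fit highest. The rank_scores helper and the cost values are hypothetical, and cost_LR is simplified here as cost_L + cost_R even though, as noted above, it should really come from comparing the summed tail functions directly:

```python
import numpy as np

def rank_scores(costs):
    """Score candidates from len(costs) (lowest cost, best fit) down to 1 (worst).

    Hypothetical helper sketching the paper's scoring; not the library's code.
    """
    order = np.argsort(costs)  # candidate indices from lowest to highest cost
    scores = np.empty(len(costs), dtype=int)
    scores[order] = np.arange(len(costs), 0, -1)  # best -> 3, ..., worst -> 1
    return scores

# Made-up L2 costs for [frank, clayton, gumbel] on each tail.
cost_L = np.array([0.10, 0.40, 0.20])
cost_R = np.array([0.05, 0.30, 0.35])
cost_LR = cost_L + cost_R  # simplification, see note above

# Rank each comparison independently, then sum the three scores.
total = rank_scores(cost_L) + rank_scores(cost_R) + rank_scores(cost_LR)
selected = int(np.argmax(total))  # candidate with the highest summed score
```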

Since the issue has grown bigger than we initially thought, I'm assigning it to @JDTheRipperPC, who, apart from fixing the problem, will work on adding tests that reproduce and validate the correct behavior.