tnagler / VineCopula

Statistical inference of vine copulas
87 stars 32 forks source link

Vectorize parameters #68

Closed notEvil closed 4 years ago

notEvil commented 4 years ago

Hi! I'm working with vine copulas where the parameters are not constant. So, for instance, I would like to get a random sample where each sample corresponds to different parameters (like rnorm(10, mean=1:10)), or the likelihood where each observation corresponds to different parameters (like dnorm(1:10, mean=1:10)). As a proof of concept I took the code of RVineSim and rewrote the C part in R (straight forward without optimization). Then I adjusted the code to use 3d arrays in family, par and par2 and it worked. Are there any good reasons why this is not implemented?

tnagler commented 4 years ago

There is not a very good reason why this is not implemented. There are some:

notEvil commented 4 years ago

I mostly agree with this point of view. What if only par and par2 are vectorized where it makes sense. For instance, add par and par2 as optional arguments to RVinePDF and RVineSim (like in dnorm and rnorm). Of course, the C-code needs adjustment, but the gains here outweigh the efforts imo. In case of RVineSim the C-code may even be replaced by R-code because the outer loop is replaced by vectorized calls to Hfunc and Hinv.

tnagler commented 4 years ago

While doing what you want just for RVineSim would be fairly straightforward, doing it for all functions in the library is a huge task. From my experience, it should take at least 3-4 days of full-time work (vectorizing the BiCop family took me much longer).

I like your idea and would be happy to accept/help with a PR, but I will most probably not do it myself.

notEvil commented 4 years ago
* If we vectorize over parameters, we also need to vectorize over the family (for e.g., a Clayton with Kendall's tau going from negative to positive).

I disagree, family defines the distribution while par and par2 parameterize it. Maybe in mixture models this would be nice, but in the usual context the distribution is fixed.

* I don't like additional arguments at all. We already specify the model in the `RVM` argument, adding more arguments containing overlapping information is confusing. (Also `dnorm` has only one `mean` and `sd` argument, no duplicated arguments for the case where you want to vectorize.)

I get it. Yet more confusing would be an RVineMatrix with 3d par and par2 in the general context (just like a univariate normal with multiple means). So the right approach would be to separate the distribution definition from the parameters (e.g. extract par and par2 from RVineMatrix). This would be a fairly easy (add parameters to functions) but backwards incompatible change.

* C-code should not be replaced by R-code. This would make things slower for 99% of the users. Modifying the C-code should be fairly straightforward anyway. The bulk of the work would be R-infrastructure: checking inputs, converting to arrays, handling errors/edge cases etc.

R-code is easier to maintain, so if things can be done in R, they should be done in R. In this case, the R-version wouldn't be much slower than the C-version because there is no big loop.

If there is a viable solution, I'd draft a PR.

tnagler commented 4 years ago

I disagree, family defines the distribution while par and par2 parameterize it. Maybe in mixture models this would be nice, but in the usual context the distribution is fixed.

In any sensible copula model with varying parameters, the range of dependence should include both positive and negative Kendall's taus. That is true for Gaussian/Student/Frank. For all others, we would have an unnecessary restriction to either only positive or only negative dependence.

An issue is the somewhat arbitrary definition of "families" in this library. We could just as well define "Clayton" as family = 3 for par > 0 and family = 23 for par < 0. Similarly, we could patch together 3/33 or 13/23 etc. (This is what is done in gamCopula, yielding four versions of "Clayton" that all allow for par in [-infty, infty]). Defining families that way would be weird for fixed-parameter models (two out of four models are identical for fixed par). But in the varying parameter case this is absolutely what we want.

To mimic that behavior, we need to vectorize over the family (that's why the BiCop functions are vectorized over family). Otherwise the whole feature is quite useless for families other than Gaussian/Student/Frank.

Yet more confusing would be an RVineMatrix with 3d par and par2 in the general context (just like a univariate normal with multiple means). So the right approach would be to separate the distribution definition from the parameters (e.g. extract par and par2 from RVineMatrix). This would be a fairly easy (add parameters to functions) but backwards incompatible change.

Yeah that would break almost all existing code, so extracting parameters is not really an option. I don't really see any other way than RVineSim(100, RVM = list(RVM1, ...., RVM100)).

R-code is easier to maintain, so if things can be done in R, they should be done in R. In this case, the R-version wouldn't be much slower than the C-version because there is no big loop.

Everything in this library can be done in R, but the C-parts exist for a good reason. I could also live with a solution that calls C for fixed-parameter models and uses R-code for varying-parameter models.

notEvil commented 4 years ago

Thanks for explaining the reasons for vectorized family!

I'll draft a PR with focus on RVineSim then.