vinecopulib / pyvinecopulib

A Python library for vine copula models
MIT License

Underestimation of the standard deviation in data simulated by VineCopula #123

Closed · Harper77777 closed this 1 month ago

Harper77777 commented 8 months ago

We have recently been working on fitting copulas in 2800 dimensions. Fortunately, pre-existing packages are available for this purpose, and we have used both R and Python to fit the copulas and compare the results. During our experiments, we encountered the following two issues:

1. The dimensionality of 2800 is too high to fit the copula all at once, so we have to split the fitting process into 100 separate fits. However, this approach may fail to capture the correlations between some dimensions. We would like to ask whether there are effective solutions for such high-dimensional problems.

2. After completing the separate fits, we generated ten thousand random samples. While the simulated data closely match the empirical data in terms of the means, the standard deviations are severely underestimated. We are unsure of the reason behind this. Is this behavior normal? If so, are there any solutions?

We are immensely grateful for your contributions to the field of multidimensional copulas, and we have greatly benefited from them. We look forward to your response.

tnagler commented 8 months ago

Hi.

  1. For such high-dimensional vines, you should use the trunc_lvl parameter, which cuts off the vine at a certain depth. From experience, relatively small values around 20 can already give very good results while immensely reducing the memory/time demands of the algorithms (see the sketch after this list).

  2. What standard deviations are you referring to? The ones of all 2800 variables, or the ones of some summary (like a weighted average of variables)?
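
For point 1, a minimal sketch (the constructor-based fitting API is an assumption here; adapt it to your installed version):

```python
import numpy as np
import pyvinecopulib as pv

# Pseudo-observations in (0, 1); d = 50 here so the demo runs quickly,
# but the same controls apply at d = 2800
u = np.random.default_rng(0).uniform(size=(200, 50))

# trunc_lvl=20 keeps only the first 20 trees of the vine; pair copulas
# in deeper trees are set to independence, capping memory and run time
controls = pv.FitControlsVinecop(
    family_set=[pv.BicopFamily.gaussian],  # parametric family to keep the demo fast
    trunc_lvl=20,
)
cop = pv.Vinecop(u, controls=controls)
```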

Harper77777 commented 7 months ago

Hi, thank you for your reply.

1. We want to find the correlations between 2800 variables using pyvinecopulib. However, in our initial attempts the computational resources were insufficient to handle an input of 2800 dimensions, so we could not produce results. We will try the parameter you mentioned. We would also like to confirm whether pyvinecopulib can currently handle such high-dimensional data.

2. Since we initially couldn't process such high-dimensional data at once, we split the 2800 dimensions into 100 sets of 28 variables each and fitted a vine copula separately for each set. After fitting, we concatenated the results using rank-based methods to connect them at the aggregated level. We then compared the fitted data with the real data: while the means are generally similar, there is a significant difference in the standard deviations. By standard deviation we mean that the 2800 variables are aggregated to the next level up (e.g., from county to city level), and the city-level standard deviation of the fitted data is compared with that of the real data.

We look forward to your response and appreciate your assistance.

tnagler commented 7 months ago
  1. The library itself has no restriction on the dimension of the problem; the restrictions come from your computing hardware (memory/time budget). The library has been used with thousands of variables, although mostly with the trunc_lvl parameter, so that's what you should try.

  2. When you aggregate variables but miss some of the dependence (for example, because you split into 100 separate sub-models, effectively treating many things as independent that aren't), the standard deviation can be off. So that's most likely what's happening.
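
A quick numpy illustration of point 2 (equicorrelated Gaussians as a stand-in for your data):

```python
import numpy as np

rng = np.random.default_rng(0)
d, rho, n = 100, 0.3, 100_000

# Equicorrelated unit-variance Gaussians: every pair has correlation rho
cov = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
x = rng.multivariate_normal(np.zeros(d), cov, size=n)

# Same marginals, but all cross-variable dependence dropped, as happens
# implicitly when the model is split into independent blocks
x_indep = rng.normal(size=(n, d))

# Var(sum) = d + d*(d-1)*rho with dependence, but only d without it
print(x.sum(axis=1).std())        # ~ sqrt(3070) ≈ 55.4
print(x_indep.sum(axis=1).std())  # ~ sqrt(100) = 10.0
```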

Harper77777 commented 7 months ago

Alright, thank you for your response. I tried the trunc_lvl parameter, but I'm not quite sure how its value should be chosen. And if the computer hardware can't support constructing a copula over 2800 variables, does that mean there's no other way? Looking forward to your reply.

tnagler commented 7 months ago

You can try the select_trunc_lvl option for automatic selection by the method of this paper.

Harper77777 commented 6 months ago

Thank you. I read your paper and used the following code:

```python
fit_controls = pv.FitControlsVinecop(
    family_set=[
        pv.BicopFamily.gaussian,
        pv.BicopFamily.student,
        pv.BicopFamily.clayton,
        pv.BicopFamily.gumbel,
        pv.BicopFamily.frank,
        pv.BicopFamily.joe,
    ],
    select_trunc_lvl=True,
    selection_criterion="mbic",
)
```

Although my data is high-dimensional, each dimension has only a small sample size of 10. My process is to fit the marginal distributions first and then pass the CDF values into the vine copula fit. After that, I generate 10,000 random samples and transform them back to the original scale. However, even with truncation and mBIC as the model selection criterion, the simulated samples still underestimate the standard deviation in each dimension when compared with the original values. Could you please tell me whether this is an issue with my usage or with the data itself?
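
For reference, a minimal sketch of my full workflow (normal marginals are only a placeholder for illustration; the actual marginal families I fit differ):

```python
import numpy as np
import pyvinecopulib as pv
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 5))  # toy stand-in: n = 10 observations, d = 5

# 1) Fit a marginal distribution per column (normal as a placeholder)
params = [stats.norm.fit(x[:, j]) for j in range(x.shape[1])]

# 2) Probability integral transform to pseudo-observations in (0, 1)
u = np.column_stack([stats.norm.cdf(x[:, j], *p) for j, p in enumerate(params)])

# 3) Fit the vine copula with the controls above
cop = pv.Vinecop(u, controls=fit_controls)

# 4) Simulate and map back through the inverse marginal CDFs
u_sim = cop.simulate(10000)
x_sim = np.column_stack([stats.norm.ppf(u_sim[:, j], *p) for j, p in enumerate(params)])
```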

tnagler commented 6 months ago

The code looks alright. Underestimating the stddev in each dimension has nothing to do with the copula, but with the fitted marginal distributions. In general, with 10 observations per variable and 2800 dimensions, there is very little information that can be extracted reliably.
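
As a quick illustration of how much the marginals alone can contribute (assuming, for the sake of the example, normal marginals fitted by maximum likelihood):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 20_000

# stats.norm.fit returns the MLE of sigma, which divides by n (not n - 1);
# with n = 10 its expected value is only about 0.92 times the true sigma
sigma_hats = [stats.norm.fit(rng.normal(size=n))[1] for _ in range(reps)]
print(np.mean(sigma_hats))  # roughly 0.92, i.e. an ~8% underestimate of sigma = 1
```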

tvatter commented 1 month ago

This issue seems stale and, since it involves the marginal distributions rather than the (vine) copula itself, looks unrelated to pyvinecopulib. I'm closing for now, but feel free to reopen if needed.