jstriaukas / midasml

The midasml package estimates predictive high-dimensional mixed data sampling (MIDAS) models.

cv.panel.sglfit arguments setting #11

Closed. Yuanyuan77-wang closed this issue 1 year ago.

Yuanyuan77-wang commented 1 year ago

The documentation describes "gindex" as a p by 1 vector indicating the group membership of each covariate, but I don't understand how to set it correctly. Could you give me some suggestions?

jstriaukas commented 1 year ago

Hi,

here is the example which is also available in the description file:

set.seed(1)
x = matrix(rnorm(100 * 20), 100, 20)
beta = c(5, 4, 3, 2, 1, rep(0, times = 15))
y = x %*% beta + rnorm(100)
gindex = sort(rep(1:4, times = 5))
cv.panel.sglfit(x = x, y = y, gindex = gindex, gamma = 0.5,
                method = "fe", nf = 10, standardize = FALSE, intercept = FALSE)

If you pool x and y, then 'gindex' is just a vector that contains the group memberships of the columns of x.

Hope this helps.

Jonas

Yuanyuan77-wang commented 1 year ago

Thanks! I would like to confirm: x_a (daily frequency, containing 4 covariates) and x_b (monthly frequency, containing 9 covariates) are both generated by mixed_freq_data_single (I use legendre_degree = 3L). I then rbind x_a and x_b and take that as x, so I should set gindex = sort(rep(1:13, times = 4))? Is this correct?

jstriaukas commented 1 year ago

If I understand correctly, you have 13 high-frequency covariates, which you pool and for which you apply Legendre polynomials of degree 3. In this case, you correctly construct your group membership index 'gindex'.
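
For concreteness, here is a small sketch of the index described above (13 pooled high-frequency covariates, each contributing 4 Legendre-weighted columns when legendre_degree = 3L); column j of x is assumed to belong to group gindex[j]:

```r
# group membership for 13 covariates x 4 Legendre columns each
gindex <- sort(rep(1:13, times = 4))
length(gindex)   # 52 -- must equal ncol(x)
table(gindex)    # 4 columns per group
```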

Yuanyuan77-wang commented 1 year ago

Thanks a lot! I've been running cv.panel.sglfit for a week and it still hasn't produced results, but there are no errors. Considering that x is NT*p (N=200, T=44, p=13), is it because the data is too large? Another confusion I have is about the aggregation of the panel data: I used cbind to aggregate the 200 individuals, is this correct? I was lucky to learn about your research a year ago, and you have always been eager to answer my questions, which has helped me a lot and encouraged me to try. You are so kind, thank you again!

jstriaukas commented 1 year ago

regarding pooling panel data: the way I do it is, for each x_ik, construct the MIDAS lags using mixed_freq_data_single, apply the Legendre polynomials, rbind across k, then cbind across i. You end up with the pooled MIDAS x.
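
To make this step concrete, here is a rough sketch of one way to end up with the pooled NT x (p*L) matrix discussed in the thread (not necessarily the author's exact code). The raw list and toy dimensions are placeholders, and the output shape of lb() is my assumption; in practice each raw[[i]][[k]] would hold the high-frequency lag matrix produced by mixed_freq_data_single for covariate k of unit i.

```r
library(midasml)

set.seed(1)
N <- 3; p <- 2; T.low <- 44; n.lags <- 12     # toy dimensions

# raw[[i]][[k]]: T.low x n.lags matrix of high-frequency lags (placeholder data here)
raw <- lapply(1:N, function(i)
  lapply(1:p, function(k) matrix(rnorm(T.low * n.lags), T.low, n.lags)))

# Legendre weights; assumed to come back as n.lags x (degree + 1)
W <- lb(degree = 3L, jmax = n.lags)
if (nrow(W) != n.lags) W <- t(W)              # guard: the orientation of lb()'s output is an assumption
L <- ncol(W)

# combine the Legendre-weighted lags of the p covariates within one unit: T.low x (p*L)
pool_unit <- function(unit) do.call(cbind, lapply(unit, function(xk) xk %*% W))

# stack the N units: (N*T.low) x (p*L) pooled MIDAS design matrix
x_pooled <- do.call(rbind, lapply(raw, pool_unit))

# matching group index: L adjacent columns per covariate
gindex <- sort(rep(1:p, times = L))
dim(x_pooled); length(gindex)
```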

regarding computation time: try panel.sglfit first, without cv. If it is still stuck, check x: do you have very large outliers? NAs? Other irregularities? If it is the cv that makes it so long (which I doubt, but it could be...), try using ic (information criteria). Since in panels you have a huge sample size of NT (in your case 200*44), ic should work ok.
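
As a rough illustration of these checks, reusing x, y, gindex from the example earlier in the thread (the argument names are mirrored from the cv.panel.sglfit example; panel.sglfit's exact signature may differ slightly):

```r
# quick sanity checks on the design matrix
any(is.na(x)); any(!is.finite(x))   # NAs or infinite values?
summary(as.vector(x))               # eyeball the scale and extreme values
boxplot(as.vector(x))               # visual check for large outliers

# fit once without cross-validation to see whether the solver itself is slow
fit <- panel.sglfit(x = x, y = y, gindex = gindex, gamma = 0.5,
                    method = "fe", nf = 10,
                    standardize = FALSE, intercept = FALSE)
```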

let me know how it goes, and if you find out what causes the HUGE computation times, please let me know. I will then incorporate a fix into the package...

Jonas

Yuanyuan77-wang commented 1 year ago

I will try again and then give feedback in time. THANK YOU for your generous help!

jstriaukas commented 1 year ago

another thing that might speed up your computations is to set the lambda parameters less tight: nlambda = 20, lambda.factor = 1e-02 (the defaults are nlambda = 100, lambda.factor = 1e-04).

then, you need to check that the cvm output gives you a curve whose minimum is not at either end of the lambda values.
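
A sketch of what that looks like, again reusing x, y, gindex from the earlier example; the slot names cvm and lambda in the returned object are my assumption about where the CV curve is stored:

```r
cvfit <- cv.panel.sglfit(x = x, y = y, gindex = gindex, gamma = 0.5,
                         method = "fe", nf = 10,
                         nlambda = 20, lambda.factor = 1e-02,   # looser grid than the defaults (100, 1e-04)
                         standardize = FALSE, intercept = FALSE)

# the CV-error minimum should sit strictly inside the lambda grid,
# not at either end of it
which.min(cvfit$cvm)
plot(log(cvfit$lambda), cvfit$cvm, type = "b",
     xlab = "log(lambda)", ylab = "CV error")
```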

Yuanyuan77-wang commented 1 year ago

Thanks for your reminder. I tried a smaller N first: ic. can produce the result, but cv. can't. Then I used N=200, and it's still running (I used ic. this time). I will try other lambda settings next. THANK YOU!

jstriaukas commented 1 year ago

First, try to run sglfit with the settings I mentioned. Since you work with a fairly large and long panel and lambda.factor is set to 1e-4, I would guess you end up estimating models that are too dense, which take much more time to fit.

Thus, playing around with lambda.factor and nlambda will give you a good understanding of which values lead to fast, yet not too restrictive, estimation.

So, kill all the running code and try my suggestions.



Yuanyuan77-wang commented 1 year ago

I'll take your suggestions and give feedback if anything happens. THANK YOU!!!

Curley-l commented 1 year ago

Hi jstriaukas, thank you so much for sharing this useful R package. Recently, I have found that the cv.sglfit and tscv.sglfit functions take a long time to estimate the model; I use 23 explanatory variables from 2002 to 2021 to forecast GDP, so the dataset is not very large. Also, I found that the lasso method is somewhat sensitive to the values of lambda: the estimation is not accurate enough if I reduce nlambda. May I ask whether there is any way to improve the running speed of these functions? Thank you so much.

jstriaukas commented 1 year ago

try checking whether you have large outliers in the data. Typically, this is the problem.

As for accuracy, LASSO is solved using coordinate descent and warm starts. How do you measure the optimization accuracy for empirical data (I ask because measuring estimation accuracy would require knowing the true DGP)?
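
For readers unfamiliar with the solver, a textbook cyclic coordinate-descent pass for a plain lasso objective looks roughly like the sketch below; it is only a generic illustration, not the package's actual Fortran routine.

```r
# one pass of cyclic coordinate descent for  min_b  0.5/n * ||y - X b||^2 + lambda * ||b||_1
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding operator

cd_pass <- function(X, y, b, lambda) {
  n <- nrow(X)
  for (j in seq_len(ncol(X))) {
    r_j  <- y - X[, -j, drop = FALSE] %*% b[-j]         # partial residual excluding column j
    b[j] <- soft(sum(X[, j] * r_j) / n, lambda) / (sum(X[, j]^2) / n)
  }
  b
}
# warm starts: along a decreasing lambda path, the solution at the previous
# lambda is used as the starting b for the next one, which keeps each fit cheap.
```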


Yuanyuan77-wang commented 1 year ago

Yeah, the dataset is an important factor. I re-cleaned my dataset, and now ic.panel.sglfit runs faster, but the cv is still slow; I guess the panel data cross-validation computation is just too large.

Curley-l commented 1 year ago

Thank you so much for your quick response. Maybe the monthly macroeconomic data for 2020 are considered outliers because of the COVID-19 outbreak, but the cv is still slow when estimating with the data up to 2019. Yes, I use the real GDP data and the estimates to calculate the root mean square error to measure accuracy. Also, I use the Factor MIDAS model proposed by Marcellino and Schumacher (2010) as the benchmark model. I found that the Factor Unrestricted MIDAS model performs better than the sg-LASSO MIDAS model. That's why I would like to increase the value of nlambda to try other possible outcomes.

jstriaukas commented 1 year ago

I am currently running another application using a panel data set that is similar in dimension to yours. I run ~30 quarters out-of-sample, run cv + ic, also pooled, fixed effects, and another variation of the panel model, and get ALL results within one day.

If your data is not proprietary, I'm happy to check your data/code to improve your estimation :)

jstriaukas commented 1 year ago

Interesting finding. FA-UMIDAS might work better for your data. For our JBES paper application, which I guess is similar to yours in terms of sample sizes (which is what matters for speed), all out-of-sample results are computed in less than an hour, i.e., all horizons, gamma = {0, 0.01, ..., 1} (101 gammas) and 100 lambdas per gamma with 5-fold cv.
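
For reference, the tuning grid described above could be looped over roughly as follows, reusing x, y, gindex from the example earlier in the thread; the hand-written gamma loop and the nfolds argument name are my assumptions, not necessarily how the paper's code does it:

```r
gammas <- seq(0, 1, by = 0.01)    # 101 values of the sg-LASSO mixing parameter
fits <- lapply(gammas, function(g)
  cv.sglfit(x = x, y = y, gindex = gindex, gamma = g,
            nlambda = 100, nfolds = 5))
```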

Yuanyuan77-wang commented 1 year ago

It's an honor that you are willing to help me check the data; I will send it to your Gmail later.

jstriaukas commented 1 year ago

@Yuanyuan77-wang: write to me at jonas.striaukas@gmail.com — thanks

Yuanyuan77-wang commented 1 year ago

I'm so sorry, I may have failed to send it yesterday and have resent it to you. Thanks.

Owlundermoon commented 1 year ago

Hello jstriaukas,

Recently, I have been reading your paper "Machine Learning Time Series Regressions With an Application to Nowcasting" and trying to reproduce the Monte Carlo simulation results. I have found that when the sample size is 200 and the Legendre degree is 10, the cv.sglfit function takes more than 60 times longer to produce results than in the baseline case or with Legendre degree 5.

I was wondering if you had encountered this issue when conducting the simulation, and if you have any suggestions for how to speed up the cv.sglfit function. Any advice or insight you can provide would be greatly appreciated. Thank you very much for your time and consideration.

jstriaukas commented 1 year ago

Hello,

I did not have such issues. Ofc, degree 10 leads to longer compute times, but linearly with the increase in the number of parameters.

I guess there is some issue in your code which leads to prolonged computation times.



Owlundermoon commented 1 year ago

Thank you very much for your response. I have checked my code and did not find any potential issues. However, I was able to reproduce the results for the baseline scenario in your paper, so the code for the baseline scenario should be correct. As for the case with degree 10, I believe I just need to change the degree parameter L from 3 to 10 in the baseline code. Could you please confirm if my understanding is correct?

Also, I am a graduate student in Zhongnan University of Economics and Law from China, and I am very interested in the sparse group lasso method mentioned in your paper. If possible, could you please share with me the code for the Monte Carlo simulation section of the paper, so that I can better understand the sparse group lasso method in your paper? If it is not possible, I completely understand and still appreciate your help.

jstriaukas commented 1 year ago

yes, you should only need to change L to 10. That is strange.

I am away and have it on my hard drive so I could only check it in the summer, unfortunately.

FYI, L=10 is not particularly interesting for MIDAS, as L=3 is enough to estimate the weights (w's) accurately.


Owlundermoon commented 1 year ago

Thank you very much for your help with this issue.


Owlundermoon commented 1 year ago

Hi jstriaukas, I have a few questions concerning your paper "Machine Learning Time Series Regressions With an Application to Nowcasting".

Firstly, in the empirical section of the paper (Section 5), where you analyze the U.S. quarterly GDP data, I am curious about the criteria you used to choose 12 lags for the monthly macroeconomic data, a Legendre polynomial of degree 3, and an autoregressive order of 5 for the low-frequency quarterly data. These details do not appear to be elaborated upon in the text.

Secondly, I find your methodology quite intriguing and would like to attempt a similar analysis using Chinese data. However, I have encountered some challenges with my code, as the results do not align with my expectations. Additionally, when attempting the Monte Carlo simulation presented in the fourth section of the paper, my results closely approximate the mean squared forecast error (MSFE) reported in Table A of the appendix, but there is a substantial difference in the standard errors. Therefore, if possible, would you be willing to share the R code used for this section of the paper? Thank you very much for your time and consideration.

jstriaukas commented 1 year ago

How do you compute 'standard errors'?

Owlundermoon commented 1 year ago

I use the formula for the sample standard error: the sample standard error equals the sample standard deviation divided by the square root of the sample size. In this paper, if the predicted sample size is 50, it is the sample standard deviation of the 50 predicted values divided by the square root of 50. Since 5000 simulations were also conducted, the standard errors from the 5000 model runs were averaged to obtain the final result.
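
In R, the computation described above would look roughly like this (preds is a hypothetical 5000 x 50 matrix holding the 50 predicted values of each Monte Carlo replication; random numbers stand in for the actual simulation output):

```r
set.seed(1)
preds <- matrix(rnorm(5000 * 50), nrow = 5000, ncol = 50)   # placeholder for the simulated predictions

se_per_rep <- apply(preds, 1, function(p) sd(p) / sqrt(length(p)))  # sd of 50 values / sqrt(50)
mean(se_per_rep)                                                    # averaged over the 5000 replications
```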