wbonat / mcglm

Fitting multivariate covariance generalized linear models
GNU General Public License v3.0

data structure coding recommendations, and link functions #16

Closed bhurley6ehs closed 4 years ago

bhurley6ehs commented 5 years ago

Hi!

I have 8 response variables that I will be testing against an independent variable coded as "status". This status variable is ordinal (I think R calls this an ordered factor) with 4 levels. Alternatively, I have dummy-coded it into four separate binary columns. My question is: is one way better than the other within mcglm?

Secondly, six of my eight response variables are binary 1/0. I THINK mcglm wants me to code these as numeric, but I wanted to check first. Another is "number of years in the project" (max of 8), so the entire range is 0 to 8 years. If you can let me know how to code this as well, that would be great. Finally, I have another ordered factor variable, coded 0, 1, 2, 3, increasing.

Finally, what is the typical link and variance function for binary 1/0 data? I assume logit or probit?

Is there a resource where I can find information on the various link and variance functions?

Thank you for all your help!

bhurley6ehs commented 5 years ago

Hi all;

Just checking in on this.

wbonat commented 5 years ago

Hi!

> I have 8 response variables that I will be testing against an independent variable coded as "status". This status variable is ordinal (I think R calls this an ordered factor) with 4 levels. Alternatively, I have dummy-coded it into four separate binary columns. My question is: is one way better than the other within mcglm?

If I understood correctly, dummy coding is probably the most conventional strategy, but it depends on your goals.
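As background to this answer: R expands a factor into dummy columns by itself when it builds a design matrix, so the two codings often describe the same model. The sketch below uses base R only and made-up data; the variable names and level labels are hypothetical, and it does not call mcglm.

```r
# Toy status variable with four levels (hypothetical data, not the real file)
status <- factor(c("A", "B", "C", "D", "A", "C"),
                 levels = c("A", "B", "C", "D"))

# Unordered factor: treatment contrasts, first level is the reference
X <- model.matrix(~ status)
colnames(X)

# Ordered factor: R switches to polynomial contrasts (.L, .Q, .C) by default
status_ord <- factor(status, ordered = TRUE)
X_ord <- model.matrix(~ status_ord)
colnames(X_ord)
```

With an unordered factor the coefficients compare each level to the reference level, which is what three hand-made dummy columns would give. With an ordered factor the coefficients describe linear, quadratic and cubic trends across the levels, so the interpretation differs even though the fit is equivalent.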

> Secondly, six of my eight response variables are binary 1/0. I THINK mcglm wants me to code these as numeric, but I wanted to check first. Another is "number of years in the project" (max of 8), so the entire range is 0 to 8 years. If you can let me know how to code this as well, that would be great. Finally, I have another ordered factor variable, coded 0, 1, 2, 3, increasing.

For coding, see my paper in JSS (Journal of Statistical Software).

> Finally, what is the typical link and variance function for binary 1/0 data? I assume logit or probit?

Again, it depends on the kind of interpretation you want. The most popular is the logit link function.
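To make the two candidate links concrete, here is a minimal base-R sketch of the logit and probit links, their inverses, and the usual binomial variance function for a 0/1 response. It only illustrates the formulas; it does not use mcglm, and the probabilities are arbitrary example values.

```r
mu <- c(0.1, 0.5, 0.9)       # example success probabilities

eta_logit  <- qlogis(mu)     # logit link: log(mu / (1 - mu))
eta_probit <- qnorm(mu)      # probit link: standard normal quantile of mu

# Inverse links map the linear predictor back to probabilities
mu_from_logit  <- plogis(eta_logit)
mu_from_probit <- pnorm(eta_probit)

# Binomial variance function for a binary response
v <- mu * (1 - mu)
```

The interpretation point is that exponentiated logit coefficients are odds ratios, whereas probit coefficients are shifts on a standard normal scale.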

> Is there a resource where I can find information on the various link and variance functions?

In my JSS paper I discuss which link functions are implemented. mcglm is just an extension of the GLM, so everything you know about GLMs also holds for mcglm.

bhurley6ehs commented 5 years ago

Thank you very much for the answer!

A few errors I am getting include:

For the "number of years in project" column, when trying to add this to an MCGLM:

    Error : beta contains 1 missing values
    Error: $ operator is invalid for atomic vectors

For the binary response variables (Do you have a scholarship? 0/No, 1/YES):

    Error in family$linkfun(mustart) : Argument mu must be a nonempty numeric vector
    In addition: Warning message:
    In Ops.factor(model.response(temp), Ntrial[[i]]) : ‘*’ not meaningful for factors

bhurley6ehs commented 5 years ago

**Wagner:

Follow-up: As I had asked earlier, a few of my multivariate response variables are 1/0. I have tried all day to have MCGLM accept them as factor-type data, to no avail. The error reported is:**

    Error in family$linkfun(mustart) : Argument mu must be a nonempty numeric vector
    In addition: Warning message:
    In Ops.factor(model.response(temp), Ntrial[[i]]) : ‘*’ not meaningful for factors

**It WILL accept them as numeric-type data.

I guess my question is: If my response variable is binary (yes/no, 1/0), will type numeric work within MCGLM? Or should it absolutely remain type factor?**
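The warning quoted above (`‘*’ not meaningful for factors`) shows the response being multiplied by `Ntrial`, which suggests the fitting code needs a numeric response, consistent with the observation that numeric columns are accepted. A minimal base-R sketch of converting a 0/1 factor to numeric follows; the variable name is hypothetical and the data are made up.

```r
# Hypothetical binary response stored as a factor with labels "0" and "1"
schol_factor <- factor(c("0", "1", "1", "0", "1"))

# Pitfall: as.numeric() on a factor returns the internal level codes (1, 2)
codes <- as.numeric(schol_factor)

# Going through as.character() recovers the original 0/1 values
schol_numeric <- as.numeric(as.character(schol_factor))
```

The information in the column is the same either way; only the storage type changes.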

bhurley6ehs commented 5 years ago

Hello again:

I have gotten my data into MCGLM! An error that is arising is this:

Error in .local(x, ...) : internal_chm_factor: Cholesky factorization failed

I think this is because one of my variables has a lot of zeroes (30 "1's" and 147 "0's" out of 177 cases).

Do you think this may be the case, and if so, are there any workarounds?

wbonat commented 5 years ago

Probably not. It is a common algorithm convergence error. If you can provide a reproducible example, I can try to help you.

-- Dr. Wagner Hugo Bonat

Programa de Especialização em Data Science e Big Data Laboratório de Estatística e Geoinformação (LEG) Universidade Federal do Paraná (UFPR)
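As general background on the Cholesky error discussed above (not a statement about mcglm internals): a Cholesky factorization only exists for a positive definite matrix, so the factorization can fail if an iterative fit reaches parameter values that produce a covariance matrix without that property. The base-R sketch below reproduces that kind of failure on a toy matrix; the matrices are arbitrary examples.

```r
# A symmetric matrix that is NOT positive definite (eigenvalues 3 and -1)
m_bad <- matrix(c(1, 2,
                  2, 1), nrow = 2)

# chol() raises an error here; tryCatch() turns it into a marker value
res <- tryCatch(chol(m_bad), error = function(e) "failed")

# A positive definite matrix factorizes without trouble
m_ok <- matrix(c(2, 1,
                 1, 2), nrow = 2)
R <- chol(m_ok)          # upper triangular factor with t(R) %*% R == m_ok
```

This is why the error is better read as a convergence symptom than as a direct consequence of how many zeros a response contains.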

bhurley6ehs commented 5 years ago

I would be happy to! Would you like the .CSV, the variables in the question, and the question being asked itself?

wbonat commented 5 years ago

Yes.

bhurley6ehs commented 5 years ago

GitHub doesn't support .CSV, so here is the .XLSX.

We are testing whether the variable LAST_LEGAL (legal status of the respondent, either left as a FACTOR variable or dummy-coded into four 1/0 columns: CIT_BINARY, TPS_BINARY, DACA_BINARY, UNDOC_BINARY) has any effect on the following columns (data structure in parentheses):

- YRS_DP (years in the project, as an INTEGER)
- STEM (did the respondent major in a STEM field; binary FACTOR, 0/1; 0 NO, 1 YES)
- FIN_AID (level of financial aid given; 0, 1, 2 as an ORDERED FACTOR FIN_AID, or dummy-coded into three 0/1 columns: FIN_AID_0, FIN_AID_1, FIN_AID_2)
- C_SCHOL (did the respondent receive a college scholarship; binary FACTOR, 0/1; 0 NO, 1 YES)
- HS_SCHOL (did the respondent receive a high school scholarship; binary FACTOR, 0/1; 0 NO, 1 YES)
- F_SCHOL (did the respondent receive a Foundation scholarship; binary FACTOR, 0/1; 0 NO, 1 YES)
- DS_BENE (level of benefits the respondent received from the project; 1, 2, 3 as an ORDERED FACTOR DS_BENE, or dummy-coded into three 0/1 columns: DS_BENE_1, DS_BENE_2, DS_BENE_3)
- PJ_COLL (did the respondent have a paid job in college; binary FACTOR, 0/1; 0 NO, 1 YES)

The question being asked: Does LAST_LEGAL (either as a FACTOR or dummy-coded) have any effect on the variables listed above?

Cut_Down_for_r.xlsx https://github.com/wbonat/mcglm/files/3462488/Cut_Down_for_r.xlsx

wbonat commented 5 years ago

Hi! The following code should help you. For some reason I could not fit the model using all your covariates. It is probably because you do not have enough data in all classes. mcglm does not support multinomial or missing data. For the other cases, the code is below.

    # Load the data set
    data <- read.table("Exe.csv", header = TRUE, dec = ".", sep = ",")
    head(data)

    # Check column classes
    sapply(data, class)

    # Load the mcglm package
    library(mcglm)

    # Linear predictor for each response
    yrs    <- YRS_DP   ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY
    stem   <- STEM     ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY
    cshol  <- C_SCHOL  ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY
    hsshol <- HS_SCHOL ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY
    fshol  <- F_SCHOL  ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY
    pjshol <- PJ_COLL  ~ LAST_LEGAL + CIT_BINARY + TPS_BINARY

    # Missing values are not supported, so drop incomplete rows
    data_nomiss <- na.exclude(data)

    # Matrix linear predictor (identity structure)
    Z0 <- mc_id(data_nomiss)

    # Multivariate model fit: log link + Tweedie variance for the years
    # response, logit link + binomialP variance for the five binary responses
    fit <- mcglm(c(yrs, stem, cshol, hsshol, fshol, pjshol),
                 matrix_pred = list(Z0, Z0, Z0, Z0, Z0, Z0),
                 link = c("log", rep("logit", 5)),
                 variance = c("tweedie", rep("binomialP", 5)),
                 control_algorithm = list(verbose = TRUE, max_it = 100),
                 data = data_nomiss)
    summary(fit)

bhurley6ehs commented 5 years ago

Thank you very much for this! I will probably be able to run this on Monday and see if there are any follow-up questions.

I would like to ask: when you run a summary of the analysis, it produces a large table with variables named "RhoXXX", such as "Rho210" ... "Rho186", etc.

What exactly are these?

wbonat commented 5 years ago

These are the correlations between responses. For example, rho12 is the correlation between the first and second response variables.
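To see why the table gets large: a model with k responses has one correlation per pair of responses, so k(k - 1)/2 of them. The base-R sketch below enumerates the pairs for the six-response model fitted earlier in this thread. The label format simply follows the "rho12" example in the answer; the exact names printed by `summary()` are not checked here and may differ.

```r
k <- 6                                  # number of responses in the fitted model
pairs <- combn(k, 2)                    # every pair (i, j) with i < j, one per column
rho_labels <- paste0("rho", pairs[1, ], pairs[2, ])

n_rho <- length(rho_labels)             # k * (k - 1) / 2
rho_labels
```

Each label pairs the positions of two responses in the order they were passed to the model, so the first digit-pair in the table refers to the first two formulas in the list.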

bhurley6ehs commented 5 years ago

> For some reason I could not fit using all your covariates. It is probably because you do not have enough data in all classes. mcglm does not support multinomial and missing data.

Could this be because if the other three variables are 0, the final fourth variable is always 1? Put another way, any combination of three variables will always perfectly predict the final fourth variable?

wbonat commented 5 years ago

Yes! Exactly!
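The redundancy confirmed here can be checked numerically: when every respondent falls in exactly one of four groups, the four dummy columns sum to the intercept column, so the design matrix loses a rank. A minimal base-R sketch with made-up data follows; the level names are borrowed from the column names in this thread, and nothing here calls mcglm.

```r
# Toy data: each respondent belongs to exactly one of four status groups
status <- factor(c("CIT", "TPS", "DACA", "UNDOC", "CIT", "DACA", "TPS", "UNDOC"))
dummies <- model.matrix(~ status - 1)            # one 0/1 column per level

row_totals <- rowSums(dummies)                   # always 1, same as the intercept

X_all   <- cbind(Intercept = 1, dummies)         # intercept + all four dummies
X_three <- cbind(Intercept = 1, dummies[, -1])   # intercept + three dummies

rank_all   <- qr(X_all)$rank     # 4, although X_all has 5 columns
rank_three <- qr(X_three)$rank   # 4, and X_three has 4 columns (full rank)
```

Dropping any one dummy (or the intercept) removes the redundancy; the dropped level then acts as the reference category.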

bhurley6ehs commented 5 years ago

Hi Wagner:

I managed to get some results (attached), and wanted to see if my interpretations were correct.

FYI, I added the DACA_BINARY column and removed the TPS_BINARY and LAST_LEGAL columns from the independent variables.

In the summary, the only significant effects I can see are DACA_BINARY on YRS_DP, and DACA_BINARY and CIT_BINARY on PJ_COLL.

Could you help me interpret the ANOVA output? I think I understand the rho terms.

output_MCGLM.txt