Poisson WQS Regression Unstable Indices and Rsolnp issues

phargarten2 commented 5 years ago

From: Ariane Guilbert, Université de Bordeaux Date: May 10, 2019, 11:40 AM

Dear Mr. Hargarten,

I use your miWQS package and its estimate.wqs() function to study the relationships between several endocrine disruptors and child neurodevelopment (test score). Based on the distribution of my dependent variable, I apply the family option “Poisson” (as an approximation of the negative binomial distribution). However, I face some problems. If I run several times the estimate.wqs() function:

I get wqs index values very unstable, sometimes highly significant, sometimes not (I did not set a seed to check the stability of the results);

I frequently get the following error message which aborts the calculations

Error in if (mac(tol + tt[1], tt[2] - tt[3])) <= 0) { :
 missing value where TRUE/FALSE needed.

What might be the reasons for such issues: the approximation of the negative binomial distribution of the outcome by a Poisson distribution? Exposure variables weakly related to the outcome? Exposure variables having opposite effects on the outcome (while beta1 is expected to be either positive or negative)?

I thank you very much for your answer and for having developed this tool.

phargarten2 commented 5 years ago

From: Paul M. Hargarten To: Ariane Guilbert, Université de Bordeaux Date: May 17, 2019 5:30 PM

Mr. Guilbert,

Thank you for using the miWQS package in your research and letting me know of some potential issues. First, I believe that the error message you are receiving

Error in if (max(c(tol + tt[1], tt[2] - tt[3])) <= 0) { 
missing value where TRUE/FALSE needed}.

is from the solnp() function in the Rsolnp package, which is used in estimating the weights and other WQS parameters in the training set. I suggest that you use traceback() and contact the authors of the Rsolnp for more help: Alexios Ghalanos alexios@4dscape.com and Stefan Theussl. I can't seem to find the error in the Rsoln code upon searching the function. If the code is in an internal function of Rsolnp, use the getAnywhere() function to access it.

Previous simulation studies[1, 2] have shown several results that might help your research:

Bootstrapping has improved the variable selection --I believed that you are estimating WQS parameters with bootstrapping as you got different answers. Are you setting the seed before you bootstrap?
WQS regression assumes that the risk is constant between the quantiles.
WQS regression performs poorly with high correlation among exposure variables and a weak association with the outcome. Results improve as the correlation among exposure variables decrease, and the association with the outcome is stronger. So, are your exposures highly correlated?
The constraint of the mixture exists such that the mixture and the outcome have the same direction and avoid the reversal paradox. It could very well be that some exposures have a negative association with the outcome, while others have a positive association. If the constraint is incorrect, the mixture coefficient will go to zero, and no effect will be estimated -- hence the instability. Did you do a series of individual exposure analyses? I was recently working on a function, attached, that is just doing that, and may aid you in finding the appropriate sign.

A few more thoughts:

Please check your sample size; it may be small, which could explain the unstable results.
There is an offset available to you so you are able to model rates. The offset argument of estimate.wqs() function is on the normal scale, so please do not take a logarithm.

Thank you again for using the package and best of luck to your research! I am working on many updates to the package, and I hope to release it to CRAN soon. Let me know if this helps or if you have more questions.

References:

Carrico, C., Gennings, C., Wheeler, D. C., & Factor-Litvak, P. (2014). Characterization of Weighted Quantile Sum Regression for Highly Correlated Data in a Risk Analysis Setting. Journal of Agricultural, Biological, and Environmental Statistics, 20(1), 100–120. https://doi.org/10.1007/s13253-014-0180
Czarnota, J., Gennings, C., & Wheeler, D. C. (2015). Assessment of Weighted Quantile Sum Regression for Modeling Chemical Mixtures and Cancer Risk. Cancer
Informatics, 14, 159–171. https://doi.org/10.4137/CIN.S17295

phargarten2 commented 5 years ago

From: Ariane Guilbert, Université de Bordeaux Date: Tue, May 21, 2019 at 4:02 AM

Dear Mr Hargarten,

Thank you very much for your very helpful answer and sorry for my late reply (I had a meeting with my manager yesterday afternoon and this morning to discuss the topic). Following your remarks:

error coming from the solnp() function in the Rsolnp package : I feel the same way knowing I have a warning message referring to this function when I use the gWQS package on my data (after categorization of the dependent variable). I will contact Alexios Ghalanos and Stefan Theussl.
bootstrapping and seed: I did not set a seed, I wanted to check the stability of the results first.
most of my compounds of interest show a correlation below 0.70 except for some phthalates. I will make a test using the molar sum of these problematic compounds.
based on the "classical" regressions (one model per compound), few compounds appear significantly associated with the outcome.
regarding the constraint on beta, I will try to run your function separately, for compounds showing positive beta and compound showing negative beta in the "classical" regressions. I will also try your function for individual analysis.
my sample size is rather small 100-200 individuals due to missing data on the covariates. I will refine the covariates definition/selection in order to get the best of my data.
I am sorry but I am not sure to understand your last remark: "There is an offset available to you so you are able to model rates. The offset argument of estimate.wqs() function is on the normal scale, so please do not take a logarithm".

Ariane.

phargarten2 commented 5 years ago

Mr. Guilbert: Thank you for your prompt response and some notes.

To your second question, the small sample size, the weak association of the compounds of interest with the outcome, and a mixture of both protective and at-risk chemicals could explain the instability of the WQS index. You are on the right track by doing two different WQS analyses: (1) the protective chemicals (compounds showing negative beta) and the at-risk chemicals (compounds showing positive beta). There should be no need to combine your chemicals; from what you have told me, your correlation pattern may match those already in the literature.

Our function, estimate.wqs(), is not designed for individual analyses; the WQS regression approach assesses a mixture of correlated components with the outcome. A reminder that the missing values in the endocrine disruptors are assumed to all be under some lower bound and are by default placed into the first quantile of the weighted sum. Any missing values in the outcome and covariates are removed before WQS analysis.

My last remark is that estimate.wqs() can model rates using an offset in Poisson regression. All I meant was that in traditional Poisson regression with the offset, the logarithm is taken. In estimate.wqs(), it is not. The likelihood that is maximized consists in part of:

 lambdai <- offset * exp(eta)

, where eta is the linear predictor and lambdai is the mean count for the ith subject.

ArianeGui commented 5 years ago

Dear Mr Hargarten,

When I wrote "I will also try your function for individual analysis" I was referring to the function "analyze_individually.R" you sent me by email (not "estimate.wqs()).

Regarding your last remark, my dependent variable is a test score so I am not affected by the offset term?

Best regards.

phargarten2 commented 5 years ago

Mr. Guilbert:

Thank you for the clarification -- I follow you now.

No, by default, there is no offset. The default value, NULL, is converted to a vector of 1's. It all depends if your test score is a rate or not.

Best wishes.

phargarten2 commented 5 years ago

@ArianeGui wrote this:

Dear Mr. Hargarten,

I contacted you some weeks ago about your miWQS package. I was encountering some issues when running the estimate.wqs function with the Poisson argument (I was frequently getting the following error message, which aborted the calculations ...

Error in if (mac(tol + tt[1], tt[2] - tt[3])) <= 0) { missing value where TRUE/FALSE needed)}

The problem has been solved by refining the definition of my covariates (and so decreasing the number of missing values), grouping some of the exposures by summing their molar concentrations, using tertiles instead of quartiles and by running the function separately, for compounds showing positive beta and compounds showing negative beta in the uni-pollutant regressions (based on negative binomial regression).

However, I am still a bit troubled by one point: when I run the estimate.wqs function with the compounds showing a negative beta in the uni-pollutant regressions and so using the b1.pos = FALSE argument, I still very frequently get a wqs beta index positive. Is it normal?

Thank you very much in advance for your reply.

Best regards.

Ariane Guilbert.

phargarten2 commented 5 years ago

@ArianeGui :

Thank you for faithfully using the miWQS package.

When setting b1.pos = FALSE, did you find a positive WQS beta index in the training set or in the validation set? The beta1 estimate in the training dataset, by looking at element, train.estimates, of an estimate.wqs() object should all be below 0.

Also, have you heard back about the Rsolnp issue or did you no longer need to since you avoided the error by taking several steps as described above?

Thank you, Paul Hargarten

ArianeGui commented 5 years ago

When setting b1.pos = FALSE, did you find a positive WQS beta index in the training set or in the validation set?

I mean the final WQS beta index you get after running the estimate.wqs() function.

Also, have you heard back about the Rsolnp issue or did you no longer need to since you avoided the error by taking several steps as described above?

I sent an email to the author of the function but did not get a feedback. I should contact him again.

Thank you for your reply.

Ariane Guilbert.

phargarten2 commented 5 years ago

Sorry for the delay. If your beta1 log WQS estimates for protective estimates are really high, I think that you are right to question the results. But...the high estimate may be due to small sample size or a big variability of log WQS estimates (is your "positive" effect within 1 standard error?). If there is truly no mixture effect, the log WQS estimate may be positive by chance, so small positive effects are not alarming.

A proxy simulation study using a binary response shows a protective effect that one would expect. The main results are below. The "est.odds" are the Odds Ratios estimated from WQS regression; the "true.odds" are the true odd ratios simulated. The covariates (Z.sim) and the chemicals (X) came from the example dataset in the WQS package, simdata87. Using this data, the outcome had to be re-simulated to show a protective effect.

	est.odds	truth.odds
(Intercept)	0.482	0.287
Age	0.960	1.033
Female	0.818	0.972
Hispanic	1.571	1.716
Non.Hispanic_Others	0.905	1.127
WQS	0.699	0.750

Function calls and main results using print().

> fit.protective <- estimate.wqs(
+   y = y, X = X.true, Z = Z.sim,  
+   proportion.train = 0.5, n.quantiles = 4,
+   place.bdls.in.Q1 = FALSE,
+   B = 100,
+   b1.pos = FALSE, #*********
+   signal.fn = "signal.converge.only",
+   family = "binomial",
+   verbose = FALSE
+ )
#> No missing values in matrix detected. Regular quantiles computed.
> fit.protective

Odd Ratios & 95% CI (N.valid = 500) 
                     Odds Ratio  SE.OR                95% CI  P-value
(Intercept)               0.482   1.48  0.482 (0.223, 1.041)    0.063
Age                       0.960   1.07  0.960 (0.842, 1.094)    0.538
Female                    0.818   1.26  0.818 (0.518, 1.291)    0.388
Hispanic                  1.570   1.26  1.571 (0.996, 2.476)    0.052
Non.Hispanic_Others       0.905   1.32  0.905 (0.527, 1.554)    0.718
WQS                       0.699   1.18  0.699 (0.503, 0.971)    0.033
AIC:  511.0522 

All (100) bootstraps have converged. 

 Weights Adjusted by signal.converge.only using N.train = 500 observations: 
pentachlorophenol            pcb_118    alpha.chlordane                ddt       methoxychlor            pcb_170  
           0.3561             0.1527             0.1126             0.0715             0.0678             0.0657  
          pcb_105           dieldrin            pcb_180            pcb_138                dde            pcb_153  
           0.0478             0.0295             0.0295             0.0247             0.0203             0.0094  
          lindane    gamma.chlordane  
           0.0083             0.0040  
Important chemicals defined as mean weights > 1/14~0.071.

ArianeGui commented 4 years ago

Thank you very much for your detailed reply. It is my turn to apologize for the delayed answer. I took me time to make some complementary analyses. I confirm that my positive effects are small when I set Beta1 to be protective.

I have another (and I hope last :-s ) question for you. I have some missing data in my exposure variables (the real values could be below or above the LOD). If I understand well the estimate.wqs function and the default parameter "place.bdls.in.Q1 = if (anyNA(X)) {TRUE } else {FALSE }", these missing values are automatically placed in the first quantile of the weighted sum. Instead, how can I exclude these missing values from the calculations (without excluding the whole rows that may contain data for other pollutants). I tried : "place.bdls.in.Q1 = if (anyNA(X)) {FALSE } else {FALSE }" but I get exactly the same result as with the default statement. Thank you very much in advance for your reply. Best regards.

phargarten2 commented 4 years ago

@ArianeGui
The WQS procedure does not handle missing data. If you set place.bdls.in.Q1 = TRUE, the missing values are automatically placed in the first quantile of the weighted sum, as you have stated. Setting it to FALSE implies that the data is complete; you need to remove all individuals with missing pollutants before running estimate.wqs().

Additionally, I am submitting a new version to CRAN that has included new functions and fixed some bugs found in the package since its debut late last year.

ArianeGui commented 4 years ago

Thank you for your answer regarding the missing data. I will look at the new version of the package very soon.

Have a nice summer.

phargarten2 / miWQS

Poisson WQS Regression Unstable Indices and Rsolnp issues #2