mjg211 / multiarm

Design of single- and multi-stage multi-arm clinical trials
https://mjg211.github.io/multiarm/

Is it possible to get some kind of "local significance level" accounting only for the sequentiality of the analysis? #3

Open pdeninis opened 2 years ago

pdeninis commented 2 years ago

Dear Professor Grayling,

As I anticipated in my previous post, I am a little embarrassed to admit that I struggle to interpret the printed tables. I know it is surely my fault, not having studied your project in depth; it is indeed complex.

Until I understand its workings more thoroughly, to simplify provisionally (I hope not unduly!), my aim is to obtain a kind of local alpha that accounts for sequentiality and guarantees that the "overall test" stays below, say, alpha = 0.05. I would then enter it as a sequentially adjusted alpha level into whatever multiple comparison procedure I apply next.

As an example, I report the output of the rpact application, which, however, does not allow for multi-arm designs.

Power calculation for a binary endpoint

Sequential analysis with a maximum of 2 looks (group sequential design), overall 
significance level 5% (two-sided).
The results were calculated for a two-sample test for rates (normal approximation),
H0: pi(1) - pi(2) = 0, H1: treatment rate pi(1) = 0.75, control rate pi(2) = 0.4, 
maximum number of subjects = 80.

Stage                                         1      2 
Information rate                            50%   100% 
Efficacy boundary (z-value scale)         2.178  2.178 
Overall power                            0.5477 0.8716 
Expected number of subjects                58.1 
Number of subjects                         40.0   80.0 
Cumulative alpha spent                   0.0294 0.0500 
Two-sided local significance level       0.0294 0.0294 
Lower efficacy boundary (t)               0.341  0.243 
Upper efficacy boundary (t)              -0.299 -0.221 
Exit probability for efficacy (under H0) 0.0294 
Exit probability for efficacy (under H1) 0.5477 

Legend:
  (t): treatment effect scale

Here, in my understanding, the desired alpha level for the first interim analysis would be 0.0294, corrected for sequentiality only, this being a two-arm design. It is equal to that of the final stage because Pocock's design was chosen. With the O'Brien & Fleming design, the two numbers become:

Two-sided local significance level 0.0052 0.0480

in agreement with my expectations.
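
For reference, a sketch of the rpact calls that should reproduce output like the above (argument values inferred from the printed summary; check the rpact documentation for the exact signatures):

library(rpact)

# Two-look, two-sided Pocock design at overall alpha = 0.05
design <- getDesignGroupSequential(kMax = 2, alpha = 0.05, sided = 2,
                                   typeOfDesign = "P")
getPowerRates(design, pi1 = 0.75, pi2 = 0.4, maxNumberOfSubjects = 80)

# Swapping in O'Brien & Fleming boundaries gives the (0.0052, 0.0480) levels
designOF <- getDesignGroupSequential(kMax = 2, alpha = 0.05, sided = 2,
                                     typeOfDesign = "OF")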

In the "Operating characteristics summary: "(key, error rates and others) I am unsure to correctly interpret the big deal of data so, to be safe, I would prefer to use a more simple and well-identified datum.

Could you please tell me which number I am looking for? Alternatively, could you tell me which numbers I should look at to get the alpha level to use with the omnibus test and with the direct comparisons between the control and experimental groups?

Thank you in advance! Paolo

PS: This information is easy to find even in your function in the case of a one-stage design, where the multiplicity correction method can be chosen. For example:

Design summary

Inputs
The following choices were made:

K=3  experimental treatments will be included in the trial.
A significance level of  α=0.05  will be used, in combination with the Holm-Bonferroni correction.
The response rate in the control arm will be assumed to be:  π0=0.3 .
The marginal power for each null hypothesis will be controlled to level  1−β=0.8  under each of their respective least favourable configurations.
The interesting and uninteresting treatment effects will be:  δ1=0.45  and  δ0=0  respectively.
The target allocation to each of the experimental arms will be: the same as the control arm.
The sample size in each arm will be required to be an integer.
Plots will not be produced.
Outputs
The total required sample size is:  N=72 .
The required sample size in each arm is:  (n0,…,nK)=(18,18,18,18) .
Therefore, the realised allocation ratios to the experimental arms are:  (r1,…,rK)=(1,1,1) .
The maximum familywise error-rate is:  0.043 .
The minimum marginal power is:  0.816 .
The following critical thresholds should be used with the chosen multiple comparison correction:  (0.017,0.025,0.05) .

Here the output is clearer: in particular, the sample size in each arm = 18 and the critical thresholds for multiple comparisons with the step-down method = (0.017, 0.025, 0.05). I am not sure, however, what exactly the "maximum familywise error-rate" = 0.043 represents, i.e. why it is not 0.05 (I believe it is something that may be overlooked, since it is linked only to the three comparisons under the Holm-Bonferroni method and seems independent of the specific data).
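
For reference, a sketch of the R call that presumably generated the summary above (the function name follows the package's des_ss_*/des_gs_* naming scheme, but all argument names here are assumptions; check ?des_ss_bern):

library(multiarm)
# Hypothetical reconstruction of the GUI inputs above; argument names are assumptions
des <- des_ss_bern(K = 3, alpha = 0.05, beta = 0.2, pi0 = 0.3,
                   delta1 = 0.45, delta0 = 0, integer = TRUE,
                   correction = "holm_bonferroni")
des   # prints N = 72, the maximal FWER (0.043) and the Holm thresholds,
      # which for K = 3 are (alpha/3, alpha/2, alpha) = (0.017, 0.025, 0.05)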

In the sequential section I would have expected something similar. However, the correction method is no longer selectable. I guess there is a reason for this, but I hope there is a way to get the equivalent information, maybe in the tables ...

For instance, if the single-stage design above were to become a two-stage design (for a total of N = 72*2 = 144 subjects), with the aim of detecting a smaller effect while still allowing the trial to stop earlier if a more substantial effect were present, by how much would the Holm-Bonferroni multiple-comparison thresholds be penalized in the two stages under the Pocock design? Is there a simple way to answer this question with your package?

pdeninis commented 2 years ago

Maybe I have a reasonable answer.

I tried to build both a single-stage and a multi-stage design with identical characteristics (except for the multiplicity correction, which is not [yet?] available for multi-stage designs), i.e.: Bernoulli endpoint, power 80%, alpha = 0.05, control rate = 0.33, d1 = 0.33, d0 = 0, type of control = MARGINAL.

For the single-stage design, the output is below:

Design summary

Inputs
The following choices were made:

K=3  experimental treatments will be included in the trial.
A significance level of  α=0.05  will be used, in combination with the Holm-Bonferroni correction.
The response rate in the control arm will be assumed to be:  π0=0.33 .
The marginal power for each null hypothesis will be controlled to level  1−β=0.8  under each of their respective least favourable configurations.
The interesting and uninteresting treatment effects will be:  δ1=0.33  and  δ0=0  respectively.
The target allocation to each of the experimental arms will be: the same as the control arm.
The sample size in each arm will be required to be an integer.
Plots will be produced.
Outputs
The total required sample size is:  N=148 .
The required sample size in each arm is:  (n0,…,nK)=(37,37,37,37) .
Therefore, the realised allocation ratios to the experimental arms are:  (r1,…,rK)=(1,1,1) .
The maximum familywise error-rate is:  0.043 .
The minimum marginal power is:  0.81 .
The following critical thresholds should be used with the chosen multiple comparison correction:  (0.017,0.025,0.05) . 

For the multi-stage design, it is instead:

Design summary

Inputs
The following choices were made:

K=3  experimental treatments will initially be included in the trial.
A maximum of  J=2  stages will be allowed in the trial.
A separate stopping rule will be used.
The stage-wise sample size will be variable.
The lower stopping boundaries will be Pocock.
The upper stopping boundaries will be Pocock.
A family-wise error-rate of  α=0.05  will be used.
The minimum marginal power will be controlled to level  1−β=0.8  under each of their respective least favourable configurations.
The response rate in the control arm will be assumed to be:  π0= .
The interesting and uninteresting treatment effects will be:  δ1=0.33  and  δ0=0  respectively.
The target allocation to each of the experimental arms will be: the same as the control arm.
The sample size in each arm will be required to be an integer.
Plots will be produced.
Outputs
The maximum required sample size is:  160 .
The stage-wise sample size to the control arm is: 20.
The realised allocation ratios to the experimental arms are:  (r1,…,rK)=(1,1,1) .
The maximum familywise error-rate is:  0.05 .
The minimum marginal power is:  0.818 .
The lower stopping boundaries are:  (−2.28,2.28) .
The upper stopping boundaries are:  (2.28,2.28) .
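
For reference, a sketch of a call that might generate a design like this (function and argument names assumed from the package's naming scheme; check ?des_gs_bern for the actual signature):

library(multiarm)
# Hypothetical reconstruction of the GUI inputs above; all argument names are assumptions
des <- des_gs_bern(K = 3, J = 2, alpha = 0.05, beta = 0.2,
                   pi0 = 0.33, delta1 = 0.33, delta0 = 0,
                   stopping = "separate", type = "variable",
                   fshape = "pocock", eshape = "pocock")
des$e   # the upper (efficacy) stopping boundaries, (2.28, 2.28) above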

My conclusion is that the sequential familywise error-rate for the multi-stage design should be exactly 0.05, just as the output says.

I was in doubt because I could not identify the penalization for the sequential design. By comparing the two, though, I noted that the required sample size for the single-stage design was 37 per group (total = 148), while for the multi-stage design it is 20 + 20 = 40 per group (total = 160). Maybe the 3 extra units per group are the cost of the two-stage design? If so, it would be a very interesting price!
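
In numbers:

4 * 37       # single-stage total: 148
2 * 4 * 20   # two-stage maximum: 160, i.e. 3 extra subjects per arm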

The output now seems reasonably clear: as I expected, in the multi-stage case the app provides only quantities with the sequentiality correction already incorporated.

Am I wrong?

PS: As you can see, the row:

The response rate in the control arm will be assumed to be: π0= .

does not show the entered value (0.33).

Moreover, when downloading the PDF and DOCX reports, the tables are not those reported by the web app:

         π0    π1    π2    π3    Pdis   Pcon   P1     P2     P3     FWER   PHER   ESS
HG       0.33  0.33  0.33  0.33  0.05   0.001  0.019  0.019  0.019  0.05   0.019  158.626
HA       0.33  0.66  0.66  0.66  0.959  0.638  0.818  0.818  0.818  0      0      127.156
LFC1     0.33  0.66  0.33  0.33  0.819  0.003  0.818  0.019  0.019  0.036  0.013  149.616
LFC2     0.33  0.33  0.66  0.33  0.819  0.003  0.019  0.818  0.019  0.036  0.013  149.617
LFC3     0.33  0.33  0.33  0.66  0.819  0.003  0.019  0.019  0.818  0.036  0.013  149.617
Equal #1 0.33  0     0     0     0      0      0      0      0      0      0      95.633
Equal #2 0.33  0.007 0.007 0.007 0      0      0      0      0      0      0      99.052
Equal #3 0.33  0.013 0.013 0.013 0      0      0      0      0      0      0      102.136
Equal #4 0.33  0.02  0.02  0.02  0      0      0      0      0      0      0      105.22
Equal #5 0.33  0.027 0.027 0.027 0      0      0      0      0      0      0      108.305

but look like these:

pi0  pi1  pi2  pi3  FWERI1  FWERI2  FWERI3  FWERII1  FWERII2  FWERII3
0    0    0    0    0       0       0       0        0        0
0    1    1    1    0       0       0       0        0        0
0    1    0    0    0       0       0       0        0        0
0    0    1    0    0       0       0       0        0        0
0    0    0    1    0       0       0       0        0        0

pi0  pi1  pi2  pi3  Pdis  Pcon  P1  P2  P3
0    0    0    0    0     0     0   0   0
0    1    1    1    1     1     1   1   1
0    1    0    0    1     0     1   0   0
0    0    1    0    1     0     0   1   0
0    0    0    1    1     0     0   0   1

pi0  pi1  pi2  pi3  PHER  FDR  pFDR  FNDR  Sens  Spec  ESS  SDSS  MeSS  MoSS  maxN
0    0    0    0    0     0    1     0     0     1     159  6     160   160   160
0    1    1    1    0     0    0     0     1     0     127  29    140   160   160
0    1    0    0    0     0    0     0     1     1     150  11    160   160   160
0    0    1    0    0     0    0     0     1     1     150  11    160   160   160
0    0    0    1    0     0    0     0     1     1     150  11    160   160   160
mjg211 commented 2 years ago

Hi Paolo,

The output from the multi-stage commands controls the FWER to a specified level alpha (i.e., it adjusts for both having multiple arms and multiple stages). What is returned are upper (efficacy) and lower (futility) stopping boundaries on the Z-scale. You can find the efficacy boundaries, for example, with

des <- des_gs_norm()
des$e

To convert to the p-value scale, all you need to run is

1 - pnorm(des$e)
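
Applied to the two-stage design above, where the efficacy boundaries are (2.28, 2.28), this gives the stage-wise one-sided p-value thresholds:

1 - pnorm(c(2.28, 2.28))   # 0.0113 0.0113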

In the single-stage case, there are a variety of multiple comparison procedures available (to adjust for having multiple arms) - the one selected in the output above is Holm-Bonferroni. This doesn't leverage the (approximately known) correlation between the test statistics, which is what causes the maximal FWER (0.043) to fall below the allowed level (0.05). If you were to change the single-stage implementation to use Dunnett's correction, this would arguably be a fairer comparison of the cost, in maximal sample size, of incorporating the interim analysis, as the multi-stage commands only implement a generalised Dunnett test (though in the grand scheme of things it's not going to change much compared to using Holm-Bonferroni).

Hope that makes sense,

Michael

pdeninis commented 2 years ago

It makes a lot of sense.

As for the 0.043, I have to apologize: had I tried selecting the plain Bonferroni correction, I would have realized by myself that the deviation from 0.05 was not due to Holm! Even though I had occasionally used the R software provided by Derringer (2018), "A simple correction for non-independent tests", related to Nyholt (2004), "A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other", I did not believe that the... cost of the correlations between tests could be shown so convincingly. Your app is a mine of useful suggestions!

I had preferred the Holm-Bonferroni correction because I expected two of the comparisons to have very small p-values while the third would be on the boundary. Using the step-down method, I thought I would give maximum power to this last comparison while sacrificing little or nothing for the others.

No doubt because I was conditioned by the rpact application, which for multi-stage designs uses the same convention that multiarm uses for single-stage ones - i.e. it reports the new thresholds for the p-values - I had not fully realized that for multi-stage designs you had decided to report only the stopping boundaries on the Z-scale, with all the adjustments built into them.

It would be interesting to know the reason for this choice (the two seem entirely equivalent...). Certainly, using quantiles of the normal distribution is just as simple, and it is even clearer once you know they are already adjusted for multiplicity too!
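
To make the equivalence concrete (one-sided case, with a hypothetical observed statistic z and the adjusted boundary from the design above):

z <- 2.5                          # hypothetical observed test statistic
e <- 2.28                         # adjusted efficacy boundary on the Z-scale
z > e                             # TRUE: reject on the Z-scale
(1 - pnorm(z)) < (1 - pnorm(e))   # TRUE: the same decision on the p-value scale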

Please let me know if it is possible to download tables equal to those printed by the shiny app (I have copied and pasted them, but I fear they are incomplete), or whether I have done something wrong and that is why I get only a lot of ones and zeros (they look like model matrices...!). Or, alternatively, whether they make sense the way they are now.

Thank you again, Michael.

mjg211 commented 2 years ago

It's really just personal preference around how to specify the boundaries - I've worked mostly in the past defining them in terms of Z, but it is as you say completely equivalent. I can add returning both in a future version easily enough.

On the tables, it may be that I've updated the shiny app but not the package as they run off different repositories. Can I check what version of multiarm you're using (you can find this with packageVersion("multiarm"))?

pdeninis commented 2 years ago

> It's really just personal preference around how to specify the boundaries - I've worked mostly in the past defining them in terms of Z, but it is as you say completely equivalent. I can add returning both in a future version easily enough.

Yes, I understand. But keep in mind that many R functions [e.g. multcomp::glht()] apply a multiplicity adjustment by default. Having to manage overlapping adjustments may become quite uncomfortable, as I will try to show.

If the multi-stage output reports the fully adjusted Z boundary, there is a risk of producing plots whose confidence intervals carry a twofold multiplicity correction: one applied upstream by multiarm, conveyed through the level parameter passed to glht, and the other applied by default by the function itself.

To prevent the issue, I first have to get the fully adjusted p-value as you showed, then pass its complement to one to the level parameter, and finally disable the default multiplicity adjustment - this last step, according to the help file, via the parameter calpha = univariate_calpha():

library(survival)   # coxph(), Surv()
library(multcomp)   # glht(), mcp(), univariate_calpha()

model <- coxph(Surv(time, status) ~ FACTOR, data = df, robust = TRUE)
ph <- glht(model, linfct = mcp(FACTOR = "Dunnett"))   # one-to-all contrasts
lev <- pnorm(design$e)[1]                             # level implied by the adjusted Z boundary
dev.new(height = 4, width = 7); par(oma = c(0, 8, 0, 0))
plot(confint(ph, calpha = univariate_calpha(), level = lev),
     main = "95% family-wise CI \nadjusted for sequentiality and multiplicity")

Note that the output of confint(ph, calpha = univariate_calpha()) always shows, somewhat confusingly, "Simultaneous Confidence Intervals":

> confint(ph, calpha=univariate_calpha())

                 Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: coxph(formula = Surv(time, status) ~ FACTOR, data = df, 
    robust = TRUE)

Quantile = 1.96
95% confidence level

even when the multiplicity correction is turned off by means of calpha=univariate_calpha().

In fact, glht interprets the "Dunnett Contrasts" statement as the one-to-all contrast matrix, without any reference to Dunnett's multiplicity correction, which is otherwise applied by default (it does the same for the "Tukey Contrasts" (pairwise), to which it applies the so-called single-step correction; as an aside, both the Dunnett and Tukey single-step corrections take the correlations between the outcomes into account, as far as I know). However, the confidence level is reported at the bottom along with the Quantile.

When the level = pnorm(design$e[1]) corresponding to the Z boundary design$e[1] is used, the Quantile printed by glht is by no means equal to:

> design$e[1]
[1] 2.280253

the level passed being:

> lev
[1] 0.9887037

> confint(ph, calpha=univariate_calpha(), level=lev)

         Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: coxph(formula = Surv(time, status) ~ FACTOR, data = df, 
    robust = TRUE)

Quantile = 2.5334
98.8703662341852% confidence level

Linear Hypotheses:
             Estimate lwr     upr    
B - A == 0  -0.8685  -1.5299 -0.2071
C - A == 0   1.0600   0.1731  1.9469
D - A == 0   2.6778   2.0307  3.3249

but I still have to understand where the quantile comes from and why it is not equal to:

> qnorm(pnorm(design$e[1])) 
[1] 2.280253

Here it seems that glht() applies some correction, since the levels entered do not match the quantiles. Please, if you have any idea, let me know!
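
A plausible explanation: confint() treats level as two-sided coverage, so with a normal reference the quantile it computes is the two-sided one rather than qnorm(level):

lev <- pnorm(2.280253)       # 0.9887037, the level that was passed
qnorm(lev)                   # 2.280253: the one-sided quantile
qnorm(1 - (1 - lev) / 2)     # 2.5334: the two-sided quantile that glht() printed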

Fortunately, it is possible to input the quantile directly, or both the quantile and the level. In the latter case the quantile prevails, while the level is only printed on screen without influencing the computations (but only if calpha is entered as a number! If calpha = univariate_calpha() is used, it is the level argument that prevails!).

> confint(ph, calpha=2.28, level=0.95)

         Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: coxph(formula = Surv(time, status) ~ FACTOR, data = df, 
    robust = TRUE)

Quantile = 2.28
95% confidence level

Linear Hypotheses:
             Estimate lwr     upr    
B - A == 0  -0.8685  -1.4637 -0.2732
C - A == 0   1.0600   0.2618  1.8582
D - A == 0   2.6778   2.0954  3.2602

> confint(ph, calpha=2.28, level=0.99)

         Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: coxph(formula = Surv(time, status) ~ FACTOR, data = df, 
    robust = TRUE)

Quantile = 2.28
99% confidence level

Linear Hypotheses:
             Estimate lwr     upr    
B - A == 0  -0.8685  -1.4637 -0.2732
C - A == 0   1.0600   0.2618  1.8582
D - A == 0   2.6778   2.0954  3.2602

Of course, the result obtained by entering calpha = 2.28 differs from that obtained with calpha = univariate_calpha().

> On the tables, it may be that I've updated the shiny app but not the package as they run off different repositories.

Maybe I was unclear. The tables I showed above were copied and pasted from a .docx downloaded from the shiny app. I still do not know if and how it is possible to produce a report file from R (this was one of the reasons why I asked you for a technical document or a video tutorial).

In any case, the version installed on my R instance is:

> packageVersion("multiarm")
[1] ‘0.13.5’
mjg211 commented 2 years ago

Unfortunately, producing confidence intervals with desired coverage for a multi-arm multi-stage design isn't as simple as using the adjusted significance levels that preserve the familywise error-rate (see, e.g., https://www.cytel.com/hubfs/0-library-0/presentations/MehtaSlides-IISA-Pune-2015.pdf), so it's not currently possible to use multiarm to find appropriate CIs.

On a report, that's not currently a feature you can access from R unfortunately - you can load the GUI offline through multiarm::gui() to avoid the need for an internet connection though. The tables that the download contains are also all made with the opchar_###_###() commands, so it is possible to generate them in R (if not download them to Word).
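
For instance, a minimal sketch along those lines (assuming opchar_gs_norm() follows the same naming scheme as des_gs_norm(); the exact signature and returned element name are assumptions, so check the package documentation):

library(multiarm)
des <- des_gs_norm()        # default multi-stage design, as above
oc  <- opchar_gs_norm(des)  # operating characteristics (assumed signature)
oc$opchar                   # data frame behind the GUI tables (assumed element name)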

pdeninis commented 2 years ago

> Unfortunately, producing confidence intervals with desired coverage for a multi-arm multi-stage design isn't as simple as using the adjusted significance levels that preserve the familywise error-rate (see, e.g., https://www.cytel.com/hubfs/0-library-0/presentations/MehtaSlides-IISA-Pune-2015.pdf), so it's not currently possible to use multiarm to find appropriate CIs.

I have to apologize for putting the cart before the horse with the confidence intervals! I lacked this fundamental piece of information when I asked for a local adjustment of the significance level for sequentiality. Thank you for opening my eyes, and for the slides you shared!

Admittedly, I was overlooking the specific sequentiality issue. From my point of view, though, I was just applying a way of calculating a set of conventional Holm-Bonferroni confidence intervals.

While some hold that it is altogether impossible to compute them, according to at least two authors, who propose ways of constructing versions of them starting from the adjusted p-values, they are actually feasible. I thought I was correct in doing so, on the strength of two conditions: that I would use validated p-values, and that I would use one (of two) validated methods.

In fact, I still find it hard to see why, if the p-values are correct and the parameter distribution is known, the intervals might fail to guarantee coverage. If Ludbrook's and Serlin's procedures are correct, two different kinds of p-values would exist, usable and not usable, which is bewildering to me. So it has now become imperative for me to tackle the specific literature on the subject (e.g. Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC: London, 2000)!

At present, confidence intervals are an indispensable requirement, so this may be a fatal shortcoming of the multi-arm multi-stage design, at least until it is overcome. What a pity!