Major Comments
Our procedure is not an apples-to-apples comparison with the one used in PhenomD. The referees suggest that we rerun the code with matching training waveforms (including the BAM waveforms) and possibly use hybrid waveforms for a fair comparison.
Minor Comments
[x] Introduction: "match filtering" -> "matched filtering"
[x] Fig. 1: Please clarify in the figure or the caption which averaged mismatch is used to compute the loss. Moreover, please explain why the validation loss is lower than the training loss during most of the optimization procedure and certainly at the point where the training loss has its global minimum. Usually, when using randomized splits of data into training and validation sets this should not be the case, and the validation loss should be higher than the training loss.
[x] Sec 2.2: The mismatch can be interpreted as a mean squared error, but it is certainly not unique: change "the MSE" to "a MSE". In the text L_mean seems to refer to L_ave. Please pick one and use it consistently. Please define what you mean by "initial mismatch".
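
To address this, Sec. 2.2 could state the conventions explicitly. A minimal sketch of the definitions we have in mind (standard noise-weighted overlap maximized over time and phase; the frequency range, PSD choice, and the identification of "initial mismatch" with the mismatch of the uncalibrated PhenomD model are assumptions to be checked against the paper's text):

```latex
% Assumed conventions: noise-weighted inner product and mismatch,
% maximized over a relative time and phase shift.
\langle h_1, h_2 \rangle = 4\,\mathrm{Re}\int_{f_{\min}}^{f_{\max}}
    \frac{\tilde h_1(f)\,\tilde h_2^{*}(f)}{S_n(f)}\,\mathrm{d}f,
\qquad
\mathcal{M}(h_1, h_2) = 1 - \max_{t_c,\,\phi_c}
    \frac{\langle h_1, h_2 \rangle}
         {\sqrt{\langle h_1, h_1 \rangle\,\langle h_2, h_2 \rangle}}.

% Averaged loss over the N training waveforms (written consistently as L_ave):
\mathcal{L}_{\mathrm{ave}}(\lambda) = \frac{1}{N}\sum_{i=1}^{N}
    \mathcal{M}\!\left(h^{\mathrm{NR}}_{i},\, h^{\mathrm{model}}_{i}(\lambda)\right).
```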
[ ] As it stands I would suggest that the authors tone down the result as stated in the abstract and in Sec. 3, by adding that the tails of the mismatch distributions (as shown in Fig. 4) hardly change, while the mode and median change as described.
[ ] Sec 4: Indeed there is a dependency between the two aligned spin components which is not captured by the current ansatz. This is hardly surprising and well understood: using an effective spin instead of the two component spins in the model parametrization is obviously an approximation. For instance, there are higher order terms in the PN expansion beyond the leading spin-orbit term that make this clear. It is therefore simple to address this by instead using (chi_1, chi_2) as spin parameters in the model ansatz which is done for more recent waveform models.
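
If we do switch parametrizations, a sketch of the quantities involved may help the reply. The definitions below are the standard ones (the 38\eta/113 coefficient is the leading-order PN reduced-spin coefficient), and the expansion in the last line only illustrates how a (chi_1, chi_2) ansatz would look, not coefficients we would actually fit:

```latex
% Effective aligned spin and the PN reduced spin used by PhenomD-type models
% (standard definitions; eta is the symmetric mass ratio):
\chi_{\mathrm{eff}} = \frac{m_1\chi_1 + m_2\chi_2}{m_1 + m_2}, \qquad
\chi_{\mathrm{PN}} = \chi_{\mathrm{eff}} - \frac{38\,\eta}{113}\,(\chi_1 + \chi_2).

% An ansatz in both component spins would expand each calibrated coefficient as
\lambda_k(\eta,\chi_1,\chi_2) = \lambda_k^{(0)}(\eta)
    + \lambda_k^{(1)}(\eta)\,\chi_1 + \lambda_k^{(2)}(\eta)\,\chi_2 + \ldots
```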
[x] Eq 2: what are the "i" superscripts denoting? Please add this to the paper.
[x] Eq 7: I didn't quite get what {\cal M}_i is denoting. In particular, what does "initial mismatch" refer to? My guess is perhaps the waveform before calibration (i.e., the original PhenomD model), is that correct? Perhaps further clarification can be added to the paper.
[x] page 4: is tapering applied to the beginning and end of the NR waveforms? While perhaps a small effect, the NR waveform does not go to zero at late times. Not tapering is known to have a small impact on the FFT'ed NR signal. The paper should state whether tapering was or was not applied to the end portion of the waveforms.
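
If we end up adding a taper (or documenting why we did not), a minimal sketch of tapering both ends before the FFT is shown below; the Tukey window, the `alpha` fraction, and the helper name `taper_ends` are illustrative choices, not the paper's actual pipeline:

```python
import numpy as np
from scipy.signal.windows import tukey

def taper_ends(h, alpha=0.1):
    """Multiply by a Tukey window so the time series goes smoothly to zero
    at both the start and the end (alpha = total fraction of tapered samples)."""
    return h * tukey(len(h), alpha)

# Stand-in NR strain samples and time step; in practice these would come from
# the (2,2) mode of the NR waveform.
h_nr = np.random.randn(4096)
dt = 1.0 / 4096.0

h_tapered = taper_ends(h_nr, alpha=0.1)
n_fft = 2 ** int(np.ceil(np.log2(len(h_tapered))))   # zero-pad to a power of two
h_tilde = np.fft.rfft(h_tapered, n=n_fft) * dt       # discrete approximation to the FT
freqs = np.fft.rfftfreq(n_fft, d=dt)
```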
[x] Algorithm 1: N is the number of iterations. But N is already being used to denote the number of training waveforms. So does "number of training waveforms" = "number of iterations"? If not, perhaps another variable name could be used.
[ ] page 4: The authors reasonably choose to set the initial value of \lambda to the PhenomD one. But given the high dimensionality of the calibration parameter space, there could be many local minima. It might be worthwhile to set the initialization value to be \lambda = \lambda_0 + \eps where \eps is drawn from a Gaussian of mean 0 and some small variance. Re-solving the optimization for many different initial values could provide for better identification of a global minimum.
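
This is straightforward to try. A minimal sketch of the suggested multi-start strategy is given below, where `lambda_0` stands for the PhenomD calibration values and `loss` for the averaged mismatch over the training set (both placeholders here), and the Nelder-Mead optimizer is just an illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

def multistart_minimize(loss, lambda_0, n_starts=20, sigma=1e-3, seed=0):
    """Re-solve the optimization from perturbed starts lambda = lambda_0 + eps,
    eps ~ N(0, sigma^2), and keep the run with the lowest final loss."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = lambda_0 + sigma * rng.standard_normal(lambda_0.shape)
        res = minimize(loss, x0, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best

# Toy usage; in practice `loss` would evaluate L_ave for a given set of
# calibration parameters.
lambda_0 = np.zeros(5)
toy_loss = lambda lam: float(np.sum((lam - 0.01) ** 2))
result = multistart_minimize(toy_loss, lambda_0)
```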
[ ] SXS waveform data: are the center of mass (CoM) corrected waveforms used? What extrapolation order is used? It would be good to provide those details somewhere. In particular, CoM-corrected data should be of higher quality. Also, the SXS:BBH:0001 waveform is sufficiently old that it probably cannot be trusted (numerous SpEC bug fixes since then).
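
For the revision we should state both the CoM-correction choice and the extrapolation order explicitly. For reference, a sketch of how those choices appear when reading a catalog strain file directly with h5py; the filename and group layout follow the public SXS catalog format as we understand it, and N=2 is only an example order, not necessarily what we used:

```python
import h5py

# CoM-corrected strain files carry the "_CoM" suffix; each extrapolation
# order lives in its own group inside the HDF5 file.
fname = "rhOverM_Asymptotic_GeometricUnits_CoM.h5"   # per-simulation file
extrapolation_order = 2                              # example choice

with h5py.File(fname, "r") as f:
    group = f[f"Extrapolated_N{extrapolation_order}.dir"]
    # Columns of Y_l2_m2.dat: time, Re(h_22), Im(h_22).
    t, re_h22, im_h22 = group["Y_l2_m2.dat"][:].T
```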