Referee report - Githubissues

This is the referee report we got from a submission. Will open a checklist in this thread.

The authors propose revitalizing classic approaches for interpreting compact binary observations by 'deprojecting' or 'backpropagating' them in time based on an ensemble of forward models which propagate the phase space of (ZAMS binary properties and binary evolution modeling assumptions) into present-day compact binary observations. The authors demonstrate their technique using a 4-dimensional subspace of the COSMIC binary evolution code's model parameters. Their concrete empirical localized Green's function algorithm should be very useful to quickly assess which events can be characterized by binary evolution and, if so, with what common assumptions. The authors defer discussion of "the physical implications of the results presented in this work" to a "future companion paper".

While I applaud the authors for using contemporary tools to help ensure their work is presently reproducible, I'll limit my review to only what is provided and disseminated by the archival journal (and warn the authors about inevitable linkrot).

I like seeing this idea being realized, and recommend this paper for publication once the authors address the comments below. I regret their length, and emphasize that I anticipate only some edits to the draft are needed.

Primary

(A) Introduction and context: The authors give short shrift to other approaches to constrain binary evolution hyperparameters, highlighting only the single most technically similar work. Brief references to other efforts to directly compare binary evolution to model predictions would be helpful (e.g., a highly nonexhaustive list: Zevin et al 2017 ApJ; Wysocki et al 2018 PRD; Barrett et al 2018 MNRAS; Taylor and Gerosa 1806.08365; Spera et al 2019; ... and, for two brand new studies, Mastrogiovanni 2207.00374; Delfavero et al 2107.13082 )

The authors also don't point out the similarity of their approach to the long history of using forward/inverse distribution propagation (e.g., Kalogera et al 1996 astro-ph/9605186), let alone the history of reconstructing progenitor populations via comparison to (discrete) grids of simulations, used for example in Stevenson et al 2017, Belczynski et al 2016, ...to understand the origins of their sources (i.e., essentially all groups, over many decades, including origins of binary pulsars and WD, X-ray binaries, etc; see, e.g., Clausen et al 1201.0012; van den Heuvel lectures; etc).

I ask the authors more carefully differentiate their novel continuous contribution from these and other prior works. I ask because I'm sure the authors can use this opportunity to show how easily they reproduce saved-trajectory/run-based approaches and how much more flexible their technique will be.

(B) Methods paper details: technique, validation, ... ? The single page this methods paper allocates to its details appears insufficient to describe their approach and its validation.

Just to pick the first technical example: what is the dimension of $\theta$ exactly? Mass-only + log mass; solar mass; ...? Please explicitly use the symbols \theta',\theta,\lambda and provide symbols in the discussion at the start of section 3, when you define the list? Similarly when discussing Eq. (5): the coordinates and scales for your performance metric are very important.
How is the KDE implemented (e.g., coordinate scales, choice of smoothing)? Is the likelihood estimated by prior-reweighting + KDE sufficiently accurate for all events considered (e.g., finite training data; smoothing scales vs features)? [Presumably so, since this seems to be mass-only.]
How robust is the root-finding method at finding representative examples of all pertinent solutions to the many channels that can contribute to different events (as highlighted in the introduction)? Why is 1000 attempts enough?
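
To make the KDE question above concrete, here is a minimal sketch of a prior-reweighted Gaussian KDE in one dimension. It assumes posterior samples drawn under a known sampling prior, so that weighting each sample by 1/prior turns the density estimate into a likelihood estimate up to a constant. The function names, the toy samples, the linear prior, and the rule-of-thumb bandwidth are all illustrative assumptions, not the authors' implementation.

```python
import math
import random

def weighted_kde(samples, weights, bandwidth):
    """Return a function evaluating a normalized, weighted 1-D Gaussian KDE."""
    total = sum(weights)
    norm = 1.0 / (total * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        acc = 0.0
        for s, w in zip(samples, weights):
            z = (x - s) / bandwidth
            acc += w * math.exp(-0.5 * z * z)
        return norm * acc

    return density

def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth (unweighted); an assumed default only."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return 1.06 * math.sqrt(var) * n ** (-0.2)

# Toy "posterior samples" of a mass-like quantity (hypothetical numbers).
rng = random.Random(0)
samples = [rng.gauss(30.0, 2.0) for _ in range(2000)]

def sampling_prior(m):
    # Hypothetical sampling prior, proportional to m over the range of interest.
    return m

weights = [1.0 / sampling_prior(s) for s in samples]
h = rule_of_thumb_bandwidth(samples)
likelihood = weighted_kde(samples, weights, h)            # prior-reweighted
posterior = weighted_kde(samples, [1.0] * len(samples), h)  # unweighted

# Repeat the estimate at several smoothing scales to gauge sensitivity.
for scale in (0.5, 1.0, 2.0):
    est = weighted_kde(samples, weights, scale * h)
    print(f"bandwidth x{scale}: L(28)/L(32) = {est(28.0) / est(32.0):.3f}")
```

Comparing the printed ratios across bandwidths is one simple way to see whether the smoothing scale, rather than the data, is driving a feature.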

The initial guess procedure seems to rely on a test suite of uniformly sampled points over the binary parameter (and hyperparameter?) space. Keeping in mind some configurations (e.g., kicks which don't disrupt binaries but fine-tune merger times for BBHs) can be rare but overall important to the observed BBH rates (when detection-weighted), can you outline why your uniformly selected set of sample points is likely to be sufficient -- for example, how many points were chosen? The discussion on L638-658 belongs earlier in the text (e.g., a 'validation' section or the methods section)
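
Both coverage questions above (why 1000 attempts, why a uniform test suite) reduce to a simple bound under an idealized assumption: if a formation channel's basin of attraction occupies a fraction f of the uniformly sampled volume, then N independent draws all miss it with probability (1 - f)^N. The sketch below is a back-of-the-envelope aid with hypothetical values of f; it says nothing about COSMIC itself or about the authors' sampler.

```python
import math

def miss_probability(fraction, n_draws):
    """Probability that n_draws independent uniform draws all miss a region
    occupying `fraction` of the sampled volume."""
    return (1.0 - fraction) ** n_draws

def draws_needed(fraction, max_miss):
    """Smallest number of draws for which the miss probability is <= max_miss."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - fraction))

for f in (1e-2, 1e-3, 1e-4):  # hypothetical basin fractions
    print(f"f = {f:g}: P(miss | N=1000) = {miss_probability(f, 1000):.3f}, "
          f"N for 1% miss = {draws_needed(f, 0.01)}")
```

Under this idealization, a channel occupying 0.1% of the sampled volume is missed by 1000 draws more than a third of the time, which is why the number of points and the sampling measure matter for rare but detection-weighted-important configurations.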

Section 3 first paragraph should be in the methods section, as noted above using previously-defined symbols \theta',\theta, \lambda, X, and refer to Table 1 ln L36-370 when the parameters are actually introduced, not L329. Describe any postprocessing of the source redshift outlined in L341-L361. What if any SFR model was used?
COSMIC's modeling assumptions should be reviewed in section 2. Notably, I shouldn't have to find out what SN engine is being used in L383!
What if any prior is applied to the ZAMS mass, eccentricity, and orbital period? These priors (particularly mass) have a very strong effect on observables.
For 150914 (and for many events), you don't invoke SN kicks at all, assuming complete fallback. In this case all evolution is deterministic. You should highlight this fact in the methods, as it considerably strengthens this reader's confidence in your conclusions about massive events.
The authors provide a loose single-event validation demonstration (Fig 3). The authors don't indicate if the method has any systematic challenges over the model space (e.g., extending the discussion from L539-550, does it perform poorly near regions where binary evolution has difficulty generating solutions?) The authors also only perform a validation study in the deterministic regime, by far the simplest, but their most tantalizing result in Fig 4 involves lower masses where SN kicks contribute.

(C) Discussion of event rates/counts? Other approaches to constrain astrophysical formation model hyperparameters can rely heavily on the event rate or observed counts: the Poisson term from observed events can be the dominant factor of the overall inhomogeneous Poisson likelihood. Please discuss the extent to which this approach accounts (or not) for the different BBH formation rate for different models.

Alternatively, in Eq. (4), I think I expect an additional term $Z(\lambda)$ associated with the rate normalization as a function of model hyperparameters.
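
For reference, the standard inhomogeneous Poisson likelihood for $N$ detected events has the form sketched below. The notation is assumed here and is not taken from the manuscript: $\mu(\lambda)$ is the expected number of detections, $\ell_k$ the single-event likelihood, $p_{\rm det}$ the detection probability, and $dN/d\theta$ the unnormalized rate density. The $e^{-\mu(\lambda)}$ factor and the normalization of $dN/d\theta$ are where a rate term of the kind requested above would enter.

```latex
\mathcal{L}(\lambda) \propto e^{-\mu(\lambda)}
  \prod_{k=1}^{N} \int \ell_k(\theta)\, \frac{dN}{d\theta}(\theta \mid \lambda)\, d\theta ,
\qquad
\mu(\lambda) = \int p_{\rm det}(\theta)\, \frac{dN}{d\theta}(\theta \mid \lambda)\, d\theta .
```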

When applied to GW150914, are all the models in the posterior producing event rates that are consistent with the observed (single) count from O1?

[If you don't want to do this check, that's fine, just say very clearly how you are eliding including rate in the calculation.]
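
One way to run the count check requested above is to compare each model's expected number of detections against the observed count using Poisson tail probabilities. The sketch below assumes a hypothetical dictionary of expected counts per hyperparameter setting and an arbitrary threshold; in practice the expected counts would come from each model's merger rate folded with the O1 sensitivity, which is not computed here.

```python
import math

def poisson_pmf(n, mu):
    """Probability of observing exactly n events when mu are expected."""
    return math.exp(-mu) * mu ** n / math.factorial(n)

def poisson_cdf(n, mu):
    """Probability of observing at most n events when mu are expected."""
    return sum(poisson_pmf(k, mu) for k in range(n + 1))

def consistent(n_obs, mu, alpha=0.05):
    """Two-sided check: neither tail probability falls below alpha / 2."""
    lower = poisson_cdf(n_obs, mu)  # P(N <= n_obs)
    # P(N >= n_obs); trivially 1 when nothing was observed.
    upper = 1.0 - poisson_cdf(n_obs - 1, mu) if n_obs > 0 else 1.0
    return min(lower, upper) >= alpha / 2.0

# Hypothetical expected detection counts for three hyperparameter settings.
expected_counts = {"model_a": 0.01, "model_b": 1.5, "model_c": 12.0}
n_observed = 1  # the single O1 count referred to above

for name, mu in expected_counts.items():
    verdict = "consistent" if consistent(n_observed, mu) else "in tension"
    print(f"{name}: P(N=1) = {poisson_pmf(n_observed, mu):.4f} ({verdict})")
```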

(D) Too long/non-cohesive section 3: Break results at L511 ('potential benefit') as 'discussion', change 'discussion' to 'conclusions'. Focus on Fig 4 there. Also consider moving more material from that region to the conclusions (e.g., L511-555)

(E) Flexible hyperparameters versus modeling systematics: I like the idea behind treating binary evolution parameters as locally-constant features of the evolution trajectory: given how poor our phenomenology has been, it's been important to allow that flexibility to identify systematic modeling limitations (e.g., q_{crit, CE}). However, I'm uncomfortable with L87: I would expect the same physics to apply, if that physics were accurate. I suggest the authors massage this sentence to highlight their ability to identify modeling systematics/incompleteness in approaches to binary evolution, which I believe is the authors' intent.

(F) L523/Fig 4 : Real or mirage? Can the authors give some number for how likely such a trend is to occur by chance? The lowest 3 events having outliers below? [I call this out because the authors highlight it in the abstract but don't support it with a calculation.]

I'm also uncomfortable with highlighting this feature in the abstract and conclusions, even qualified as 'may vary with', as I'd expect some discussion of significance. An alternative ('Are consistent with') may also be too strong without some discussion of what constitutes consistency.
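
A crude version of the chance-probability number requested above: if the outlier labels were unrelated to mass, the probability that the k outliers are exactly the k lowest-mass events among n is 1 / C(n, k). The sketch below computes that and cross-checks it by random relabeling. The sample size is hypothetical, and the null model ignores posterior uncertainties and any look-elsewhere effect, so it is an illustration of the kind of calculation rather than a significance estimate for Fig 4.

```python
import math
import random

def exact_chance(n_events, n_outliers):
    """Probability that the n_outliers lowest-mass events are exactly the
    outliers, if outlier labels are assigned uniformly at random."""
    return 1.0 / math.comb(n_events, n_outliers)

def permutation_estimate(n_events, n_outliers, n_trials=100_000, seed=1):
    """Monte Carlo estimate under the same null hypothesis, as a cross-check."""
    rng = random.Random(seed)
    lowest = set(range(n_outliers))  # indices of events sorted by mass
    hits = 0
    for _ in range(n_trials):
        if set(rng.sample(range(n_events), n_outliers)) == lowest:
            hits += 1
    return hits / n_trials

n_events = 10  # hypothetical sample size, not the manuscript's
print(f"exact:       {exact_chance(n_events, 3):.4f}")
print(f"Monte Carlo: {permutation_estimate(n_events, 3):.4f}")
```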

Minor

(a) Abstract: Must indicate the dimensionality and degrees of freedom explored here (i.e., summarize bottom rows of Table 1).

(b) Abstract: last sentence: unsupported, remove or change to 'could be consistent with' instead of 'may be'. (See comments on Fig 4 elsewhere).

(c) 'inference of hyperparameter settings must proceed dynamically' : Change to 'we proceed dynamically', 'must' is not warranted; other techniques exist to infer model hyperparameters (e.g. Delfavero et al; Gallegos-Garcia; Fragos's POSYDON; Andrews's dart_board)

(d) L165: Origin strongly correlated with hyperparameters, many channels: Presumably demonstrated before in some smaller extent; see citation list above for possible references.

(e) L260: 'regardless to its initial guess' : Unclear what these sentences are saying.

(f) L284-287: Symbols help: state that full knowledge of $\Theta$ is inaccessible, you only have access to $\theta',\lambda$ but not X.

(g) L294-297: A symbolic expression would help here: you generate new random X' for every pair (\theta',\lambda) and, via the forward map, find a new posterior based on samples $\theta_k = (\theta',X',\lambda)_k$?

(h) L381: \sigma does not impact formation: in contrast to other work, but that work is based on event rates in addition, which are not included here; also, they explain zero natal kick for this event.

(i) L388: This should be stated clearly at the start of section 3!

(j) L480-L496: These statements ('never') are very strong. Since in this mass range the authors invoke fallback kicks, it's believable. I would be very skeptical however for any conclusions made about low-mass events where SN kicks could play a role.

(k) L497-L510: Recommend presenting this validation test early in section 3, before detailed discussion of hyperparameter posteriors.

(l) L 559-561: should discuss the cutoff applied before L519 where a claim is made about what the sample means.

(m) L576: More efficient how/why? Compared to what?

(n) L581: Persistent comment about Fig 4, particularly given lack of discussion of significance (and not including rate constraints overall between all events -- no guarantee the models in these posteriors have consistent counts).

(o) L592: See (E) above. Can be misread as "the authors allow for each merger to be drawn from its own realization of the physical universe".

(p) L585-591: Sharper contextual framing will help. For example, 'study the distribution of hyperparameters, a thoroughly explored approach for phenomenological models but rarely employed for detailed binary evolution models'. The authors' contributions are more unique in context than overall (i.e., many previous/other works have done this with phenomenology).

(q) L594-L612: See (C) above - the authors' pullback doesn't involve selection at all right now. Add phrase, 'and added the appropriate Poisson factor to account for selection effects' after 'once we have pulled back ...space'.

(r) L614 : 'for the first time' : not true, see above eg (A); remove.

(s) L682: should mention previous grid-based works here, see (A) and mention Andrews 2021, and highlight why/how you've improved on them.

Data Editor's review: One of our data editors has reviewed your initial manuscript submission and has the following suggestion(s) to help improve the data, software citation and/or overall content. Please treat this as you would a reviewer's comments and respond accordingly in your report to the science editor. Questions can be sent directly to the data editors at data-editors@aas.org.

The authors should use the Zenodo DOI instead of its url. This will make the linking more robust in the future and discoverable.

Primary
- Introduction and context
  - [x] The authors give short shrift to other approaches to constrain binary evolution hyperparameters, highlighting only the single most technically similar work. Brief references to other efforts to directly compare binary evolution to model predictions would be helpful (e.g., a highly nonexhaustive list: Zevin et al 2017 ApJ; Wysocki et al 2018 PRD; Barrett et al 2018 MNRAS; Taylor and Gerosa 1806.08365; Spera et al 2019; ... and, for two brand new studies, Mastrogiovanni 2207.00374; Delfavero et al 2107.13082)
  - [x] The authors also don't point out the similarity of their approach to the long history of using forward/inverse distribution propagation (e.g., Kalogera et al 1996 astro-ph/9605186), let alone the history of reconstructing progenitor populations via comparison to (discrete) grids of simulations, used for example in Stevenson et al 2017, Belczynski et al 2016, ...to understand the origins of their sources (i.e., essentially all groups, over many decades, including origins of binary pulsars and WD, X-ray binaries, etc; see, e.g., Clausen et al 1201.0012; van den Heuvel lectures; etc).
  - [x] I ask the authors more carefully differentiate their novel continuous contribution from these and other prior works. I ask because I'm sure the authors can use this opportunity to show how easily they reproduce saved-trajectory/run-based approaches and how much more flexible their technique will be.
- Methods paper details: technique, validation, ... ?
  - [x] Just to pick the first technical example: what is the dimension of $\theta$ exactly? Mass-only + log mass; solar mass; ...? Please explicitly use the symbols \theta',\theta,\lambda and provide symbols in the discussion at the start of section 3, when you define the list? Similarly when discussing Eq. (5): the coordinates and scales for your performance metric are very important.
  - [x] How is the KDE implemented (e.g., coordinate scales, choice of smoothing)? Is the likelihood estimated by prior-reweighting + KDE sufficiently accurate for all events considered (e.g., finite training data; smoothing scales vs features)? [Presumably so, since this seems to be mass-only.]
  - [x] How robust is the root-finding method at finding representative examples of all pertinent solutions to the many channels that can contribute to different events (as highlighted in the introduction)? Why is 1000 attempts enough?
  - [x] The initial guess procedure seems to rely on a test suite of uniformly sampled points over the binary parameter (and hyperparameter?) space. Keeping in mind some configurations (e.g., kicks which don't disrupt binaries but fine-tune merger times for BBHs) can be rare but overall important to the observed BBH rates (when detection-weighted), can you outline why your uniformly selected set of sample points is likely to be sufficient -- for example, how many points were chosen? The discussion on L638-658 belongs earlier in the text (e.g., a 'validation' section or the methods section)
  - [x] Section 3 first paragraph should be in the methods section, as noted above using previously-defined symbols \theta',\theta, \lambda, X, and refer to Table 1 ln L36-370 when the parameters are actually introduced, not L329. Describe any postprocessing of the source redshift outlined in L341-L361. What if any SFR model was used?
  - [x] COSMIC's modeling assumptions should be reviewed in section 2. Notably, I shouldn't have to find out what SN engine is being used in L383!
  - [x] What if any prior is applied to the ZAMS mass, eccentricity, and orbital period? These priors (particularly mass) have a very strong effect on observables.
  - [x] For 150914 (and for many events), you don't invoke SN kicks at all, assuming complete fallback. In this case all evolution is deterministic. You should highlight this fact in the methods, as it considerably strengthens this reader's confidence in your conclusions about massive events.
  - [x] The authors provide a loose single-event validation demonstration (Fig 3). The authors don't indicate if the method has any systematic challenges over the model space (e.g., extending the discussion from L539-550, does it perform poorly near regions where binary evolution has difficulty generating solutions?) The authors also only perform a validation study in the deterministic regime, by far the simplest, but their most tantalizing result in Fig 4 involves lower masses where SN kicks contribute.
- Discussion of event rates/counts?
  - [x] Other approaches to constrain astrophysical formation model hyperparameters can rely heavily on the event rate or observed counts: the Poisson term from observed events can be the dominant factor of the overall inhomogeneous Poisson likelihood. Please discuss the extent to which this approach accounts (or not) for the different BBH formation rate for different models.
  - [x] Alternatively, in Eq. (4), I think I expect an additional term $Z(\lambda)$ associated with the rate normalization as a function of model hyperparameters.
  - [x] When applied to GW150914, are all the models in the posterior producing event rates that are consistent with the observed (single) count from O1?
- Too long/non-cohesive section 3:
  - [x] Break results at L511 ('potential benefit') as 'discussion', change 'discussion' to 'conclusions'. Focus on Fig 4 there. Also consider moving more material from that region to the conclusions (e.g., L511-555)
- Flexible hyperparameters versus modeling systematics
  - [x] I like the idea behind treating binary evolution parameters as locally-constant features of the evolution trajectory: given how poor our phenomenology has been, it's been important to allow that flexibility to identify systematic modeling limitations (e.g., q_{crit, CE}). However, I'm uncomfortable with L87: I would expect the same physics to apply, if that physics was accurate. I suggest the authors massage this sentence to highlight their ability to identify modeling systematics/incompleteness in approaches to binary evolution, which I believe is the authors' intent.
- L523/Fig 4 : Real or mirage?
  - [x] Can the authors give some number for how likely such a trend is to occur by chance? The lowest 3 events having outliers below? [I call this out because the authors highlight it in the abstract but don't support it with a calculation.]
  - [x] I'm also uncomfortable with highlighting this feature in the abstract and conclusions, even qualified as 'may vary with', as I'd expect some discussion of significance. An alternative ('Are consistent with') may also be too strong without some discussion of what constitutes consistency.
Minor
- [x] Abstract: Must indicate the dimensionality and degrees of freedom explored here (ie, summarize bottom rows of Table 1).
- [x] Abstract: last sentence: unsupported, remove or change to 'could be consistent with' instead of 'may be'. (See comments on Fig 4 elsewhere).
- [x] 'inference of hyperparameter settings must proceed dynamically' : Change to 'we proceed dynamically', 'must' is not warranted; other techniques exist to infer model hyperparameters (e.g. Delfavero et al; Gallegos-Garcia; Fragos's POSYDON; Andrews's dart_board)
- [x] L165: Origin strongly correlated with hyperparameters, many channels: Presumably demonstrated before in some smaller extent; see citation list above for possible references.
- [x] L260: 'regardless to its initial guess' : Unclear what these sentences are saying.
- [x] L284-287: Symbols help: state that full knowledge of $\Theta$ is inaccessible, you only have access to $\theta',\lambda$ but not X.
- [x] L294-297: A symbolic expression would help here: you generate new random X' for every pair (\theta',\lambda) and, via the forward map, find a new posterior based on samples $\theta_k = (\theta',X',\lambda)_k$?
- [x] L381: \sigma does not impact formation: in contrast to other work, but that work is based on event rates in addition, which are not included here; also, they explain zero natal kick for this event.
- [x] L388: This should be stated clearly at the start of section 3!
- [x] L480-L496: These statements ('never') are very strong. Since in this mass range the authors invoke fallback kicks, it's believable. I would be very skeptical however for any conclusions made about low-mass events where SN kicks could play a role.
- [x] L497-L510: Recommend presenting this validation test early in section 3, before detailed discussion of hyperparameter posteriors.
- [x] L 559-561: should discuss the cutoff applied before L519 where a claim is made about what the sample means.
- [x] L576: More efficient how/why? Compared to what?
- [x] L581: Persistent comment about Fig 4, particularly given lack of discussion of significance (and not including rate constraints overall between all events -- no guarantee the models in these posteriors have consistent counts).
- [x] L592: See (E) above. Can be misread as "the authors allow for each merger to be drawn from its own realization of the physical universe".
- [x] L585-591: Sharper contextual framing will help. For example, 'study the distribution of hyperparameters, a thoroughly explored approach for phenomenological models but rarely employed for detailed binary evolution models'. The authors' contributions are more unique in context than overall (i.e., many previous/other works have done this with phenomenology).
- [x] L594-L612: See (C) above - the authors' pullback doesn't involve selection at all right now. Add phrase, 'and added the appropriate Poisson factor to account for selection effects' after 'once we have pulled back ...space'.
- [x] L614 : 'for the first time' : not true, see above eg (A); remove.
- [x] L682: should mention previous grid-based works here, see (A) and mention Andrews 2021, and highlight why/how you've improved on them.
Data Editor's review:
- [x] One of our data editors has reviewed your initial manuscript submission and has the following suggestion(s) to help improve the data, software citation and/or overall content. Please treat this as you would a reviewer's comments and respond accordingly in your report to the science editor. Questions can be sent directly to the data editors at data-editors@aas.org.

The authors should use the Zenodo DOI instead of its url. This will make the linking more robust in the future and discoverable.

kazewong / BackPop

Referee report #19