Software Paper: Statement of need

Re. openjournals/joss-reviews/issues/6722#issuecomment-2104775251

[x] The Statement of Need feels like it describes the TREvoSim software as a whole, rather than the Need for features that are new in v3. (This also holds for other aspects of the paper.) Perhaps @fboehm could comment as to which of these end members JOSS anticipates?

phylogenetic simulations are conducted

[x] It's not quite clear what a "phylogenetic simulation" is here; do you have in mind the simulation of character data, a tree (which is itself a form of data), or both?

such as birth-death models or randomly generated data

data generated are different in a number of ways to those created using a stochastic model

[x] TREvoSim uses stochastic / random processes to generate data; its data are not deterministic. These sentences do not do enough to distinguish [data simulated within] TREvoSim from other approaches.

a level of model misspecification resembling that expected from empirical datasets

[x] I can see what you are driving at here, but it might be worth spelling out to a reader why this is a desirable feature – if I read this without my brain engaged, I might think that you were advertising that the data are bad because they are a poor match to common models (rather than suggesting that current models are bad because real data don't fit their assumptions).
[x] More substantively, it would be nice to somehow demonstrate that the nature of misspecification is similar to and equivalent to that of empirical datasets – though I'm not sure how one would go about that. Your text implies that TREvoSim data are a good proxy for real data because they are both an equally bad fit to existing models, but if the data are a bad fit to the models for different reasons, I don't think it follows that they are equivalent. And there are other ways to generate misspecified data – e.g. to use the GTR model to simulate data and the Mk to analyse it.
I think what's really at stake here is whether TREvoSim data is more "realistic" than data simulated by other approaches, and the data validation offers a good test of this. (It would be interesting to compare the properties of TREvoSim simulations with data simulated by other means).

(true) phylogenetic trees and character data are an emergent property of the simulation

[x] This to me is the unique selling point of TREvoSim, and the SoN could do more to emphasize how this allows a fundamentally different nature of question to be addressed (and perhaps to lead with this).

Thanks for all the comments @ms609 - they are really helpful, and indeed, have been the basis of a lot of thought since you have posted them. I have distilled those thoughts into a rewritten statement of need, which I hope addresses them, and clarifies numerous aspects of where I think (hope) TREvoSim may be useful. Specific points:

-- The Statement of Need feels like it describes the TREvoSim software as a whole

Indeed, this is as intended. Given this is the first paper solely dedicated to TREvoSim, it felt appropriate to take this approach, so hopefully JOSS and @fboehm are OK with this!

-- It's not quite clear what a "phylogenetic simulation" is here; do you have in mind the simulation of character data, a tree (which is itself a form of data), or both?

I think the new statement is clearer on the kinds of data I'm trying to talk about throughout - usually both trees and characters.

-- TREvoSim uses stochastic / random processes to generate data; its data are not deterministic. These sentences do not do enough to distinguish [data simulated within] TREvoSim from other approaches.

This was poorly phrased, and my thoughts were not clearly expressed on the page. Apologies. What I was getting at is that an important difference between TREvoSim and other approaches is that it has stochasticity, but this is filtered through selection, whereas e.g. random data or a birth-death model don't have this additional filter.
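
The distinction between unfiltered randomness and randomness filtered through selection can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not TREvoSim's algorithm: the fitness function (number of sites matching a fixed target), the truncation selection scheme, and all names and parameter values (`fitness`, `mutate`, `evolve`, `rate`, `pop_size`) are hypothetical and chosen only for illustration.

```python
import random

def fitness(genome, target):
    """Hypothetical fitness: number of sites matching a fixed target."""
    return sum(g == t for g, t in zip(genome, target))

def mutate(genome, rng, rate=0.05):
    """Flip each bit independently with probability `rate` (the stochastic step)."""
    return [1 - g if rng.random() < rate else g for g in genome]

def evolve(target, rng, pop_size=20, generations=50):
    """Random mutation followed by selection: only the fitter half of offspring survive."""
    population = [[rng.randint(0, 1) for _ in target] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(2 * pop_size)]
        offspring.sort(key=lambda g: fitness(g, target), reverse=True)
        population = offspring[:pop_size]  # the selection filter
    return population

def mean_fitness(population, target):
    return sum(fitness(g, target) for g in population) / len(population)

rng = random.Random(1)
target = [rng.randint(0, 1) for _ in range(32)]

filtered = evolve(target, rng)                                          # stochastic + selection
unfiltered = [[rng.randint(0, 1) for _ in target] for _ in range(20)]   # purely random data

print(mean_fitness(filtered, target), mean_fitness(unfiltered, target))
```

Both data sets are generated by random processes, but only the first has passed through a selective filter, so its character states are structured by the (toy) environment rather than being independent draws.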

-- More substantively, it would be nice to somehow demonstrate that the nature of misspecification is similar to and equivalent to that of empirical datasets / I think what's really at stake here is whether TREvoSim data is more "realistic" than data simulated by other approaches, and the data validation offers a good test of this. (It would be interesting to compare the properties of TREvoSim simulations with data simulated by other means).

This is a really nice point, and I don't think my statement did a good job of getting across my take on this. The accuracy/naturalism/realism of TREvoSim data is not a message I would particularly like to push too hard. I hope that by incorporating - to quote a new addition to the paper in light of this series of points - "key elements of biological evolution (reproduction, heritability and mutation)", its data may be more comparable to empirical data than if it did not. However, I want to be careful about claims of accuracy because there is so much we haven't quantified yet in empirical data! Rather than implying this is necessarily a good proxy for real data, I tend to think of it as striving towards a potentially more suitable one than the alternatives. A parallel (if we knew the true nature of our empirical data) would be aiming for less inaccuracy as opposed to a high level of accuracy. If we use models that are similar to generate data and then infer trees, we might expect (I think, correctly) those to perform better than they would in a situation where the generating model is significantly different, no matter what those specific differences are. Hence TREvoSim provides one option (alongside others you mention) for creating such data that we do expect to be mismatched. That can still be better than using a model without the mismatch, even if it is not a brilliant proxy for real-world data. Thus I don't think the difficulty in validating all aspects of simulated data against real-world data invalidates its use in this kind of context.

On the basis of your comment, I don't think I had communicated this particularly well (and indeed, it could be that I had not thought this through in as much depth as I needed to be able to clarify these points until your comment forced me to do so - thanks!). This is possibly because I do think TREvoSim is nice (and I hope I'm not wasting my time on writing it) because it incorporates the key biological concepts I mentioned before, and I wasn't clear enough that it is hard to then assess whether this makes it a better proxy for empirical data. I have rewritten all of the appropriate parts in the statement of need to try and better communicate these themes.

I do note that down the line, I would love to do a more thorough investigation of the nature of data simulated under different models, and comparing all of that to empirical data (but I don't think this is the venue for that). If you would be interested in this too, we should talk about it further sometime.

-- This to me is the unique selling point of TREvoSim, and the SoN could do more to emphasize how this allows a fundamentally different nature of question to be addressed (and perhaps to lead with this).

I have restructured to lead with this, thanks.

Hi Russell, glad that the comments have been useful. The revised manuscript is much improved. A few follow-up comments:

with three masks in a given environment, the first bit may be 1,0,0 for masks 1,2 and 3 respectively – when this option is enabled, that may be moved to bit three between the first and second environment, and this is repeated for all sites.

I think I follow this, after a few readings – I think this is saying that when the option is enabled, <the pattern 1, 0, 0> may be <assigned?> to in the second environment. In general, I often find that the subject of 'this' or 'that' can be unclear, so it may be worth specifying more explicitly in the text.

I initially misread the sense due to the coincidence between "bit three" and the fact there are three masks listed. As the bit chosen in the example is arbitrary, perhaps choose an integer that hasn't been used for something else (bit 7, say). Using the same terminology ("first bit / bit 1" vs "seventh bit / bit 7") to refer to bit position might also avoid confusion.

Perhaps rephrasing to "The first bit in masks 1, 2 and 3 may be 1, 0, and 0" would also be clearer, as I got confused when squaring the singular bit with the three entries 1, 0, 0.

What is not clear to me from this description is whether shuffling the site patterns in the environment differs in any tangible way from shuffling the site patterns in the observed genomes at the end of the simulation. Functionally, I don't see how the 'match peaks' is intended to differ from simply performing n replicates of an identical starting environment.

[The stochastic layer] removes direct control of fitness from the organism genome

This feels slightly too strong; the organism genome is still the primary control on fitness – i.e. fitness does not change unless the genome does – but the strength of the relationship is less prominent. I feel like another half-sentence spelling out many-to-one mapping would help the reader here; what I understand from this is that the stochastic layer is something akin to a codon table where several distinct genome entries may correspond to an equivalent fitness level.
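
The many-to-one mapping described here can be sketched as a lookup table, much like the codon-table analogy. This is an illustration under assumptions rather than TREvoSim's implementation: the four-bit chunk size, the random table, and the names `make_layer` and `translate` are all hypothetical.

```python
import random
from itertools import product

def make_layer(rng, chunk=4):
    """Hypothetical stochastic layer: every possible chunk of bits maps to 0 or 1."""
    return {bits: rng.randint(0, 1) for bits in product((0, 1), repeat=chunk)}

def translate(genome, layer, chunk=4):
    """Translate a genome chunk by chunk into the bit string used for fitness."""
    return [layer[tuple(genome[i:i + chunk])] for i in range(0, len(genome), chunk)]

layer = make_layer(random.Random(0))

# Group chunks by the bit they map to. With 16 possible chunks and 2 outputs,
# at least one output must be shared by several distinct chunks.
by_output = {}
for bits, out in layer.items():
    by_output.setdefault(out, []).append(bits)
shared_output, chunks = max(by_output.items(), key=lambda kv: len(kv[1]))

# Two different genomes built from chunks that share an output.
genome_a = list(chunks[0]) * 2
genome_b = list(chunks[1]) * 2

print(genome_a, "->", translate(genome_a, layer))
print(genome_b, "->", translate(genome_b, layer))
```

In this sketch the genome still determines the translated bit string, and hence fitness, but distinct genomes can translate to the same string, so some mutations change the genome without changing fitness.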

I would love to do a more thorough investigation of the nature of data simulated under different models, and comparing all of that to empirical data

This would certainly be interesting to explore; the challenge I suspect will come in defining the properties of empirical data in a quantifiable way. Mulvey et al. have a preprint that makes an important contribution here, and I think there'd be mileage in continuing the exploration of potential test statistics.

Thanks for the continued feedback on this front @ms609 - it is, as ever, appreciated.

-- I think I follow this, after a few readings – I think this is saying that when the option is enabled, <the pattern 1, 0, 0> may be <assigned?> to in the second environment. In general, I often find that the subject of 'this' or 'that' can be unclear, so it may be worth specifying more explicitly in the text.

I've tried to clarify this further by consistently referring to sites, rather than "bits" when the former is what I actually mean. This was sloppy terminology on my part, and I hope the distinction I have made makes it easier to follow.

-- What is not clear to me from this description is whether shuffling the site patterns in the environment differs in any tangible way from shuffling the site patterns in the observed genomes at the end of the simulation. Functionally, I don't see how the 'match peaks' is intended to differ from simply performing n replicates of an identical starting environment.

Shuffling the patterns in the environment will ensure that the genome(s) capable of peak fitness differ between environments at the start of the run, but that they will have the same absolute value in terms of fitness. As such, the hope is that in the early stages of a simulation, the evolutionary trajectories of the early population will not be weighted towards adapting to just one of multiple environments. What happens at the end of the simulation is the result of the evolution within the system as the simulation has run, so essentially we can try and start assessing evolutionary outcomes based on the adaptive landscape at the start of a simulation with a reasonable hope that the early evolution, at least, is representative of that adaptive landscape (it proved difficult to keep up the matched peaks approach without lots of user intervention). So if more environments == flatter landscape, however you want to define that, we can maximise the chances of the outcomes representing that, rather than just the population adapting to the environment which has the highest possible fitness peak.
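
The matched-peaks idea can be checked on a toy example: moving site patterns (columns across masks) to different positions changes which genome is optimal, but not the height of the peak. This is a sketch under assumptions, not TREvoSim's code: the fitness function (total number of bits matching across all masks), the example masks, and the names `fitness`, `permute_sites` and `peak` are hypothetical, and a fixed permutation stands in for a random shuffle so the result is reproducible.

```python
from itertools import product

def fitness(genome, masks):
    """Hypothetical fitness: total number of bits matching across all masks."""
    return sum(g == m for mask in masks for g, m in zip(genome, mask))

def permute_sites(masks, order):
    """Move whole site patterns: new site i takes the pattern of old site order[i]."""
    return [[mask[i] for i in order] for mask in masks]

def peak(masks):
    """Exhaustively find the peak fitness and the genome(s) that achieve it."""
    genomes = list(product((0, 1), repeat=len(masks[0])))
    best = max(fitness(g, masks) for g in genomes)
    return best, [g for g in genomes if fitness(g, masks) == best]

# Environment 1: three masks, four sites. Site 0 has the pattern 1, 0, 0.
env1 = [[1, 1, 0, 1],
        [0, 1, 0, 1],
        [0, 1, 1, 0]]

# Environment 2: the same site patterns, moved to different positions.
env2 = permute_sites(env1, order=[3, 0, 1, 2])

peak1, genomes1 = peak(env1)
peak2, genomes2 = peak(env2)
print(peak1, genomes1)  # 9 [(0, 1, 0, 1)]
print(peak2, genomes2)  # 9 [(1, 0, 1, 0)]
```

Under this toy fitness function, each site contributes independently, so the peak value depends only on the set of site patterns and not on their positions. That is why the two environments have equal peaks reached by different genomes.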

-- This feels slightly too strong

Yes, I guess this depends on your reading of the word "direct". I have reworded to try and be more explicit as you suggest!

-- This would certainly be interesting to explore; the challenge I suspect will come in defining the properties of empirical data in a quantifiable way. Mulvey et al. have a preprint that makes an important contribution here, and I think there'd be mileage in continuing the exploration of potential test statistics.

Indeed, there is so much in empirical data that we don't have a full - or even partial - handle on. Thanks for the link to this preprint, it looks like important reading!

Great, the match peaks now makes sense, and your explanation has helped me to get my head around the motivation.

I can follow §Stochastic Layer now, but it's still a little obscure; "to create a 1 bit in the genome calculation" is still a slightly tricky phrasing to understand, though I can see what you are getting at. And there's a couple of typos in "whilst are 1100 may mao to 0".

Thanks for catching the typos! Yeah, I have struggled with how to make this as clear as I would like - I've tried with the above push to define many to one mapping when first used, and then given the example in the hope this makes it a little clearer!

palaeoware / trevosim

Software Paper: Statement of need #55