Recalculate multiplication factor

ms609 commented 2 months ago

Per https://trevosim.readthedocs.io/en/latest/settings.html:

The multiplication factor required to achieve a set number of informative characters via subsampling depends on the settings of any given run. Selecting this option calculates and sets this factor empirically after the settings dialogue is closed by conducting a ten-run batch and working out the proportion of informative characters within those runs. This needs to be recalculated after any settings are changed. If it is not, or has not been set, the software will run with a large factor, and thus be slower than necessary.

I can see that this is a pragmatic approach, but as a general thought, I wonder:

how precisely can the "large factor" can be estimated – from the text I have the sense that the "large" factor is much larger than necessary, but are we talking about a 10× or 10% speedup by pre-calculating the factor?;
whether there might be mileage in allowing an option to "continue run until enough informative characters have been sampled" when "run to number of iterations" is selected?

From a user experience perspective, I wonder whether this option can be set apart slightly from the other settings, to make it clearer that it is not a setting per se, at least insofar as it will not affect the properties of the returned data. Conceptually it would be nice to replace the OK button with Confirm without recalculating factor + Confirm and recalculate factor, though I can see reasons not to do that – perhaps an \

above the box, or using a visually distinct tool such as an "on/off" toggle or radio button, would be enough to set it apart.

It would also seem logical to include the "Set factor manually" option in the dialog, rather than hiding it in a separate menu.

ms609 commented 1 month ago

Having now played with this a little, my experience is that for larger outputs (80+ taxa), calculating the factor takes hundreds to thousands of times longer than executing the run itself – and the factor is generally small enough (<2) that even a sizeable overestimate would still be faster than waiting for the estimation of the multiplication factor.

RussellGarwood commented 1 month ago

The above commits deal with this issue - thanks so much, @ms609, for taking the time to kick the tires of this system and report back on the overhead in terms of speed. This is something I should have done, but the system developed organically enough in the infancy of the software that doing so never struck me. I have addressed your comments as follows:

The default settings now have a factor or 5., which is a large enough factor to succeed under many, though I am sure not all settings
I have moved the "Recalculae strip uninformative factor" to the commands menu, so if it is ever required, it can still be used from there
When there are not enough characters, the warning highlighting the run has failed has been updated to point to both this new command option, and highlight that the factor can also be set manually there
I've modified the warnings about strip uninformative factor at simulation initiation to only occur when there is something very unexpected (i.e. a value below one, which can only be achieved by manually changing the factor in an XML file and loading those settings, so I don't expect those warnings will happen much).
I've also updated the software to reflect these changes to the functionality of the software
I expect to be changing the menu organisation down the line based on your other issues, so I expect some minor changes to the details of where to find these options down the line, but will update the documentation accordingly!

-- whether there might be mileage in allowing an option to "continue run until enough informative characters have been sampled" when "run to number of iterations" is selected?

This could be implemented without too much pain, but my gut feeling is that when a user is running something for phylogenetic methods studies (which is where I invisage stripping down to parsimony only informative characters is of utility), that having a constant terminal number is likely to be more important than not stripping characters. Happy to be convinced otherwise if you can see a clear use case though!

ms609 commented 1 month ago

The new changes look good.

Under Commands→Set uninformative factor, the default value in the text box is 1.00

would it make sense for this to match either the new default (5.0) or the current value (if one has been loaded from the settings file, or calculated from the current settings)? Indeed, the latter would have the benefit of allowing a user to see this value within TREvoSim, i.e. without saving the settings file and inspecting the XML.

ms609 commented 1 month ago

I note also that the warning message refers to "it seems that 5x is not enough", even if the user is not using the default settings but has specified a lower factor (manually, or via a settings file).

It might be nice (but might be tricky to implement?) if the warning gave an indication of how many informative characters the simulation ended up with compared to the target, and what the current factor was – then the user would have some inkling of what sort of "large number" might be appropriate for their current settings based on the output of the run.

RussellGarwood commented 1 month ago

Yes, good points. With the above commits, the set strip uninformative factor dialogue defaults to the current setting, and now the warning message also gives both the current number, and the number of informative characters that TREvoSim has managed to recover :)

ms609 commented 4 weeks ago

The new information box is now very informative and makes it very clear what the program has done and what the user can do to address this.

One thing that confused me is the connection between the "Number of characters" and that set in the settings.

In the settings I requested 512 informative characters, but the failure message states that 899 were recovered. I presume this means that 899 characters were computed (=512×strip uninformative factor of ~1.75), of which <512 (how many?) were informative. I suspect the 899 value is intended to list the number of informative characters, rather than the total number that were simulated?

palaeoware / trevosim

Recalculate multiplication factor #27