nci / scores

scores: Metrics for the verification, evaluation and optimisation of forecasts, predictions or models.
https://scores.readthedocs.io/
Apache License 2.0
67 stars 18 forks source link

Update binary discretise to use operator #739

Closed nicholasloveday closed 4 days ago

nicholasloveday commented 5 days ago

https://github.com/nicholasloveday/scores/blob/1485879a1b8171de7ca89cb124680c60c6cc198d/src/scores/processing/discretise.py#L97

Currently strings are provided as the mode arg. This should be replaced with the Python operator module.

Relevant comments from @tennlee

Part (but not) all of the reason for this is the fact the visually-identical strings can be different unicode codes, and not match. This is less common for symbols, but all too common for characters. See e.g. https://gist.github.com/StevenACoffman/a5f6f682d94e38ed804182dc2693ed4b . There are other reasons too, such as type handling on operators and static analysis. Many coders will use strings to indicate "mode flags", but more often than not, this is an indication that the code can be improved to be more reliable, more specific and often faster as a result. Sometimes, modes made sense, but it's always a hint to step back and think carefully about the code at hand. In this case, passing the function pointer is more specific in terms of the actual function being executed. operator.gt will have a specific set of semantics, whereas your mode flag could potentially vary from those semantics, so it conveys useful information to the developer to use the function pointer option. I have added some commentary to this point to try and illustrate some of the reasons for this comment. The comments reflect somewhat the "developer" perspective rather than the "sciencey" perspective, but there are reasons for it.

Collaborator @tennlee tennlee 16 hours ago Another reason, which does not apply in this case, is that user-supplied strings are frequent causes of unexpected bugs, and often security issues. This is because it's easy to embed unexpected content into user-supplied strings. The main security issue comes if the string is directly supplied from a potential attack surface (such as a web form field), and that's not relevant here. However, users sometimes pass-through strings from command-lines or user interface buttons, and so the cause of the error may be far from the interpreting code. String construction is also often done, where the user will manually join or process strings to come up with the argument. This is really quite error-prone usually. Requiring a function pointer will bypass a large number of these string processing failure modes, and will result in more reliable code for both libraries (us) and systems (user code).

tennlee commented 4 days ago

Fixed by #740