red / REP

Red Enhancement Process

Math functions should support percent! #151

Closed gr4xity closed 1 year ago

gr4xity commented 1 year ago

Non-linear transformations of percent-based values are common in scientific and business applications.

For example, a probability is a percent! between 0% and 100%. Log-probability models are common because the log-probabilities of independent events are additive. Likelihoods are joint probabilities, and log-likelihoods are the basis of implementation for many statistical algorithms. The logit transform, the difference between the log of the probability and the log of the alternative probability, is the basis for most statistical inference around binary choice.
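As a minimal sketch of the two claims above, the following uses Python rather than Red, purely to illustrate the arithmetic. The function name `logit` and the sample probabilities are assumptions chosen for illustration, not part of the proposal.

```python
import math

def logit(p: float) -> float:
    """Log-odds: log(p) - log(1 - p), defined for 0 < p < 1."""
    return math.log(p) - math.log(1.0 - p)

# Probabilities of three independent events (illustrative values only).
probs = [0.5, 0.25, 0.8]

# The joint probability is a product, so its log is the sum of the logs.
joint = math.prod(probs)
log_joint = sum(math.log(p) for p in probs)

print(math.isclose(math.log(joint), log_joint))  # True
print(logit(0.5))  # 0.0: even odds
```

Note that both transforms map a value on the probability scale to an unbounded real number, which is why a float result is the natural return type.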

While probabilities are bounded within [0%, 100%], that's not true of percentages in general--reflected in Red's percent! type being unbounded as well. Percentage growth rates can generally exceed 100%, and negative growth rates indicate exponential decay. Exponential growth and decay problems are common across biology, physics, computer science, economics, and finance, and are key to many basic business applications including compound interest and return on investment.
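To illustrate growth rates above 100% and negative rates, here is a small Python sketch of per-period compounding. The helper names `compound` and `periods_to_reach` and all figures are hypothetical, chosen only to show where the power and logarithm functions enter.

```python
import math

def compound(principal: float, rate: float, periods: int) -> float:
    """Value after compounding at `rate` per period (rate may exceed 1.0 or be negative)."""
    return principal * (1.0 + rate) ** periods

def periods_to_reach(multiple: float, rate: float) -> float:
    """Periods needed for a value to change by the factor `multiple` at `rate` per period."""
    return math.log(multiple) / math.log(1.0 + rate)

print(compound(100.0, 1.5, 2))   # 150% growth per period, two periods: 625.0
print(compound(100.0, -0.5, 3))  # -50% per period (decay), three periods: 12.5
print(periods_to_reach(6.25, 1.5))  # approximately 2
```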

To take advantage of these facts clearly and cleanly without unintended error, Red's mathematical functions should be restored to allow nonlinear transformation of percent! types with mathematical functions including exp, the various logarithms, square-root, etc.

As a data scientist who frequently uses nonlinear transformations of probabilities for statistical modeling and nonlinear transformations of growth rates for simulation and forecasting, I consider this issue critical for making Red useful in these domains.

Trigonometric functions are a separate issue, but there are applications that treat percents as degrees out of 360.

Simply put: percent! is a number!

This change seems to have been made to non-trig functions without regard to mathematical applications in science and business. Nor was the change made consistently--the power function for example still supports percent! in both the number and exponent arguments, thankfully!

The need for nonlinear transforms of percent! values using exp and the various logarithm functions is even more pressing than the power transform!

Rebol 3 for example handled this correctly:

exp 5% == 1.05127109637602

exp -5% == 0.951229424500714

hiiamboris commented 1 year ago

As I understand this REP primarily concerns log-e log-2 log-10 exp power square-root functions and their aliases. To clarify, what is the proposed semantic exactly? That these functions accept percent but treat it as float and return a float? Because it doesn't make sense e.g. that exp or log of percent quantity would return percent.

Also what's missing here is a practical demonstration of the importance of such a change. How much does it simplify the code? Because the proposal can be countered by simply suggesting converting the argument to a float with 1 * x or to float! x, or wrapping natives:

>> native-exp: :exp
>> exp: func [{Raises E (the base of natural logarithm) to the power specified} value [number!]] [native-exp to float! value]
>> exp 10%
== 1.105170918075648

A case should be made why such suggestions do not solve the task at hand.

greggirwin commented 1 year ago

justinthesmith (Justin the Smith): I don't think hiiamboris was being rude in any way in the gitter chat about this. We are a global community and have many different interaction models and first languages.

When you say that it's not what funcs return, but what they accept, remember that in Red the type returned is often cast to the arg type (or left hand type, where changing arg types and order can affect results and reasoning about code). exp on ints necessarily returns a float, but R3 returns a float when given a percent. Why not return a percent? As @hiiamboris notes, we have to ask if this makes sense and look at how it's used.

To the point of use, having many datatypes is great, and we should leverage them as much as possible. But their uses can also be more domain specific than general, in real world scenarios. So we should find examples of where this feature would help, and design accordingly. It may be that the best solution for end users is not to support this feature generally, but to write some helpers that make it easy to use correctly in the domains where it applies.

gr4xity commented 1 year ago

Nonlinear transforms are valid operations on percents. Not only valid, but essential for countless applications across almost every human endeavor. I've shared some examples already. Here are some broader links on transforms.

https://www.kaggle.com/code/ohseokkim/linear-nonlinear-scaling
https://medium.com/analytics-vidhya/advance-nonlinear-variable-transformations-ecedf3f8709e
https://stattrek.com/regression/linear-transformation

The proposed semantic is simple: all mathematical functions should take percent! as inputs like any other number! and return a float! value. The behavior of R3's math functions on percent! is correct.

It's not only perfectly fine to return a float in these instances, that's the expected behavior. It wouldn't make sense to return a percent because the purpose of these nonlinear transformations is to take a percent on a probability or growth rate scale and transform it to a different numeric scale entirely. This is critical context that was missing from the original discussion.

For example, exponentiating a percentage growth rate returns a unitless number representing a multiplicative scaling factor. Not a percent!
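A short Python sketch of this point, using the 5% and -5% rates from the opening post. The name `growth_factor` is an assumption for illustration, and the rates are written as plain floats because Python has no percent type.

```python
import math

def growth_factor(rate: float, time: float = 1.0) -> float:
    """Unitless multiplier produced by growing continuously at `rate` for `time`."""
    return math.exp(rate * time)

up = growth_factor(0.05)     # about 1.0513
down = growth_factor(-0.05)  # about 0.9512

# The result scales a quantity; it is a multiplier, not a percentage.
print(1000.0 * up)

# Growth followed by equal decay cancels out, up to rounding.
print(math.isclose(up * down, 1.0))  # True
```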

The prior change broke expected mathematical behavior without any external references or engagement with a larger context. As you said in that previous thread, Gregg: "If we later add it back, after learning more, still nothing breaks."

dockimbel commented 1 year ago

> As I understand this REP primarily concerns log-e log-2 log-10 exp power square-root functions and their aliases.

On a first, quick look, the rationale presented seems sound to me. Also, such a change does not break anything, so it's harmless. End users could create native wrappers to achieve the same behavior, though simply allowing percent values to be used is easier and cleaner. As for the different datatype returned, given that only people for whom such math operations make sense would use it, the exception to the general rule should not harm anyone. We do want to make Red a good tool for data scientists too, don't we?

hiiamboris commented 1 year ago

I agree.

dockimbel commented 1 year ago

Implemented in new branch: https://github.com/red/red/compare/rep-151?expand=1

gr4xity commented 1 year ago

Beautiful, thank you so much!

There's a huge opportunity in data science and data engineering for something like Red.

Interpreted Python is over-extended and its zero-based indexing makes implementing vector and matrix-based models a risk. More Lisp-like Julia failed to gain ground because its forced compilation made start-up times impractical.

State-of-the-art metaprogramming, data modeling, and transformation tools are awkward proprietary mishmashes of SQL, Python, JSON, and YAML that get in the way as often as they help.

https://docs.getdbt.com/docs/use-dbt-semantic-layer/quickstart-sl
https://cloud.google.com/looker/docs/what-is-lookml

Ick.

dockimbel commented 1 year ago

@gr4xity Thanks for the insights! Could you elaborate on "it's zero-based indexing makes implementing vector and matrix-based models a risk."?

dockimbel commented 1 year ago

Changes merged into master branch.

hiiamboris commented 1 year ago

What of https://github.com/red/red/commit/4c42c5b45efc08de24cf9935d685fd6c9dfe287b#commitcomment-127283534 ?

gr4xity commented 1 year ago

> @gr4xity Thanks for the insights! Could you elaborate on "it's zero-based indexing makes implementing vector and matrix-based models a risk."?

Code implementing scientific models in Python is often riddled with +1 / -1 adjustments to translate between semantic meaning in the models and Python's 0-based offsets. Here's an example.

This creates numerous opportunities for off-by-one errors both over iteration and in application. For example, if the analytical solution you wish to report is in the 5th row of the 3rd column, you have to index for [4][2].
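The translation described above can be sketched in Python as follows. The 6x4 matrix and the 1-based accessor `at` are hypothetical, introduced only to show the adjustment and how a forgotten adjustment can return a wrong value without raising an error.

```python
# A 6x4 matrix as nested lists; each entry records its own 1-based (row, column).
matrix = [[(r, c) for c in range(1, 5)] for r in range(1, 7)]

def at(m, row, col):
    """Hypothetical 1-based accessor: row 1, column 1 is the first entry."""
    if row < 1 or col < 1:
        # Guard needed because Python would treat 0 - 1 as "last element".
        raise IndexError("row and col are 1-based")
    return m[row - 1][col - 1]

print(matrix[4][2])      # (5, 3): 5th row, 3rd column needs offsets [4][2]
print(at(matrix, 5, 3))  # (5, 3): same entry through the 1-based helper
print(matrix[5][3])      # (6, 4): forgetting the adjustment silently picks another entry
```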

This is one of the biggest hurdles for researchers adopting Python, as Python is the only major "scientific computing" platform that uses 0-based offsets instead of 1-based indexing the way matrix math in science does, creating major risks to validity of computational implementations.

So Python's numerical computing packages have to provide warnings and help guides for users coming in from Matlab/Mathematica/Wolfram/Julia/whatever. Even Fortran was 1-based!

Example: Numpy for Matlab Users:

> Python uses zero based indexing, so the initial element of a sequence has index 0. Confusion and flamewars arise because each has advantages and disadvantages. One based indexing is consistent with common human language usage, where the “first” element of a sequence has index 1. Zero based indexing [simplifies indexing].

But "simplifying indexing" is simply not a "problem" that scientific researchers have building computational models! This only creates friction and bugs that can be difficult to identify and fix. You may get an error when a loop iterator fails on an invalid index, but what if you errantly pull the wrong value from a matrix for application? No compiler or interpreter can help with the logic errors induced by Python's irrational indexing.

Python-based packages for doing matrix-based modeling often have to repeatedly warn users about this issue throughout their documentation. For example:

> Note that the indexing of the entries starts at 0.
> Note that classes are indexed from 0 to n - 1
> Indexing starts at 0.

hiiamboris commented 1 year ago

Makes sense, mathematicians indeed count from 1.

But in the general case it's tradeoffs both ways. E.g. in 1-based indexing to get first item of every row you write y - 1 * width + 1. The core issue is not the type of indexing, but the fact of indexing itself. Index usage should be banned from high-level languages in favor of vector operations.
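As a sketch of this tradeoff in Python: the expression y - 1 * width + 1 is read here as (y - 1) * width + 1, on the understanding that Red evaluates operators left to right; Python needs the parentheses. The function names and the 3x4 row-major layout are assumptions for illustration.

```python
def row_start_one_based(y: int, width: int) -> int:
    """1-based position of the first item of 1-based row y in a row-major layout."""
    return (y - 1) * width + 1

def row_start_zero_based(y: int, width: int) -> int:
    """0-based offset of the first item of 0-based row y."""
    return y * width

width = 4
flat = list(range(10, 22))  # 3 rows of 4 items, stored row by row

# Zero-based rows and offsets: no adjustment needed.
firsts_zero = [flat[row_start_zero_based(y, width)] for y in range(3)]

# One-based rows and positions, converted to Python offsets for the lookup.
firsts_one = [flat[row_start_one_based(y, width) - 1] for y in range(1, 4)]

# A strided slice avoids explicit index arithmetic altogether.
firsts_slice = flat[::width]

print(firsts_zero, firsts_one, firsts_slice)
```

All three approaches select the same items; the slice is closest to the vector-operation style suggested above.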