Making numbers in believe both conformant and fast

masak commented 5 years ago

This is the issue I hinted I would submit at https://github.com/luksamuk/believe/issues/6#issuecomment-545216093. I confess I've spent way too much time thinking about solutions to this challenge, both for this Bel implementation and for my own.

Per the guide,

Now we come to the code that implements numbers. Before we continue I'd like to remind the reader of the principle that when something is defined in the Bel source, the goal is to specify what it means and not to provide an efficient implementation.

In other words, we read the source for correctness. An implementation, however, shouldn't implement this directly but instead cheat creatively.

Base Proposal

Instead of directly pointing to a car and cdr part, I suggest Bel_pair consist of two function pointers that can return Bel values for car and cdr. Expressed as a Java interface:

interface Bel_pair {
    Bel car();
    Bel cdr();
}

C doesn't have interfaces, but the struct inheritance idiom accomplishes the same. We "program to the interface", that is, all consumers of Bel_pair go through the Bel_pair struct and its two function pointers, and know nothing else about the data layout.

I suggest these two "implementations" of Bel_pair:

"Regular" pairs
"Aggregate numbers"

Aggregate numbers only provide a facade that presents as a pair, that allows destructuring operations into it, etc. But they also hold extra data about the number itself. An aggregate number for the Bel number 5 would hold an int with the value 5.

(The interface idea can also be used to optimize a whole host of other types in the language. For example, functions can carry around a precompiled optimized version of itself which can be used in most circumstances. Arrays can be allocated as a single block and provide constant (rather than linear) random access. Queues can be implemented as real queues or deques, and provide similar access speedups.)

It's possible it'd be better for there to be many different types of aggregate numbers: unsigned integer, rational, signed rational, complex — following the Bel spec itself. Here I will write of them as a single aggregate number type, for simplicity.

Hijacking the built-in numeric operations

The main benefit of this is that we can now intercept any call on operations like +, check if its arguments are all aggregate numbers. In that case, we can skip the slow-path spec implementation of + and just add the numbers as C ints, and return a new aggregate number.

This is expected to be the typical case.

Also recommended that both the Bel reader implemented in C and (later) the Bel reader implemented in Bel be convinced to spit out aggregate numbers by default.

Heterogenous operations

Despite best-effort attempts to intercept the creation of numbers and turn them into aggregate numbers, a user could always build a number "by hand", and it'd not be turned into an aggregate number.

It'd be theoretically possible to check at join-time, but I'm wary of suggesting this. Instead, we should just be ready for getting a mix of numbers and aggregate numbers to + and friends. They should then "upconvert" the numbers to aggregate numbers and proceed as usual.

Doing an upconversion on a number is presumably slow, since we have to traverse down ridiculously long lists of ts. However, something already went to the trouble of building those lists, which was also slow, and we're just adding the same type of linear const once more.

De-optimization

In two situations, the aggregate number has to destroy itself and replace itself with the actual list structure it emulates:

When an internal pair is modified with xar or xdr
When an internal pair is compared with id

None of these situations are very likely, let alone very sane. But a conforming Bel implementation must support this case. A first cut at an implementation can simply crash in protest; later ones can do the replacement. It won't be fast, but it will be correct.

luksamuk commented 4 years ago

I was thinking about this right for a few hours. I'm not entirely sure about the interface-like idea...

all consumers of Bel_pair go through the Bel_pair struct and its two function pointers, and know nothing else about the data layout.

Function pointers are a no-no for me. The idea is to keep the implementation as simple as possible, and function pointers are one of the aspects which I'd rather not use at all. That is pretty much the reason why this software is being written as a literate program (though I must be fair, it lacks contribution guidelines explicitly saying all of that right now).

Plus, if we store car and cdr in Bel_pair as function pointers, deeper structural changes would need to be made. And not only that; since this is a literate program, explaining this "hack" on the prose parts would be... troublesome.

However, everytime I handled the car and cdr of any list, I used the bel_car and bel_cdr functions. How about making changes directly to those instead? This way, it would be only a matter of handling the special case on those functions.

The result of (type 5) returning the symbol number definitely has to go, but that can also be managed by handling the special case on the pertinent functions (which I still have to isolate. Adjustments will likely be made to bel_prim_type, but also to internal predicates such as bel_pairp).

A fix by changing bel_car and bel_cdr would involve converting directly a number to a literal, which are built-in and specified in the Bel spec:

Numbers in Bel take the form

(lit num (sign n d) (sign n d))

where the first (sign n d) is the real component and the second the imaginary component. A sign is either + or -, and n and d are unary integers (i.e. lists of t) representing a numerator and denominator.

So 2/3 is

(lit num (+ (t t) (t t t)) (+ () (t)))

and 4-1/2i is

(lit num (+ (t t t t) (t)) (- (t) (t t)))

Building the lists of t and this "coercion into a literal" itself would be as trivial as implementing a few more functions to do the job.

Other interesting thing is that we could also handle the (lit num ...) case in the bel_eval function as an "internal" special form; checking for a number literal is a very simple task, as well as doing a "reverse coercion" into a proper internal number as well. Since Bel specifies that valid numbers are printed as... well, numbers, that seems to be an option.

Plus, all arguments are evaluated during a function application, so this would work fine:

→ (+ (lit num (+ (t) (t)) (+ () (t))) (lit num (+ (t t) (t)) (+ () (t))) ) 3

...that is, if the number is valid -- otherwise, we just return the plain literal list, which would be an "invalid" number:

→ (lit num blah) (lit num blah) → (+ (lit num blah) 3) Error: (lit num blah) is not a number

This seems to fit perfectly, since performing any operations in "non-numeric literals" should also be invalid anyways. Changes would have to be made to bel_numberp and bel_number_list_p to make that work.

xar and xdr would need extra attention because they have to check for the special case as well, but that's manageable: we convert, change the pertinent parts, and attempt to return a re-conversion into a number (which can fail, of course, but in that case the plain new literal is returned).

As for id, I am not entirely sure; maybe converting all numbers to literals by default when checking for identity in bel_prim_id would be a sane option.

There is an extra bonus for this approach: the numerator and denominator in Bel are plain integers, but they don't have to be like this in Believe. However, if one just goes ahead and uses car, cdr, xar and xdr to mess with an "optimized" number, the conversion from "number" into "number literal" can automatically coerce the numerators and denominators into integers -- in fact, this coercion is already implemented. So we get the best of both worlds.

masak commented 4 years ago

Function pointers are a no-no for me. The idea is to keep the implementation as simple as possible, and function pointers are one of the aspects which I'd rather not use at all.

Yes, that makes sense. Function pointers have a high complexity penalty. I understand the desire to avoid them if possible. Even from a C language perspective it makes sense; C is more geared towards static, compile-time knowledge. Making things dynamic with function pointers moves the code out of C's sweet spot. Might even have an impact on the error messages the compiler is able to give, etc.

However, everytime I handled the car and cdr of any list, I used the bel_car and bel_cdr functions. How about making changes directly to those instead? This way, it would be only a matter of handling the special case on those functions.

It makes sense. By the way, this trade-off that you're deliberating is famous, and known as the expression problem — in short, do you associate differing behavior with the data (as function pointers would) or with the operations (as per your suggestion). I believe that c2 page and the one about switch statement smell is good reading in the context of this issue. (Not suggesting that there's a right answer here. That's the thing, it's a trade-off.)

xar and xdr would need extra attention because they have to check for the special case as well

Case in point. :smile: But it's not so bad, I think. What's interesting is the cost of doing this not just for numbers, but for some other type that can be optimized, later. But that could be premature design on my part.

As for id, I am not entirely sure; maybe converting all numbers to literals by default when checking for identity in bel_prim_id would be a sane option.

Not sure I understand. Would this guarantee that if you compared a number against itself, (id n n), you'd always get t?

luksamuk / believe