External area declarations

jgbm commented 5 years ago

Consider adding a new form of "external" area declaration, or some other way to introduce an area in a Habit program whose storage is provided elsewhere. This would provide support for something comparable to the following construct in LC:

external vram  = 0xb8000 :: Ref Screen

(The LC code actually uses the external f=g :: t mechanism to support this.)

fxdpntthm commented 5 years ago

Q: How do I set up a minimal test instance for this? A (1): http://web.cecs.pdx.edu/~mpj/pubs/bytedata.html A (2): https://github.com/zipwith/lc-baremetal

zipwith commented 5 years ago

It looks like the focus here has been on adding an external mechanism that is specific to area declarations. But the mechanism in LC is more general, and I think we'll need something comparable to the LC approach in Habit for the applications that we're starting to develop. Habit and mil-tools currently use primitive to mean different things; that's why the term external is used in LC, and is probably also why we're using that label here. But I think that the right place to add the more general version of this feature in Habit might well be in primitive declarations rather than area declarations. The example at the top of this issue might then be written as:

    primitive vram = 0xb8000 :: Ref Screen

More generally, we might have:

    primitive id = e :: ty

where id is used as an identifier of type ty within the Habit code, but will ultimately be implemented by the expression e. A key point here is that e does not actually need to have type ty (despite what the e :: ty portion of the syntax above might suggest). Instead, the type of e should have the same representation as ty. The vram example in LC works because Ref a values are ultimately represented as Word values, which matches the type of 0xb8000.

But the more general mechanism has also proved to be useful in other ways. For example, we might implement a toBits function for converting references into Word values by using something along the following lines:

   primitive refToBits = (\w -> w) :: Ref a -> Word

Two closing thoughts:

1) the syntax used here is awkward, not just because of the e :: ty issue mentioned previously, but also because we'd need to make sure e is an atomic expression to avoid parsing ambiguities. I'd be very happy to consider better notations. Maybe primitive id {e} :: ty or implement id :: ty using e, or ...

2) The examples above illustrate that this mechanism is inherently unsafe; we are only prepared to use 0xb8000 as a valid reference because the platform designers told us that was ok. But if we mis-type the constant value, for example, nothing will alert us to the error until we actually try to run the code. This is an unsettling feature for a language that values type safety guarantees. But it also seems unavoidable when you consider that what we are really working with here is a "foreign function interface". The only difference is that the "foreign" language is just a variant of the source language that exposes more details about low-level representations.

jgbm commented 5 years ago

I'm afraid I don't entirely follow.

My impression of the external areas was that, other than having a particular location in memory, they would be in all other ways like normal areas---i.e., that they might have initializers (depending on their types), would require ByteSize instances, and so forth. In this sense, assuming the back-end were clever enough not to use the same region of memory for two different things (say, by allowing an external area to live in the same space that it intended to use for the heap) there shouldn't even be any unsoundness. (Of course, if the user specified an impossible address---0, say, or something outside the valid range---that would probably go poorly...)

The refToBits example you've given seems a different thing to me. If our goal is simply to break type safety, we don't need new language features, we simply need to add

primitive typesWhoNeeds'Em :: a -> b

and have the back-end implement that as the identity function [1]. Of course, one could imagine using the latter to achieve the former; why not have

vram :: Ref Screen
vram = typesWhoNeeds'Em 0xb8000

but this seems like a very different thing; in particular, it wouldn't play nicely with any of the other area-specific features. (That said, it still wouldn't let you readRef a Ref (Stored (Int -> Int)), so there's some safety left... but it certainly would let you read garbage out of, say, a Ref (Stored (Ix 14)).) Perhaps it's sufficient for the few cases where we really need areas with specific locations?

[1] One could imagine calling this function something like unsafeCoerce instead, but that seems unfortunately anodyne.

zipwith commented 5 years ago

My comments may have pushed us into a language design discussion. Perhaps an "issue" like this isn't the best place for that? The "most significant bit" of my previous comment was to argue that we will need something more general than just support for external areas alone. I believe that the external construct mentioned in the initial message is a good candidate that also covers our current application for external areas, but I am open to other options.

Of course it is not our goal is to break type safety. But any program that includes either a primitive or an external area declaration could potentially be doing just that. (This includes, for example, anything that imports the prelude, and hence that potentially relies on the primitives defined there.) Each such declaration represents a promise that a named entity, defined elsewhere, will behave in a manner that is consistent with the type that has been declared for it. Bad things can/will happen if even one of those promises is broken. Declaring an external area for video RAM at address 0xB8000 is a reasonable thing to do on a traditional PC because that is the address that the hardware has been configured to use for interfacing to video. Or is it? Maybe I misread the data sheet. Maybe I mistyped the address as 0xB800, or 0xB8000, or 0x68000, or 0xd8000 (or maybe one of those was the correct address and I entered something else by mistake). Maybe the machine I'm running on doesn't actually have any video RAM hardware at any address. It's not just a matter of avoiding special addresses like 0 or the ranges covered by other memory areas: there may only be one valid address that can truly be considered as a value of type Ref Screen, and perhaps not even that. And if anyone writes code using a different address, then (a) there is no way for the compiler to detect the problem, and (b) there is a possibility that the resulting program might now have the ability to read/write from some supposedly protected memory region or to perform unintended memory-mapped I/O, simply by using the "video RAM".

We could have a similar problem with a declaration like primitive primNegate :: Unsigned -> Unsigned. When we write that, we need to trust that the associated implementation will be a pure function of the declared type. (It would be even better if it implements some kind of negation operation on Unsigned values...) But the compiler can't be expected to check this, and if the program is used with an implementation that doesn't satisfy this "contract", then we have no guarantees about what will happen because we have broken a key premise of the proof that "well-typed programs will not go wrong".

On the other hand, there is nothing inherently wrong with defining the coercion function that you've described as a primitive ... so long as you ensure that you will only run programs using that definition in environments where the coerce function is implemented in a way that honors its declared type. I can only think of one way to do that right now—(\x -> undefined)—and that clearly isn't going to be very useful in practice. But at least it is consistent with the declared type.

There is legitimate discomfort about the potential for misusing features like this in ways that could compromise fundamental properties of the language. But I do not know how to accomplish our practical goals without providing some way to access external functions. The external areas construct addresses one specific need, but although it is already dangerous, it does not provide all the functionality that we need.

The external id = e :: ty construct that I described has been implemented in LC and does seem to provide the necessary functionality. However, I am open to discussing other ways to meet those needs in Habit. If there is a better approach, I'll be all in. That said, it is probably worth identifying some of the features of the LC approach that might encourage us to consider it for Habit (or that might be useful in evaluating other options):

It satisfies the grep test, meaning that it should be possible to identify all the places in a given program that rely on external code like this with a simple text search of a program's source text. This would be a useful starting point if we were trying to audit a program for potential type safety violations. If all such declarations in a given program are in previously certified libraries, then our task will be much easier.
It allows us to write the implementation of primitives in the same file as their definitions, reducing the need to cross file boundaries or to understand details of linker operation.
It allows the implementation of primitives to be written in terms of types like Maybe, List, Unit, and tuples, that make sense in Habit, without having to figure out how to access and manipulate such values in another language. If the compiler introduces a different representation for any of these types, the implementation of the primitive and any places where it is used will automatically be updated to use the new representation. (In concrete terms, for example, I don't think we want to have to write one implementation of each primitive for use when the milc "b" pass is used, and another for when "b" is not used.)
It allows the optimizer to inline and optimize the code for a primitive at each point where it is used. For example, if a primitive is written in C and is expected to construct a Maybe value, then its implementation will potentially require some heap allocation. If the primitive is written in Habit, then it may be possible for the optimizer to fuse the code that allocates a Maybe value with code that consumes it, ultimately avoiding the need for allocation altogether.
Although it does not (and cannot) guarantee type correctness, there is still some checking that will raise a compile-time type error if the primitive and the implementation do not have compatible representation types. In principle, it might be possible to capture this in Habit source code with a suitably defined type class. In practice, it may not be desirable to add a dependency like this from Habit source code to details of back-end representation decisions. Even without a type class, the compiler can still do more to guarantee some level of compatibility than would be possible if the primitive were to be implemented in a completely different language.
Finally, although we do not have extensive experience with the proposed construct, we do have a working implementation in LC and a small but growing collection of use cases, mostly in the lc-baremetal repository, that show how the feature can be used in practice.

Again, I'm happy to explore other options, especially if they share some or all of the features above while also providing additional benefits. I'm less certain if this is an appropriate place to conduct that conversation; maybe we should transition to email?

habit-lang / alb

External area declarations #13