Updated naming conventions

rems-project / cn-tutorial

8 stars 8 forks source link

Updated naming conventions #84

Open bcpierce00 opened 2 months ago

bcpierce00 commented 2 months ago

This PR includes the original naming-conventions PR plus a bunch of extra tidying, polishing, and reworking in light of discussions. I believe everything is now consistent and the conventions are working pretty well. It is ready to merge IMHO.

bcpierce00 commented 2 months ago

@cp526 We are waiting for your confirmation before merging this, whenever you get back to work...

cp526 commented 2 months ago

That looks reasonable, but – without wanting to drag this revision out more – one point I'm not completely convinced about is requiring CN identifiers to be uppercase.

I see there's maybe some value in syntactically distinguishing C and CN variables for emphasising that C variables refer to (pure snapshots of) mutable objects whereas CN variables are immutable, but on the other hand CN specifications should (as much as reasonable) look natural to C programmers, who may want to follow the same naming conventions in their CN development as in their C code (possibly with project/organisation-specific naming conventions for identifiers).

A second concern is entangled with specification language design. Rust, Haskell, and OCaml all require lowercase variable names and uppercase constructor names. A benefit is that variables and constructors are syntactically distinct, so when the user misspells a constructor name in a pattern-match expression the compiler can detect this and not misunderstand it as a catch-all variable pattern (which would be a difficult-to-diagnose-bug). CN is currently fine in that regard even without an enforced restriction on the capitalisation of variable and constructor names, because CN constructors always take a parenthesis-enclosed list of arguments (Constr { bt1 arg1, ... btn argn }) using a (somewhat heavy-weight) record-like notation, which applies even when the constructor has no arguments. If we wanted to generalise CN datatype constructors so they can take arbitrary arguments, or no arguments, then we would need a different mechanism for syntactically distinguishing variables from constructors, and the capitalisation route Rust, Haskell, and OCaml take would be the obvious one. (But currently that's only a hypothetical problem.)

Either way, ignoring variables names, I see no benefit in requiring CN identifiers for types, record field names, or constructor argument names to be capitalised (though I may have missed some discussion about this), but would think that lower case names for these would better fit typical C code.

bcpierce00 commented 2 months ago

I'm pretty persuaded that CN variables should be uppercase -- the tutorial examples were unbelievably full of inconsistencies and confusions arising from different possible answers to the question, "When I Own a bit of heap structure from a C variable called x, what is the resulting CN value called?" I really want there to be exactly one standard answer (and X seems like the best one).

I care less about types, record fields, function names, etc.

Message ID: @.***>

PeterSewell commented 2 months ago

On Thu, 5 Sept 2024 at 20:05, Benjamin Pierce @.***> wrote:

I'm pretty persuaded that CN variables should be uppercase -- the tutorial examples were unbelievably full of inconsistencies and confusions arising from different possible answers to the question, "When I Own a bit of heap structure from a C variable called x, what is the resulting CN value called?" I really want there to be exactly one standard answer (and X seems like the best one).

Another thought about this: in the larger examples there will tend to be quite a lot of pure CN code, expressing the pure functional behaviour part of the spec - eg for pKVM the CN analogue of all of the executable-in-C spec here: https://github.com/rems-project/linux/blob/pkvm-verif-6.4/arch/arm64/kvm/hyp/nvhe/ghost/ghost_spec.c

Requiring all those CN term variables to be upper case seems quite noisy and unnatural to me. It also reminds me of the initial HOL choice to make in-the-logic identifiers upper case to be simply distinct from the SML identifiers, which I think they later regretted.

I care less about types, record fields, function names, etc.

Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/rems-project/cn-tutorial/pull/84#issuecomment-2332445642, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABFMZZQ67IGFSRCX7WYISRTZVCTORAVCNFSM6AAAAABNPGFDSSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMZSGQ2DKNRUGI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

thatplguy commented 2 months ago

What's the next step here? Continue with this PR as is, and try adopting the naming scheme @bcpierce00 proposed? Or is there an alternate proposal?

I expect we'll make more changes to the syntax over the next year, so there will be future opportunities to change the naming scheme as well, as painful as that is.

bcpierce00 commented 2 months ago

Renaming everything in the whole tutorial is a somewhat painful process, and it will get worse as the tutorial and reference materials get bigger, so it would be much better to reach agreement now if we can.

I think the proposals on the table are (1) go ahead with the PR as-is, or (2) try an experiment with a different scheme that "makes more things lowercase." Obviously, (2) would give us more information to guide people's preferences. But we are lacking an explicit proposal for precisely what the general rules should be and what guiding principles to use to settle edge cases.

Message ID: @.***>

thatplguy commented 2 months ago

Ok, that makes sense. Before we think about considering other schemes, let me recap to make sure I'm understanding correctly.

The main goal in adopting the uppercase naming scheme is to make a clear, explicit connection between CN-level things and C-level things, e.g.

for the C type, struct list_int, in CN we would have predicate List_Int and datatype List_Int
for the C variable sorted_nums, in CN we would have take Sorted_Nums = List_Int(sorted_nums)

The drawbacks are

Using uppercase names everywhere is clunky and not a common language convention.
It's not effective when C code also uses uppercase names.
Other languages use capitalization to syntactically distinguish constructors, which has some benefits as per @cp526's comment. (It's not clear to me that restricting case in this way is actually an option for CN, since C allows uppercase names.)
The naming scheme can still be confusing for C → CN lifting, e.g. the P in take P = Owned(p) IMO is a confusing name for the value pointed to by p.

Does that sound right?

thatplguy commented 2 months ago

I suppose it makes sense to list non-goals as well. Here's what we're not trying to accomplish with the uppercase naming convention:

Avoiding top-level name conflicts between C and CN. Library developers will already have a convention, and it would make sense to use the same convention for CN identifiers. E.g. the pthreads library prefixes pthreads_ to top-level identifiers, so CN top-level identifiers should use the same prefix (but maybe capitalized 🙂).
Syntactically distinguishing C vs. CN identifiers. If we're worried about developers getting confused by the type/provenance of an identifier, there are other approaches we should consider, like go-to-definition code navigation, showing the identifier type on hover in the IDE, etc.

bcpierce00 commented 2 months ago

Avoiding top-level name conflicts between C and CN. Library developers will already have a convention, and it would make sense to use the same convention for CN identifiers. E.g. the pthreads library prefixes pthreads_ to top-level identifiers, so CN top-level identifiers should use the same prefix (but maybe capitalized 🙂).

Syntactically distinguishing C vs. CN identifiers. If we're worried about developers getting confused by the type/provenance of an identifier, there are other approaches we should consider, like go-to-definition code navigation, showing the identifier type on hover in the IDE, etc.

In my mind (2) has been a "soft goal" of the naming conventions -- not to enforce a rigid separation, but to maintain a general convention that usually gives a clear hint whether someone is looking at a C thing or a CN thing.

But, indeed, I could give up on this if others disagree. The thing that seems crucial is offering a convention for naming analogous (heap) values in the C and CN worlds.

Message ID: @.***>

cp526 commented 2 months ago

From my perspective, maintaining a syntactic distinction between C things and CN things should be a non-goal. Indeed, in the implementation of the CN frontend we've put some effort to integrate C things more seamlessly into CN specifications (e.g. one can mention C variables without a quotation mechanism and "dereference" owned pointers in specifications).

It's not completely clear where exactly the line should be drawn anyway: in take v = Owned<int>(p), is v a CN thing or a C thing? The name v is introduced by CN, but the pointee of p exists in C just the same.

cp526 commented 2 months ago

A CN identifier that represents the state of a mutable data structure after some function returns should be named the same as the starting state of the data structure, with an _post at the end.

E.g., The list copy function takes a linked list l representing a sequence L and leaves l at the end pointing to a final sequence L_post such that L == L_post. (Moreover, it returns a new sequence Ret with L == Ret.)

From my understanding of earlier comments, this part of the scheme is one of the main motivations for using uppercase letters for CN identifiers -- so one can mechanically derive a good name for a CN variable introduced by a resource take binding (?).

A different option would be to use l_pre and l_post in place of L and L_post, or to number different "generations" of l as l1, l2, etc. Either of these would make the pre-state names a bit longer, but it would make for more uniform names in the presence of loops, which will need variable names such as l_inv or l3 or similar.

The text above is from the section 'For new code', but for code not written by us (the more interesting case), a deterministic naming scheme is trickier.

If we're not writing the C code, the mechanically derived name won't always be good, as in Cole's example; e.g. the linked list pointer from above (there l) might have been called p, and take P = LinkedList(l) would not be a great name for the mathematical list. Perhaps a type-based naming scheme, whereby CN identifiers are always prefixed with letters determined by their CN type (or the resource predicate whose output they bind) could work? As discussed on a previous call, this could actually be mechanically enforced by CN, as Lem does, which allows users to associate prefix naming rules with type definitions.
The close correspondence found in tutorial examples of (a) a C-type definition, (b) the associated CN resource predicate, and (c) the mathematical value represented by it may not be the common case in real-world examples (not that we've done enough of those yet), but perhaps that's obvious. In the examples we've done in CN, we've had resource predicates that "package together" ownership for different related bits of memory at different C-types and associated constraints; we'll also sometimes want resource predicates whose output does not include all the information represented in memory, for instance a page table predicate whose output is not table-shaped but a mapping from physical to virtual addresses (so a looser correspondence of datatype and resource predicate).

cp526 commented 2 months ago

Ok, that makes sense. Before we think about considering other schemes, let me recap to make sure I'm understanding correctly.

The main goal in adopting the uppercase naming scheme is to make a clear, explicit connection between CN-level things and C-level things, e.g.

for the C type, struct list_int, in CN we would have predicate List_Int and datatype List_Int

for the C variable sorted_nums, in CN we would have take Sorted_Nums = List_Int(sorted_nums)

The drawbacks are

Using uppercase names everywhere is clunky and not a common language convention.

It's not effective when C code also uses uppercase names.

Other languages use capitalization to syntactically distinguish constructors, which has some benefits as per @cp526's comment. (It's not clear to me that restricting case in this way is actually an option for CN, since C allows uppercase names.)

It would be an option if we needed it, because pattern matching can only ever introduce new CN variables. If CN restricted variables to lower-case names and constructors to upper-case names, an uppercase string in a pattern (in a pattern-matching expression) could reliably be interpreted as a constructor name since it could not be introducing any new C variables.

The naming scheme can still be confusing for C → CN lifting, e.g. the P in take P = Owned(p) IMO is a confusing name for the value pointed to by p.

Does that sound right?

thatplguy commented 2 months ago

A different option would be to use l_pre and l_post in place of L and L_post, or to number different "generations" of l as l1, l2, etc. Either of these would make the pre-state names a bit longer, but it would make for more uniform names in the presence of loops, which will need variable names such as l_inv or l3 or similar.

I was just going to suggest something similar, e.g. _in (requires), _out (ensures), and _inv (loop invariants). I reviewed the tutorial examples up to (not including) the case studies, and I think this would work, along with a few other complementary changes:

No requirements for local CN variables, e.g. let sum = (i64) x + i64 y; is fine.
Predicates that convert C memory objects to a CN logical representation are named ctypename_at, e.g. sllist_at.
Logical datatype representations often will naturally have different names than their C counterparts, e.g. List vs. sllist. When they coincide, prepend cn_ to the logical datatype, e.g. cn_sllist.

FWIW this is how I found myself naturally writing CN specs before @bcpierce00 started on the style guide.

I think it's also worth being more explicit somewhere in the tutorial about this style of verification, where one builds a separate (logical) representation of correct behavior and then relates the logical representation to the C code – calling that out lets us make the point that one should pick some naming scheme for the logical representation that reflects the corresponding C implementation, even if it's not the one we recommend.

bcpierce00 commented 2 months ago

I agree that the l_in / l_out convention serves the (critical) purpose of establishing a clear rule for "what to call the thing we get when we take Owned(l). I also like the symmetry of the _in and _out suffixes.

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

cp526 commented 2 months ago

I agree that the l_in / l_out convention serves the (critical) purpose of establishing a clear rule for "what to call the thing we get when we take Owned(l). I also like the symmetry of the _in and _out suffixes.

*_in and *_out is quite good, and (positively) a bit less math-sy than *_pre and *_post; *_inv unfortunately isn't quite symmetric (in either scheme, with pre/post or with in/out). Maybe there's something better than *inv*? *while or *loop? Still not symmetric.

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

I better understand this concern now, but I wonder whether this problem, in fact, arises from how we are using the name of the pointer (here l) in constructing the name of the resource output? If the pointer, for instance, as in Cole's example above, is called p, then neither P nor P_in nor p_in is a good name for the output, and even the capitalised P or P_in might be misunderstood as the heap state under p, no?

Would it not be sufficient, and make specifications clearer, if the user could more freely pick a suitable name for the resource output -- perhaps based on its type, as previously suggested -- and we advise them to uniformly use the *_in/*_out suffixing scheme for consistency between preconditions, postconditions, and loop invariants? For instance, the user might choose a convention to call list-typed variables xs, ys, etc., and then specify take xs_in = List__I32(p) in the precondition and take xs_out = List__I32(p) in the postcondition. In the case of Owned this scheme might lead to names such as take i_in = Owned<int>(p) and take node_in = Owned<struct node>(p). The user could choose to include the name of the pointer p in the name of the resource output, perhaps to clearly disambiguate between multiple outputs of the same kind, but I'm not sure it's necessary (or always useful) to insist on it. If we decided to generalise CN resource predicates to not require a first pointer-typed argument ( https://github.com/rems-project/cerberus/issues/303 ) -- not clear right now -- we would lose that "hook" for mechanically deriving the output name.

For completeness, https://github.com/rems-project/cerberus/issues/312 discusses replacing the take notation. If we changed the notation for resource assertions to not require names for resource outputs, but instead (as discussed in some form there) had alternative syntax for referring to the outputs of a resource without such names, then this aspect of the naming scheme would become less important.

thatplguy commented 2 months ago

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

Hm. Maybe I'm confused in the same way we fear our user might be 🙂 . In my mind, the important distinction is made by the types involved. Our verification strategy, broadly speaking, is to develop a CN model of correct computation and relate it to the C implementation. The model happens to use richer abstractions, like mathematical integers, to more easily represent programmer intentions.

In that sense, the v in let v = Owned<int>(p) is in fact representing the state of the heap at p, because v has type int. But when we use a custom resource predicate to project the heap structure into a CN datatype, we're relating the implementation to the mathematical abstraction denoted by the datatype. The same thing happens when we lift v into a larger integer type, e.g. let sum = (i64) v + (i64) 1.

If that's more or less accurate, and the types and the location of their definitions (in C vs in CN) convey the distinction, then I'd be in favor of relying on the tools for showing types and definitions in the IDE rather than pushing that information into naming conventions.

bcpierce00 commented 2 months ago

Your let v = Owned<int>(p) example is telling — this really captures the inherent confusion that has been nagging at me for some time, and that has made me give u- on the idea of really, fully separating the two worlds. I agree that the ‘embedding C in a larger world’ perspective is the right one.

Message ID: @.***>