Add human-readable instruction name annotations

ThinkOpenly commented 3 months ago

A simple implementation is proposed to annotate each instruction with its human-readable name.

Two methods are used, given the mildly diverse instruction implementations in Sail:

An annotation added to the union clause ast for "simple" instructions that are not defined as a group:
```
$[name "jump and link"]
union clause ast = RISCV_JAL : (bits(21), regidx)
```
For "simple grouped" instructions where the set of mnemonics are defined as part of a simple mapping, an annotation is added to each relationship of the mapping:
```
mapping utype_mnemonic : uop <-> string = {
$[name "load upper immediate"]
RISCV_LUI   <-> "lui",
$[name "add upper immediate to PC"]
RISCV_AUIPC <-> "auipc"
}
```

Other methods may be required for more complicated mnemonics, especially those with varied suffixes, but these methods easily cover most instructions. This, at least, is a start.

These annotations allow a parser/backend to have relevant human-readable information for such purposes as creating documentation.

This PR is currently marked as a draft because (1) it is incomplete, and (2) we seek feedback on the approach before continuing the work. (Thanks!)

jrtc27 commented 3 months ago

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv (which is doable, it's what we do for cheri-specification with sail-cheri-riscv, but I doubt will happen here), I struggle to see how this achieves anything other than duplicate existing descriptions?

github-actions[bot] commented 3 months ago

Test Results

712 tests ±0 712 :white_check_mark: ±0 0s :stopwatch: ±0s 6 suites ±0 0 :zzz: ±0 1 files ±0 0 :x: ±0

Results for commit 6670d78a. ± Comparison against base commit 05b845c9.

:recycle: This comment has been updated with latest results.

Alasdair commented 3 months ago

This would be very useful for some of the documentation things I would like to do, so I am quite strongly in favour of adding this metadata.

Timmmm commented 3 months ago

Yeah I would also like more documentation metadata here for a RISC-V VSCode extension I'm working on. I think there's quite a bit of demand for machine readable metadata & doc type stuff (c.f. Qualcomm's other ISA specification language), so we should probably start actually doing it.

I think at least for the names there isn't really any risk from keeping stuff in sync, and it also helps understand for people who don't know what the ridiculously terse mnemonics mean.

How do these name annotations get exposed? Do they end up in the doc JSON generated by Sail?

ThinkOpenly commented 3 months ago

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv

There's no room for iterative improvement?

I struggle to see how this achieves anything other than duplicate existing descriptions?

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

ThinkOpenly commented 3 months ago

How do these name annotations get exposed? Do they end up in the doc JSON generated by Sail?

Currently, the net effect will be to add the annotations to the associated node in the AST. The current backends ignore the new content. Our out-of-tree JSON backend pulls the annotations into the JSON, associated with the respective mnemonic:

{
  "instructions": [
    {
      "mnemonic": "addi",
      "name": "add immediate",
      "operands": [...]
      "fields": [...]
[...]

ThinkOpenly commented 3 months ago

I've fixed the issue identified by @jrtc27 (by changing "store fence" to "supervisor memory-management fence" as found in Volume II: RISC-V Privileged Architectures V20211203, section 4.2.1 "Supervisor Memory-Management Fence Instruction").

I'm removing the "draft" tag and marking "ready for review".

jrtc27 commented 3 months ago

I do not believe this should be here. These names are not official, they are invented, yet they are being put in the official model. Naming the instructions is not the responsibility of the model but of the documentation.

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv

There's no room for iterative improvement?

Not if the end goal isn't to avoid duplication.

I struggle to see how this achieves anything other than duplicate existing descriptions?

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

Words in the ISA manual. That's where the descriptions currently belong

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

Why is that the Sail model's responsibility? If the AsciiDoc sucks, fix the AsciiDoc.

ThinkOpenly commented 3 months ago

I do not believe this should be here. These names are not official, they are invented, yet they are being put in the official model.

That is a valid concern. There should be canonical human-readable names associated with each instruction mnemonic. These should be enumerable and easily associated with the mnemonics, and the best place to represent things like that is in the Sail code. I looked through the PR to see if the names came directly from the ISA documentation or were "invented".

Pulled from ISA doc: lui, auipc, jal, jalr, "conditional branch", slti, "load", "store", ecall, ebreak, wfi, sfence.vma
"Invented", but with strong analogs/hints in the ISA doc: addi, andi, ori, xori, sltiu, add, slt, sltu, and, or, xor, sll, srl, sub, sra, addiw, addw, subw, sllw, srlw, sraw, slliw, srliw, sraiw
"Invented", but analogs in the ISA doc: slli, srli, srai (actually, I see these aren't in the PR. These should be added.)
"Invented": fence, fence.tso, mret, sret

Note that any "invented" instruction names were due to the lack of a clear instruction name in the ISA doc.

Naming the instructions is not the responsibility of the model but of the documentation.

I can see your point, but I disagree. If there is to be complete separation between the model and the documentation, then there is a lot that needs to be removed from the documentation. Since humans are writing the model, other humans should be able to read it. What's being added here is annotations, essentially comments that end up in the AST, and otherwise ignored. They make the Sail code better, and easier to understand and interpret.

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

Words in the ISA manual. That's where the descriptions currently belong

I'll disagree here, too. I think the model needs to stand pretty well on it's own. You mentioned earlier the "sail-cheri-riscv" effort to merge the documentation with the model, which sounds quite useful. Why is it OK there, but not OK here? Does it have to be all-or-nothing?

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

Why is that the Sail model's responsibility? If the AsciiDoc sucks, fix the AsciiDoc.

I think we should address inadequacies in the Sail code as well.

PeterSewell commented 3 months ago

It'd be no bad thing for RISC-V to have standardised long names, and if it did then it'd be useful to include in the model, but:

I don't think the formal modelling group should unilaterally invent them,
they should be made to appear uniformly and at least semi-automatically in the docs, with a scheme for doing that agreed up-front, and
having them requires one to agree and freeze exactly what is named, that would need some thought and agreement. There are many choices in what one deems to be a single "instruction" (with multiple variants) vs multiple distinct "instructions". The long names might or might not simply correspond to the mnemonics. In some cases one wants to model to handle a family of instructions uniformly (eg all the ITYPE instructions together, as it does now, following the original prose spec) and then in the documentation either show that uniformity or split them apart - as in the "one page per instruction" view that some people have asked for, that Alasdair prototyped.

Peter

On Mon, 12 Aug 2024 at 21:24, Paul Clarke @.***> wrote:

I do not believe this should be here. These names are not official, they are invented, yet they are being put in the official model.

That is a valid concern. There should be canonical human-readable names associated with each instruction mnemonic. These should be enumerable and easily associated with the mnemonics, and the best place to represent things like that is in the Sail code. I looked through the PR to see if the names came directly from the ISA documentation or were "invented".

Pulled from ISA doc: lui, auipc, jal, jalr, "conditional branch", slti, "load", "store", ecall, ebreak, wfi, sfence.vma

"Invented", but with strong analogs/hints in the ISA doc: addi, andi, ori, xori, sltiu, add, slt, sltu, and, or, xor, sll, srl, sub, sra, addiw, addw, subw, sllw, srlw, sraw, slliw, srliw, sraiw

"Invented", but analogs in the ISA doc: slli, srli, srai (actually, I see these aren't in the PR. These should be added.)

"Invented": fence, fence.tso, mret, sret

Note that any "invented" instruction names were due to the lack of a clear instruction name in the ISA doc.

Naming the instructions is not the responsibility of the model but of the documentation.

I can see your point, but I disagree. If there is to be complete separation between the model and the documentation, then there is a lot that needs to be removed from the documentation. Since humans are writing the model, other humans should be able to read it. What's being added here is annotations, essentially comments that end up in the AST, and otherwise ignored. They make the Sail code better, and easier to understand and interpret.

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

Words in the ISA manual. That's where the descriptions currently belong

I'll disagree here, too. I think the model needs to stand pretty well on it's own. You mentioned earlier the "sail-cheri-riscv" effort to merge the documentation with the model, which sounds quite useful. Why is it OK there, but not OK here? Does it have to be all-or-nothing?

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

Why is that the Sail model's responsibility? If the AsciiDoc sucks, fix the AsciiDoc.

I think we should address inadequacies in the Sail code as well.

— Reply to this email directly, view it on GitHub https://github.com/riscv/sail-riscv/pull/520#issuecomment-2284843624, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABFMZZXU2G4SMYJLVZT5OVTZREKYRAVCNFSM6AAAAABMHTEU66VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBUHA2DGNRSGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

arichardson commented 3 months ago

While I agree we shouldn't invent names here that are not official, I do think having names at least for the ones where they are known is a useful step towards being able to generate per-instruction documentation from the sail model.

I don't think this change has any effect on the readability (if anything it makes some of the more obscure mnemonics more obvious without having to find them in the ISA reference) so I'd be fine with this change (at least for the ones with existing long names).

ThinkOpenly commented 2 months ago

This PR was discussed in the Tech Golden Model meeting today.

What I heard was a general consensus that this additional information is needed, to improve the readability of the Sail model and to facilitate uses downstream.

The concern about adding names which do not come from the ISA Specification should be addressed in parallel: add the names to Sail, since they are ready to go (although see below), and propose adding the same names to the ISA Specification. If those additions result in changed names, then the names in the Sail code can easily be changed to match.

There was discussion about Sail-to-Asciidoc that I did not follow.

The current implementation covers those mnemonics for which the mnemonic is not "constructed", but appears as a fully-specified continuous string. In contrast, there are instruction groups where the set of mnemonics is constructed, like "LOAD":

mapping clause assembly = LOAD(imm, rs1, rd, is_unsigned, size, aq, rl)
  <-> "l" ^ size_mnemonic(size) ^ maybe_u(is_unsigned) ^ maybe_aq(aq) ^ maybe_rl(rl) ^ spc() ^ reg_name(rd) ^ sep() ^ hex_bits_signed_12(imm) ^ "(" ^ reg_name(rs1) ^ ")"

Here, the mnemonic is constructed from "l" + { "", "u" } + { "", ".aq" } + { "", ".rl" }.

Looking just now, it is easy to add annotations conveying the intent of each constructive element (".aq" == "acquire"; ".rl" == "release"). I'll do that now, and push new changes here.

(My out-of-tree Sail JSON backend will need to add code to glue the pieces together appropriately in order to construct each full human-readable instruction name, but that's a simple matter of programming.)

Timmmm commented 2 months ago

There was discussion about Sail-to-Asciidoc that I did not follow.

I think the suggestion was that the Asciidoc backend (which actually outputs JSON) would include the annotations you've added, so you (maybe) wouldn't need a custom JSON backend.

jrtc27 commented 2 months ago

I disagree with the path forwards. There should be agreement from riscv-isa-manual people before committing to this in the Sail model. Otherwise you'll have to rip them out again if they disagree.

Alasdair commented 2 months ago

We had people from the RISC-V documentation side of things at the meeting today, and we don't think that will be an issue.

I think the suggestion was that the Asciidoc backend (which actually outputs JSON) would include the annotations you've added, so you (maybe) wouldn't need a custom JSON backend.

It actually already does, but I don't think the Asciidoctor plugin has any commands that use them, e.g.

$[name "long name"]
$[simple_attribute]
$[complex_attribute { extra_data = 3, more_data = "some string" }]
val short : unit -> unit

will generate:

  "vals": {
    "short": {
      "val": {
        "source": {
          "contents": "val short : unit -> unit",
          "file": "test.sail",
          "loc": [ 7, 150, 150, 7, 150, 174 ]
        },
        "type": {
          "contents": "unit -> unit",
          "file": "test.sail",
          "loc": [ 7, 150, 162, 7, 150, 174 ]
        },
        "attributes": [
          [ "name", "long name" ],
          "simple_attribute",
          [
            "complex_attribute",
            { "extra_data": 3, "more_data": "some string" }
          ]
        ]
      }
    }
  },

jrtc27 commented 2 months ago

Is there a reason why attributes is an array of strings or pairs, rather than being a map from name to (optional) value (with something like true or {} as the value for attributes with no value)?

Alasdair commented 2 months ago

No particular reason, it just seemed like a fairly simple and lossless JSON encoding. A default value of true would erase the difference between $[attr true] and $[attr] for example.

jrtc27 commented 2 months ago

Or null, if you're concerned about being able to differentiate those two. I struggle to see when you would, but null vs not present seems even more unlikely to matter.

Alasdair commented 2 months ago

Null would work in the JSON output as the Sail attributes are essentially JSON without null, and a slightly more Sail-ish syntax ('=' rather than ':', and unquoted keys). The array output does preserve the order, which most JSON parsers would not guarantee for an object, and it also allows the same attribute to appear twice, which is legal but of dubious value.

jrtc27 commented 2 months ago

If you need to preserve order or support multiple instances of the same output, then yeah, you can't use an object. But if neither of those are requirements then an object is much easier to work with.

Alasdair commented 2 months ago

I could do [simple_attribute null] if the extra consistency makes it easier to work with, but I think in general it should precisely represent the OCaml type (string * attribute_data option) list.

jrtc27 commented 2 months ago

I don't think that really helps, the big pain point is having to iterate to find anything (e.g. trying to extract attributes in jq is likely messy).

Alasdair commented 2 months ago

Not sure that making it easier to use with jq is worth losing the ordering information from the source, and jq does have ways to search lists in queries (I think .select(first | contains("attr")) for the more consistent list with null) so it's not even that bad surely?

ThinkOpenly commented 2 months ago

FYI, I've added a commit here which adds annotations for the constructed parts of some instruction groups (like LOAD, STORE), as described in my last comment.

I think this at least addresses concerns about "full coverage" for the base instructions.

Whether these names need to go first into the ISA Spec is the only remaining discussion point to my knowledge.

AFOliveira commented 2 months ago

I find the content of this PR really interesting and I have a question on this: I have previously tried to do a similar approach to what that back-end is doing in generating the json file and had trouble with how instruction names are defined in SAIL since I could not find any direct standard or correlation. Has this been (or still is) a problem in generating the back-ends?

ThinkOpenly commented 2 months ago

I [...] had trouble with how instruction names are defined in SAIL since I could not find any direct standard or correlation.

If you are referring to human-readable instruction names, there are no human-readable instruction names in the RISC-V Sail code. This PR provides one way of doing so.

Further, there aren't even human-readable instruction names in the RISC-V ISA Specification consistently. This is one of the current objections to this PR, that by adding those names in this way, a standard is being implicitly created without whatever proper review would be warranted. Although, it's not completely clear what review is necessary... is a full new ratified version of the spec with instruction names required before adding names to the Sail model, even though it has also been argued that the names are not part of the model proper.

Next steps with this PR are unclear to me.

Has this been (or still is) a problem in generating the back-ends?

The current in-tree backends of the Sail parser, to my knowledge, have no interest in human-readable instruction names. (Or if they do, are starving for lack of them.)

AFOliveira commented 2 months ago

If you are referring to human-readable instruction names, there are no human-readable instruction names in the RISC-V Sail code. This PR provides one way of doing so.

What I was mentioning was a way to describe instructions in which we know what instruction we are concretely mentioning. Your approach would definitely solve this problem for humans, but being long format wouldn't it hurt machine readibility, at least from a parsing stand point? What I was trying to question is if there are concrete tags in the sail models that are easy to refer to, take this examples:

JAL - it is named RISCV_JAL in riscv_insts_base.sail
RORIW - it is named RISCV_RORIW in the riscv_insts_zbb.sail
C_NOP - it is named C_NOP in the riscv_insts_zca.sail

Do this names follow any standard? Because I am not seing which standard they are following and I believe that if they could be at least EXT_INSTRUCTIONNAME it would be easier to digest the sail code and use it for different things.

ThinkOpenly commented 2 months ago

Your approach would definitely solve this problem for humans, but being long format wouldn't it hurt machine readibility, at least from a parsing stand point?

The only purpose for the "long format" is just for human-readable names. They are annotations, like comments but they end up in the AST after parsing, so the backends can see them and make use of them (or not).

What I was trying to question is if there are concrete tags in the sail models that are easy to refer to, take this examples:

JAL - it is named RISCV_JAL in riscv_insts_base.sail

RORIW - it is named RISCV_RORIW in the riscv_insts_zbb.sail

C_NOP - it is named C_NOP in the riscv_insts_zca.sail

Do this names follow any standard?

These were established long before my involvement, so I am not the right person to answer, but I think the answer is an easy "no".

Because I am not seing which standard they are following and I believe that if they could be at least EXT_INSTRUCTIONNAME it would be easier to digest the sail code and use it for different things.

In a previous PR (#506), I changed the way that instructions are "tagged" by their containing extension to hopefully make it easier for a parser backend to determine that. If that is insufficient, let's talk about ways to improve it further.

As for "INSTRUCTIONNAME", that's not so easy given the grouping for similar instructions. For example, both "lui" and "auipc" are in the "UTYPE" definition. What should the single "instruction name" be? :-)

AFOliveira commented 2 months ago

I now understand the proper reason of this PR, sorry for my misunderstanding. Although, I still think that my point on instruction naming is valid, I also definitely do not have a proper solution for that, I think I'll take some time to better understand if there is a way to make this better and then file an issue or PR on that separate topic to get more opinions on this. All in all, thank you for your explanation, it was really useful and I totally agree with the need for a proper long naming standard on RISC-V instruction documentation.

jordancarlin commented 2 months ago

@ThinkOpenly It looks like the riscv-unified-db project has also been adding long_names to each instruction. It probably makes sense for these to be consistent with whatever ends up being used for the Sail model/ISA manual so it might be worth connecting with that project.

jrtc27 commented 2 months ago

Or that should be your source of truth for the instruction names, not the Sail model, if there's already such a thing...

ThinkOpenly commented 2 months ago

@ThinkOpenly It looks like the riscv-unified-db project has also been adding long_names to each instruction. It probably makes sense for these to be consistent with whatever ends up being used for the Sail model/ISA manual so it might be worth connecting with that project.

Thanks for the pointer @jordancarlin. I'm aware of the project, and getting more connected with it. The question is, where are those names coming from?

Or that should be your source of truth for the instruction names, not the Sail model, if there's already such a thing...

I don't understand how that's a better source for names than what I've already proposed. As there is no canonical single source for names, we're all making them up as we go.

FYI, I did pitch a list of names for the base instructions to riscv-isa-manual: https://github.com/riscv/riscv-isa-manual/issues/1638, presuming that is the preferred route. I supposed I should add another column to the table in that issue for the names used by riscv-unified-db, because there are certainly differences.

riscv / sail-riscv

Add human-readable instruction name annotations #520

Test Results