riscv-software-src / riscv-unified-db

Machine-readable database of the RISC-V specification, and tools to generate various views

Common field reference table #203

Open drom opened 2 days ago

drom commented 2 days ago

Currently, the named fields belong to each instruction individually and don't carry any instruction-based semantics.

Here is an example: the addi instruction, with rs1, rd, and imm:

addi:
  long_name: Add immediate
  description: Add an immediate to the value in rs1, and store the result in rd
  definedBy: I
  assembly: xd, xs1, imm
  encoding:
    match: -----------------000-----0010011
    variables:
    - name: imm
      location: 31-20
    - name: rs1
      location: 19-15
    - name: rd
      location: 11-7
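
To make the gap concrete, here is a minimal sketch of what a consumer can do with the entry as it stands. The dict mirrors the YAML above; the helper names (`matches`, `extract`, `decode`) are hypothetical and not part of the database tooling. The entry yields raw bit fields only: nothing in it says that imm is sign-extended or that rd is a destination.

```python
# Plain-dict mirror of the addi entry above (only the encoding part).
ADDI = {
    # 32 characters, bit 31 first; built piecewise to keep the widths obvious.
    "match": "-" * 17 + "000" + "-" * 5 + "0010011",
    "variables": [
        {"name": "imm", "location": "31-20"},
        {"name": "rs1", "location": "19-15"},
        {"name": "rd", "location": "11-7"},
    ],
}


def matches(word, pattern):
    """True if every fixed (non '-') bit of the pattern equals the bit in word."""
    for i, ch in enumerate(pattern):
        bit = (word >> (31 - i)) & 1
        if ch != "-" and int(ch) != bit:
            return False
    return True


def extract(word, location):
    """Return the raw, unsigned bits of a 'hi-lo' location."""
    hi, lo = (int(x) for x in location.split("-"))
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word, entry):
    """Map each variable name to its raw field value, or None on a mismatch."""
    if not matches(word, entry["match"]):
        return None
    return {v["name"]: extract(word, v["location"]) for v in entry["variables"]}


print(decode(0x00510093, ADDI))  # addi x1, x2, 5 -> {'imm': 5, 'rs1': 2, 'rd': 1}
```

Any interpretation beyond the raw bits (signedness, register class, source or destination role) has to come from knowledge outside the entry.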

Does it make sense for each instruction to reference a common field table that carries more detailed information about each field (type, zero/sign extension, addressing mode, source/destination, etc.)? Here is what the table could look like:

https://github.com/drom/riscv/blob/master/lib/fieldo.js

This information could be elaborated inside the instruction when a flat form is needed.

addi:
  long_name: Add immediate
  description: Add an immediate to the value in rs1, and store the result in rd
  definedBy: I
  assembly: xd, xs1, imm
  encoding:
    match: -----------------000-----0010011
    variables:
    - {$ref: 'fieldo.json/imm12'}
    - {$ref: 'fieldo.json/rs1'}
    - {$ref: 'fieldo.json/rd'}
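
The elaboration into a flat form could be sketched roughly as follows. The contents of `FIELD_TABLES` are hypothetical: the keys follow the `$ref` strings above, but the attributes (`signed`, `role`) are illustrative assumptions, not the actual contents of fieldo.js. The `file/key` split is also an assumption about how these references would be read; standard JSON Reference would write the target as `fieldo.json#/imm12`.

```python
import copy

# Hypothetical field table; every attribute here is an illustrative assumption.
FIELD_TABLES = {
    "fieldo.json": {
        "imm12": {"name": "imm", "location": "31-20", "signed": True},
        "rs1": {"name": "rs1", "location": "19-15", "role": "source"},
        "rd": {"name": "rd", "location": "11-7", "role": "destination"},
    }
}


def resolve_ref(ref, tables):
    """Split 'file/key' at the last slash and look the key up in that file."""
    file_name, _, key = ref.rpartition("/")
    return copy.deepcopy(tables[file_name][key])


def flatten(node, tables):
    """Recursively replace every {'$ref': ...} mapping with a copy of its target."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return resolve_ref(node["$ref"], tables)
        return {k: flatten(v, tables) for k, v in node.items()}
    if isinstance(node, list):
        return [flatten(item, tables) for item in node]
    return node


addi = {
    "long_name": "Add immediate",
    "encoding": {
        "match": "-" * 17 + "000" + "-" * 5 + "0010011",
        "variables": [
            {"$ref": "fieldo.json/imm12"},
            {"$ref": "fieldo.json/rs1"},
            {"$ref": "fieldo.json/rd"},
        ],
    },
}

flat = flatten(addi, FIELD_TABLES)
for variable in flat["encoding"]["variables"]:
    print(variable)
```

`flatten` returns a new structure and leaves the referencing form untouched, so both views can coexist.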

Also, given how widely some fields are used, it may be important to know which sets of instructions share the semantics of a given field.

dhower-qc commented 1 day ago

That's an interesting suggestion, and it would be great to capture more semantic information without having to parse/analyze the IDL code. I'm 100% on board with getting this information in, but just want to think through the best approach.

Generally, we've tried to keep .yaml files under arch bare-bones to minimize the dependencies needed to consume the data in any programming language ('flat' as you say). That's why we've avoided things like YAML anchors/refs or, in this case, JSON references.

So, what to do?

  1. Introduce JSON Reference in .yaml files under arch. We can provide a flat version as a generated artifact with each commit (we should really be doing this anyway).
  2. Use JSON Reference in template files that generate .yaml files under arch. We already do something like this in limited cases where there is a lot of repetition (e.g., the pmpaddrN registers, which are generated from pmpaddrN.layout).
  3. Don't introduce JSON Reference, and just add the 'flat' info to the existing .yaml files. We could programmatically detect common fields, but it wouldn't be explicit in the data anywhere.
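
For option 3, programmatic detection of common fields might look like the sketch below. It assumes two variables are "the same field" when both name and location agree, which is a simplification; the andi and add entries are written here in the same shape as the addi example and are assumptions about how they would appear in the database. `common_fields` is a hypothetical helper.

```python
from collections import defaultdict

# Variable lists in the shape of the addi example; andi and add are assumed entries.
INSTRUCTIONS = {
    "addi": [
        {"name": "imm", "location": "31-20"},
        {"name": "rs1", "location": "19-15"},
        {"name": "rd", "location": "11-7"},
    ],
    "andi": [
        {"name": "imm", "location": "31-20"},
        {"name": "rs1", "location": "19-15"},
        {"name": "rd", "location": "11-7"},
    ],
    "add": [
        {"name": "rs2", "location": "24-20"},
        {"name": "rs1", "location": "19-15"},
        {"name": "rd", "location": "11-7"},
    ],
}


def common_fields(instructions, min_users=2):
    """Group instructions by (name, location) and keep fields with enough users."""
    users = defaultdict(list)
    for inst_name, variables in instructions.items():
        for var in variables:
            users[(var["name"], var["location"])].append(inst_name)
    return {field: insts for field, insts in users.items() if len(insts) >= min_users}


for field, insts in common_fields(INSTRUCTIONS).items():
    print(field, insts)
```

This recovers the sharing after the fact, but as noted in option 3, the relationship still would not be explicit anywhere in the data.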

I'm inclined to go the route of option 1, but just worry that it could be another obstacle making UDB adoption less likely. Especially if JSON Reference becomes a requirement for anyone contributing to the DB (e.g., extension authors).

This is a good topic to discuss in a sync-up meeting. In the meantime, I'm curious what your thoughts are.

Side note

This data is related to instruction formats (I-type, R-type, etc.), which have come up a few times since they're not currently captured by the database. This was initially intentional -- in my corner of the world, we generally don't care about them, especially since it gets ugly quick with things like the C extension. However, compiler folks like @apazos have been educating me on why it's important to capture them, e.g., for linker relaxation information. It seems like this would be the right place to figure out how to integrate the format information.

drom commented 1 day ago

You already have some implicit references from one file to another, like definedBy: I. Implicit links require custom sideband knowledge about how the components fit together. Explicit JSON/YAML references are an explicit composition mechanism that is still "human readable/writable" but allows declarative, low-overhead, language-agnostic automatic resolution. IMHO we should use $ref.

drom commented 1 day ago

This data is related to instruction formats (I-type, R-type, etc.), which has come up a few times since it's not currently captured by the database.

Yes, the canonical '*-type' format is a very high-level abstraction that is already broken in many places. I don't see much value in carrying it as is through the machine-readable spec. On the other hand, a unique set of field types constitutes a specific format. For example:

variables:
    - {$ref: 'fieldo.json/imm12'}
    - {$ref: 'fieldo.json/rs1'}
    - {$ref: 'fieldo.json/rd'}

We can call it {rd,rs1,imm12}-type format.

All C-type formats just automatically get specialized forms:

https://github.com/drom/riscv/blob/master/lib/fieldo.js#L63
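
The {rd,rs1,imm12}-type naming above could be derived mechanically from the references, as in this sketch. The ordering rule (reverse of the listed order, so the lowest field comes first) is an assumption chosen to reproduce the name given above, and 'fieldo.json/rs2' is a hypothetical key; `format_name` and `group_by_format` are illustrative helpers.

```python
from collections import defaultdict


def ref(key):
    """Build a reference in the same shape as the examples above."""
    return {"$ref": "fieldo.json/" + key}


# Variable lists using references; the rs2 key is hypothetical.
INSTRUCTIONS = {
    "addi": [ref("imm12"), ref("rs1"), ref("rd")],
    "andi": [ref("imm12"), ref("rs1"), ref("rd")],
    "add": [ref("rs2"), ref("rs1"), ref("rd")],
}


def format_name(variables):
    """Name a format after its field keys, lowest-listed field first (assumed rule)."""
    keys = [v["$ref"].rpartition("/")[2] for v in variables]
    return "{" + ",".join(reversed(keys)) + "}-type"


def group_by_format(instructions):
    """Collect the instruction names that share each derived format name."""
    groups = defaultdict(list)
    for inst_name, variables in instructions.items():
        groups[format_name(variables)].append(inst_name)
    return dict(groups)


print(format_name(INSTRUCTIONS["addi"]))  # {rd,rs1,imm12}-type
print(group_by_format(INSTRUCTIONS))
```

Under this scheme no separate list of formats needs to be maintained; a new combination of fields simply produces a new name.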

AFOliveira commented 4 hours ago

As a follow-up to the earlier discussion, I'll take the first step towards this and rename the variables to their own nicknames, if no one objects. Once those nicknames are correct, I guess moving to a JSON schema pass would be just find/replace, correct?

@drom @dhower-qc Do you think we should keep values like imm12hi and imm12lo separate in the variable list? It seems quite odd because, no matter how the encoding is done, it is actually only one variable.
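
To illustrate the single-variable view, here is a sketch that treats a split immediate as one field with a multi-segment location. The `hi-lo|hi-lo` syntax is an assumption made for this example, as are the helper names; the bit positions are those of the store (S-type) immediate in the base ISA, where imm[11:5] sits in bits 31-25 and imm[4:0] in bits 11-7.

```python
def extract(word, location):
    """Concatenate the 'hi-lo' segments of a location, most significant first.

    Returns the raw value together with its total width in bits.
    """
    value, width = 0, 0
    for segment in location.split("|"):
        hi, lo = (int(x) for x in segment.split("-"))
        size = hi - lo + 1
        value = (value << size) | ((word >> lo) & ((1 << size) - 1))
        width += size
    return value, width


def sign_extend(value, width):
    """Interpret a width-bit value as a two's-complement signed integer."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


# One variable covering what would otherwise be imm12hi and imm12lo.
SW_IMM = "31-25|11-7"

for word in (0x00512423, 0xFE512E23):  # sw x5, 8(x2) and sw x5, -4(x2)
    raw, width = extract(word, SW_IMM)
    print(hex(word), sign_extend(raw, width))
```

With the location carrying the split, the variable list keeps one entry per operand regardless of how the bits are scattered in the encoding.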