jamesbowman / swapforth

Swapforth is a cross-platform ANS Forth

[Feature request] further documentation on the J1 core versions #74

Open higaski opened 2 years ago

higaski commented 2 years ago

I'm trying to wrap my head around how the J1 core evolved over time and which versions of the core are featured in which repositories/folders. My current understanding is that, starting from the original J1 core, two new versions called J1a and J1b were created. Since the J1 repository was updated as well, I figure the changes were backported?

Sadly I have hardly any knowledge of Verilog/VHDL and therefore a hard time reading the .v files. I'd appreciate it if someone could point out the key differences to me. E.g. this blog post mentions that the return bit was moved from bit 12 to 7... things like that.

Is there a paper describing the new versions of the core like there is for the original?

RGD2 commented 2 years ago

Well, j1b is 32-bit, whereas j1a is 16-bit. But apart from that:

The j1a is able to run on SRAMs that are only pseudo-dual-port, whereas the original J1 was designed to run on true dual-port SRAMs, which aren't available on all FPGAs.

The J1[b] is all about one basic Forth instruction per clock, and so needs true dual-port SRAM: the first port is always available each clock to read the instruction, while the second port may optionally do a read or a write to RAM, allowing the @ and ! words to run in a single cycle without interfering with the next instruction fetch.

Pseudo-dual-port SRAMs can read from one address whilst writing to a different address, but each of the two ports is dedicated to only reading or only writing. With true dual-port SRAM, each of the ports can read OR write. It's possible to 'emulate' a true dual-port SRAM with only pseudo-dual-port blocks, but it costs performance, since you need to clock the SRAM twice as fast to do that.
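
If it helps, here's roughly what the two look like in RTL. This is a sketch of my own (names and widths are mine, not from the repo):

```verilog
// Pseudo dual-port: one port only ever reads, the other only ever writes.
// This is all an iCE40 block RAM can do.
module pdp_ram(input clk,
               input  [10:0] raddr, output reg [15:0] rdata,
               input  [10:0] waddr, input [15:0] wdata, input we);
  reg [15:0] mem [0:2047];
  always @(posedge clk) begin
    rdata <= mem[raddr];          // port A: read only
    if (we) mem[waddr] <= wdata;  // port B: write only
  end
endmodule

// True dual port: each port can independently read OR write every cycle.
// This is what the original J1 assumes the FPGA provides.
module tdp_ram(input clk,
               input [10:0] a_addr, input [15:0] a_wdata, input a_we, output reg [15:0] a_rdata,
               input [10:0] b_addr, input [15:0] b_wdata, input b_we, output reg [15:0] b_rdata);
  reg [15:0] mem [0:2047];
  always @(posedge clk) begin
    if (a_we) mem[a_addr] <= a_wdata; else a_rdata <= mem[a_addr];
    if (b_we) mem[b_addr] <= b_wdata; else b_rdata <= mem[b_addr];
  end
endmodule
```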

iCE40-architecture FPGAs only have pseudo-dual-port embedded RAM blocks, and the j1a was written to run on an iCE40HX1K chip.

So to accommodate memory access beyond just reading the next instruction, the j1a core has an 'alternate' mode, selected by setting pc[12] in the program counter: if set, the next 'instruction fetch' is really just the second half of a two-cycle @ instruction started on the previous cycle; otherwise it's a normal instruction fetch. The return stack is used to save the actual next instruction location, so it also gets popped into PC when this happens.

You can see pc[12] concatenated into the instruction decode on line 44 of j1a/verilog/j1.v, and likewise on lines 78, 88, 97 and 104, since any behaviour that normally depends on instruction decode needs to do something different during the second phase of @.
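
From memory, the shape of it is something like this - a sketch, not the exact source, so check j1.v itself for the real thing:

```verilog
// Sketch: pc[12] is prepended to the instruction bits, so the
// 'second half of @' case wins over all the normal decodes.
module decode_sketch(input [12:0] pc, input [15:0] insn,
                     input [15:0] st0, alu_out,
                     output reg [15:0] st0N);
  always @* begin
    casez ({pc[12], insn[15:13]})
      4'b1_???: st0N = insn;               // phase 2 of @: this word is RAM data, not an instruction
      4'b0_1??: st0N = {1'b0, insn[14:0]}; // literal
      4'b0_011: st0N = alu_out;            // ALU instruction
      default:  st0N = st0;                // jumps/calls leave T alone
    endcase
  end
endmodule
```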

Of course, this then means there is no need for opcode 8'b011?1100, which j1b needs to have so one can put the second port's read data onto the stack.

Instead that opcode is free to be used for a 'minus' op in the j1a; without it, - would have to be compiled into a defined word as INVERT 1+ + rather than being a normal single instruction like + is.
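
Sketched as an ALU case entry, it would be something like this (hypothetical code, not lifted from the repo; I'm assuming the op field is insn[11:8], with st0/st1 as T and N):

```verilog
// Hypothetical: the freed-up 8'b011?1100 slot spent on subtraction.
module minus_sketch(input [3:0] op, input [15:0] st0, st1,
                    output reg [15:0] st0N);
  always @* begin
    case (op)                     // op = insn[11:8]
      4'b1100: st0N = st1 - st0;  // '-' as N - T in a single cycle
      default: st0N = st0;
    endcase
  end
endmodule
```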

The other difference can be seen if you diff j1a/basewords.fs against j1b/basewords.fs: j1a has opcodes for 2/ and 2* , whereas j1b instead has opcodes for rshift and lshift : j1b has a full shifter unit rather than just a one-step one.
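
Side by side, the difference is roughly this (illustrative only; my own names, with j1a's 16-bit and j1b's 32-bit widths):

```verilog
// One-step shifts (j1a) versus a full shifter (j1b).
module shift_sketch(input [15:0] t16,       // j1a's T
                    input [31:0] t32, n32,  // j1b's T and N
                    output [15:0] two_div, two_mul,
                    output [31:0] rshift, lshift);
  assign two_div = {t16[15], t16[15:1]};  // 2/ : arithmetic right shift by one
  assign two_mul = {t16[14:0], 1'b0};     // 2* : left shift by one
  assign rshift  = n32 >> t32[4:0];       // rshift : N shifted by T, one cycle
  assign lshift  = n32 << t32[4:0];       // lshift
endmodule
```

A full barrel shifter isn't free in LUTs, which I'd guess is part of why the j1a does without one on the little hx1k.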

A minor aside if you can read C code but are just coming to grips with Verilog:

pc[12] isn't actually used to address the SRAM: the SRAM is generated in j1a/mkrom.py so that the initial contents can be set at FPGA compile time, so that the FPGA also bootstraps the core at configure time. (This isn't so necessary now that the icestorm tools can just replace SRAM contents without a recompile, but they couldn't do that back when the j1a was written, and it's a neat way to make the FPGA configuration logic do your SoC core's bootstrapping too.)

Which is to say that j1a.v, the 'top' for the j1a core, includes ../build/ram.v, which you won't find anywhere except as the template inside j1a/mkrom.py.
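
For illustration only - this is NOT the generated ram.v, which instantiates the iCE40 SB_RAM40_4K primitives with their INIT parameters baked in - the same idea in generic inferred-RAM style would look like:

```verilog
// Generic stand-in for the generated RAM: contents are fixed at
// synthesis time, so configuring the FPGA also loads the program.
module rom_ram(input clk,
               input [11:0] addr, output reg [15:0] dout,
               input [15:0] din, input we);
  reg [15:0] mem [0:4095];
  initial $readmemh("image.hex", mem);  // "image.hex" is a hypothetical image file
  always @(posedge clk) begin
    dout <= mem[addr];
    if (we) mem[addr] <= din;
  end
endmodule
```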

The highest RAM fetch address bit the design uses is code_addr[11] (which is pc[11]); higher bits are ignored. The design has 2^12 = 4096 addresses, but they're stored in two 2048-word banks, each consisting of eight 2048x2 memories side by side to store 16 bits.
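
A sketch of that banking, with my own signal names (note the bank select needs delaying a cycle to line up with the registered BRAM outputs):

```verilog
// Two 2048x16 banks (each really eight 2048x2 primitives in parallel)
// make up the 4096-word space; code_addr[11] picks the bank.
module bankmux(input clk, input [11:0] code_addr,
               input [15:0] q_bank0, q_bank1,  // the two banks' read data
               output [15:0] q);
  reg bank_r;
  always @(posedge clk) bank_r <= code_addr[11];  // match the RAM's 1-cycle latency
  assign q = bank_r ? q_bank1 : q_bank0;
endmodule
```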

It's a little confusing IMHO, but 'din' in ram.v is the connection carrying data from the RAM to the core, and vice versa for 'dout'.

Another thing which makes the J1 very fast: note that top-of-stack `st0` and next-on-stack `st1` are not actually both stored in the `stack` modules. This is because you very often want to change both in one cycle, so `st0` is actually an ordinary register, as are `pc` and `dsp`, the latter only being used to keep track of stack depth.
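
In register terms it's just this (widths illustrative, not checked against the source):

```verilog
// The per-cycle-mutable state lives in plain registers, not in the stack RAMs.
module state_sketch;
  reg [15:0] st0;  // T, top of data stack: a register so it can change every cycle
  reg [12:0] pc;   // program counter - really the top of the return stack
  reg [4:0]  dsp;  // data stack pointer: only tracks depth
  // st1 (N) is whatever the stack module's read port currently shows
endmodule
```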

It makes one realize that pc is really the true top of the return stack, and Forth words like >r are really writing to 'next on return stack'. Every cycle starts with reading the top of the return stack from memory, be it to fetch an instruction or just to load TOS from RAM.

Stack movements are just encoded as two-bit signed integers in the ALU opcode format - one for the return stack and one for the data stack - although -2 isn't used. I suppose if you had some reason to need to pop a double-word in one cycle, you might change the stacks to allow that. basewords.fs defines r-2 but never uses it.
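
The delta handling is tiny - something like this (a sketch; the field positions are as I remember the ALU format, so double-check against j1.v):

```verilog
// Two 2-bit signed deltas from an ALU instruction, sign-extended
// onto the stack pointers: 00 = 0, 01 = +1, 11 = -1 (10 = -2, unused).
module delta_sketch(input [15:0] insn, input [4:0] dsp, rsp,
                    output [4:0] dspN, rspN);
  wire [1:0] dd = insn[1:0];  // data stack delta
  wire [1:0] rd = insn[3:2];  // return stack delta
  assign dspN = dsp + {{3{dd[1]}}, dd};  // sign-extend and add
  assign rspN = rsp + {{3{rd[1]}}, rd};
endmodule
```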

You could in principle have opcodes that replace any number of stack items - you'd just rearrange the core so that the top few logical stack items are also registers, like st0 is, allowing the core to potentially update them all at once. Handy if you wanted to put an op for m*/ in there!

This makes the J1 design pretty interesting for custom FPGA SoC use, IMHO.

Of course, in practice I've found it much easier to extend the I/O section (in icestorm/j1a.v) to allow just hooking up 'accelerator' units, added to the design on an as-needed basis.
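
The flavour of it is something like this - a sketch, not the actual icestorm/j1a.v code; the address bits and the 'accelerator' are made up:

```verilog
// One-hot memory-mapped I/O read mux: adding an accelerator is
// one more term here, plus a write strobe on the store side.
module io_sketch(input [15:0] io_addr,
                 input [15:0] uart_data, gpio_in, accel_data,
                 output [15:0] io_din);
  assign io_din = (io_addr[0] ? uart_data  : 16'd0) |
                  (io_addr[1] ? gpio_in    : 16'd0) |
                  (io_addr[2] ? accel_data : 16'd0);  // hypothetical accelerator result
endmodule
```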

The only 'deep' core modding I did was the j4a, which is kinda 4x a j(1/4)a, in a sense. It has 4x the context, and 'looks' like a 1/4-speed j1a to the code... until you put the other 'cores' to work (they're logical only; the ALU, SRAM and I/O are all shared).
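
The barrel idea in miniature (purely illustrative - this isn't the j4a source):

```verilog
// A 4-way barrel: one shared datapath, four copies of the state,
// rotated through on successive clocks so each 'core' sees 1/4 speed.
module barrel_sketch(input clk, output [1:0] slot);
  reg [1:0] slot_r = 2'd0;
  always @(posedge clk) slot_r <= slot_r + 2'd1;  // round-robin thread select
  assign slot = slot_r;  // indexes the per-thread pc and stack pointers
endmodule
```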

Mainly it just has funky 'stack' modules, with a little bit of pipelining and tuning. It's probably got a bug, but has mostly worked out pretty well for me.

It lets me run multiple dumb spin-loop bit-bang I/O routines to control/talk to different chips at different rhythms without any interlocks or glitches. It's limited to a maximum of 4 'threads', but this is heaps for simple things like a PID controller.

A nice consequence is that you can have a spin-loop-based app running and still talk to swapforth over RS-232 to get/set variables in SRAM without any timing changes. You can even actively hack on / rewrite code for different jobs without upsetting the ones that are running at all.

Having no DRAM, no wait cycles, no bubbles and only an 'emergency' interrupt system (to recover crashed cores) is incredibly freeing when you're writing a real-time controller. Kind of like having an RTOS in hardware, only better, since the timing is FPGA-state-machine rock-solid, and interlocks are impossible.

Anyway, the code is so short and beautiful for the J1 cores that 'documenting' them is probably more about learning to read Verilog than anything else.

Better to have a single source of truth and all that. But certainly you are free to write your own paper on it ;)

One interesting observation: insn[12] is never actually used; it's completely ignored... in all J1 cores.

There are other parts of the instruction space which are 'available':

J1 uses a 4-bit field to select one of 16 ops, but that could easily be extended to one of 32 ops, since that thirteenth bit is already 'free'... Also, within the 'func' codes in insn[6:4], only 5 of 8 possible combinations are used.
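
From memory the func decode looks like this (check j1a/verilog/j1.v for the authoritative list); anything not decoded below is free:

```verilog
// The 3-bit func field in an ALU instruction (j1a flavour).
module func_sketch(input [15:0] insn,
                   output func_T_N, func_T_R, func_write, func_iow, func_ior);
  assign func_T_N   = (insn[6:4] == 3'd1);  // copy T into N
  assign func_T_R   = (insn[6:4] == 3'd2);  // copy T into R
  assign func_write = (insn[6:4] == 3'd3);  // store N at [T]
  assign func_iow   = (insn[6:4] == 3'd4);  // I/O write
  assign func_ior   = (insn[6:4] == 3'd5);  // I/O read
endmodule
```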

PythonLinks commented 1 year ago

This intermediate-level documentation was enormously helpful. I do read and write Verilog, but it takes a lot of work to extract this information from the code. In particular, what the J4 does baffled me for months; now I get it. It is a barrel processor. You may want to put that sentence near the top. Indeed this whole document could happily go in the README.

I was also a bit confused as to what pseudo-dual-port RAM does.

Thanks again.