jamesbowman / swapforth

Swapforth is a cross-platform ANS Forth
BSD 3-Clause "New" or "Revised" License

j1a8k doesn't use all available ram #42

Open bmentink opened 8 years ago

bmentink commented 8 years ago

Looks like ram is still 8K, I believe the iCE40HX8K fpga has 32K?

RGD2 commented 8 years ago

There are unused ram blocks, yes. I was thinking of using some to implement the deeper parts of the stacks for the j4a, to try to free up some of the LUTs. (The big pipelined quad-stack structure in the j4a seems to be eating the most area).

I've also been thinking on and off about a 24-bit Jx variant, to directly allow for an even 20 bits of ram address: using 24x 1M (or 16x1M + 8x1M, or 3x8x1M) in external SRAM chips. Unfortunately, it quickly gets expensive. And although I like the concept, I've never actually run out of ram with any practical applications of the j4a. Yet, at least.

So yes, there is ram free - as far as I can recall, about half the ram in the 8k is unused - but using it as directly accessible ram requires changing the instruction format, or increasing the data word width.

I'd prefer to leave it free to implement some application-specific hardware fifos for exchanging data between threads or between peripherals. (Or perhaps crossing clock domains).

If I had SDRAM on board, I'd just use it as a very deep FIFO on the way to a USB 2.0 Hi-Speed Fifo interface chip. (They work much better with at least 8 MiB or so of fifo, rather than the 2-4kiB you usually get).

But doing that also requires srams available to muster data for bursts into/out of the SDRAM chip, or for covering its refresh cycle unavailability.

I have used that sort of setup to maintain continuous 30MB/s captures from banks of ADCs for up to about 8 hrs at a time. (It seemed to work ok for about 99 hours during testing. But I only had 11 TiB of disc space...)

I'm really not at all keen on using SDRAM for system RAM. Too much latency, even with cache. It makes the performance of the whole thing inconsistent. J4a is really useful to me precisely because it is consistent. If I need much ram or CPU power, I'll just plug in an SBC or PC, and transfer the data there for processing.

That the j4a only half-fills an 8k chip gives it a lot of flexibility. I like to deploy it with a Linux SBC handy to keep the toolchain accessible. (And to give it easy network accessibility.)

This means I'll tend to write verilog peripherals to plug in to suit the specific application. At the moment I have one with two differently-configured SPI hardware peripherals. The faster one runs at 20 MHz, and either can be used in word or byte mode. (The slower 10 MHz one does byte swapping because it's used to talk to a CAN bus controller, and CAN bus data is little-endian.) I intend to add the peripherals as open source, but this is kinda where keeping git branches (and sub-branches) of swapforth starts making sense. Although it is reasonable, I think, to push the peripheral module files upstream - just leave them unconnected, except in the downstream branch they originate in.

bmentink commented 8 years ago

@RGD2

Thanks for the explanation. I agree SDRAM is probably not ideal. I would prefer more FPGA ram being freed up, even if it is addressed separately.

My current requirements are to implement a fast DDS circuit in verilog (interfaced to swapforth), so I need a 512x10-bit lookup table for arbitrary waveforms. I don't want to use up any of the remaining 3k of ram for that if possible, as I want to write substantial Forth programs too ..

By the way, can you explain how to use the current ram for that? I don't understand the ram.v module at all, or how the python program generates it from the .hex file ...

Also, is there any documentation for the j4a cpu? I am guessing that it has 4 x simultaneous hardware threads? I can't find any description of it .. may be useful to me as well.

Having open source shareable verilog modules is a great idea. I have some PWM modules with pre-scalers etc I could share as well.

RGD2 commented 8 years ago

The j4a only has four sets of stacks - they round-robin through the ram and alu, which prevents concurrency problems.

The arrangement means the alu could be pipelined, and that's where the speed up would occur. But it would likely never run one thread any faster than a j1a, because of the critical delay path being ultimately just as long.

At the moment, each thread runs 1/4 as fast as a j1a, but pipelining should allow increasing the clock rate to 160 MHz, which would make each thread exactly as fast as a j1a. The ram is capable of much higher rates, so it's not the bottleneck, and the stacks can also be pipelined, so they shouldn't be either. Ditto for the io interface, which if pipelined could accommodate a proper address decoder to allow much increased IO space for hanging peripherals off of, at the expense of a little more latency.

The j4a is compatible with the j1a, in the sense that one can use the j1a simulator (make bootstrap) to compile swapforth for the j4a.

bmentink commented 8 years ago

Got it, thanks. Having 4 threads run as fast as the j1a would be awesome! Having the clock at 160 MHz would help me too, as I would like to clock other fpga peripherals at greater than 100 MHz ..

Did you have any thoughts about my waveform table in fpga ram?

RGD2 commented 8 years ago

I'm currently doing something similar - except using the j4a to spit out samples to an SPI-connected DAC, but only at 2 kHz. (Could have gone much faster of course, but that was the spec I was given.)

I just copy-paste the values from a spreadsheet, where the column had been formatted as "$xxxx , " (without the quotes, but note the spaces). Then drop them into a text file I can #include. The comma is a forth word which compiles what was on the stack to the end of the dictionary. Use the forth word create with a name first to put a word of that name down which gives you the address of the start of the array. Then immediately do the #include. I have it scripted: you can put an include in files you #include, and it works as you expect. (This is how I do a "clean" build of an app: usually using make sim_connect to run the top-level include, so I can avoid having to build the FPGA more than once, or having to simulate the full j4a.)

I also keep track of the number of values with a constant, and the code that accesses the data just does pointer arithmetic then a @.
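As a rough sketch of that layout (the word names, sample values and table length below are placeholders for illustration, not taken from the thread):

    \ Illustrative only: names and values are made up for this sketch.
    create wavetable                     \ create leaves the table's start address
       $0000 , $30FB , $5A82 , $7641 ,   \ each , compiles one cell into the dictionary
       $7FFF , $7641 , $5A82 , $30FB ,   \ (a real table would be #included from a file)
    8 constant #samples                  \ keep the length in a constant

    : sample@ ( n -- x )                 \ pointer arithmetic, then @
       cells wavetable + @ ;

Here sample@ just indexes the array; anything fancier (phase accumulation, interpolation) would sit on top of it.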

There are other, more advanced ways to do this with forth. Don't forget you could load/reload at runtime as well: one idea is to include a raspi to run the UI as a web app, and have it load snippets of forth at run time. (Thus avoiding needing everything to fit in j1 ram at all times.) YMMV though. Don't forget the icoboard exists, and has SRAM on board... It would happily sit on a Pi.

-- Remy

bmentink commented 8 years ago

Hi,

Yes, I understand doing it in Forth. But I want to spit out samples at 48 MHz, so I have to use verilog ..

I would like to know how to format the data to include in the ram arrays the same way the forth code has been included in the binary image, but I don't understand how mkrom.py creates the data that is included in ram.v (not sure how the data is split among the address cells) ... not too familiar with python.

Any help there would be great ..


RGD2 commented 8 years ago

Hmm, might have to ask James, I don't really understand it either. But it's a bit of a kludge: it's possible now to change block SRAM in an FPGA bin file directly with icebram - it wasn't around when mkrom.py was written, and it's much quicker to change the bin file than it is to recompile the whole thing. So, I'd suggest looking into that first.

bmentink commented 8 years ago

Thanks, will look into icebram. I also thought of creating the data as 16-bit hex words, swapping the bytes, and adding/replacing the last block of values in nuc.hex (top of ram) with the contents. My Verilog module would then spit out the block to a 16-bit DAC; I have some Verilog DDS code to do all that ..

bmentink commented 7 years ago

Any further ideas on how I can implement the sine table I need in the FPGA? I have no idea how to address it from verilog, even if I put it at the top of the current RAM.

RGD2 commented 7 years ago

Write a little verilog machine to generate it as you like, entirely without swapforth, starting from one of the example designs.

Then, when that works as expected, modify j1a.v to include the thing, and add controls so your j1a instance can drive it via io! and io@ .

The block rams you'll add will be completely separate from the j1a's ram. See Lattice's documentation for how the different EBR primitive blocks work in verilog. There's a primitive for 256 words of 16 bits; if you want more than that, you'll have to combine multiple blocks together, e.g. two 512x8's which each get the same address inputs, and whose outputs are concatenated to give you your 16-bit data.

Obviously, combined this way, you'll have to 'deinterlace' your sine table into two 512-entry 8-bit tables - and then put them in the relevant spots. You could just use some dummy values in the verilog init streams, and then use icebram to extract all the BRAMs from the fpga .bin file, so you can figure out which block is which, then use python to put the proper wavetables back into a format where icebram will accept them, to overwrite what's in the .bin fpga bitfile.
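If the combining is left to the synthesis tools instead, the table can be described behaviourally and yosys will map it onto EBRs itself. A minimal sketch, assuming a 512x16 table initialised from a hex file (the module name, port names and "sine.hex" are placeholders, not part of swapforth):

    // Illustrative sine-table ROM; the tools map this onto iCE40 EBR blocks.
    module wavetable (
        input             clk,
        input      [8:0]  addr,   // 512 entries
        output reg [15:0] data    // 16-bit samples (narrow to [9:0] for a 10-bit DAC)
    );
        reg [15:0] mem [0:511];
        initial $readmemh("sine.hex", mem);   // one hex word per line, generated offline
        always @(posedge clk)
            data <= mem[addr];                // synchronous read, needed for BRAM inference
    endmodule

A DDS front end would then just be a phase accumulator whose top 9 bits drive addr, with the whole thing hung off the j1a's io bus as suggested above.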

Mecrisp commented 7 years ago

I found a way to address the whole RAM available in the HX8K - you can move the "fetch" bit out of the address space directly accessible with call/jmp/jnz and do a RAM fetch explicitly by or'ing/add'ing the high fetch bit and passing the result to execute. Ok, it will render the sequence "variable @" longer - not just a single high-call opcode, but a literal and a "call execute" - however, this is compensated by double the amount of RAM available.

Regarding your other questions: I wired in the 16x16=32 multiplier as two different opcodes for the low and high 16-bit parts of the result. I also replaced do-loop with a stack-only version, as James's original variant involving a local variable was not interrupt safe.

You can see my modifications to Swapforth in the current Mecrisp-Ice package on mecrisp.sourceforge.net

Best wishes, Matthias
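At the Forth level, the sequence Matthias describes would look roughly like this (the constant's value and the word name are placeholders; the actual encoding lives in the Mecrisp-Ice sources):

    \ Rough illustration only - bit position and names are assumed.
    $4000 constant fetch-bit       \ the relocated "fetch" bit, outside call/jmp/jnz range
    : upper@ ( addr -- x )         \ fetch a cell from the otherwise unreachable half of RAM
       fetch-bit or execute ;      \ literal plus "call execute" instead of one high-call opcode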