gyurco / MiSTery

Atari ST/STe core for FPGAs

DMA start unreliable #14

Closed harbaum closed 8 months ago

harbaum commented 8 months ago

This line is a little suspicious to me:

https://github.com/gyurco/MiSTery/blob/5614b859b038adadb3661ab0060256413e78c9db/atarist/dma.v#L307

as it will only start the DMA when the write pointer is 8 or 0. This means there must be exactly 8 entries in the FIFO, otherwise the DMA will not start. But the fill level can differ from 8 if the FIFO is being refilled while it is still being written to memory. Then the buffer can hold up to 15 valid entries and the DMA would not start.

IMHO, this should look like this:

wire [3:0] fifo_fill = fifo_wptr - fifo_rptr;   
wire fifo_full_8 = fifo_fill >= 8;  
wire fifo_read_start = dma_in_progress && !dma_direction_out && !fifo_read_in_progress && fifo_full_8;

I've changed this in my variant and don't see any negative effect. But I can now just fill the FIFO up as I wish, while the previous version would then block.

The write case already seems to work that way.
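The fill-level arithmetic in the proposed lines can be checked with a small Python model. This is only a sketch: it assumes 4-bit pointers on a 16-word FIFO, as the `[3:0]` declaration suggests, and the function names simply mirror the wire names above.

```python
# Behavioral sketch of the proposed wires, assuming 4-bit read/write pointers.

def fifo_fill(fifo_wptr: int, fifo_rptr: int) -> int:
    """Number of valid entries, computed modulo 16 like a 4-bit subtraction."""
    return (fifo_wptr - fifo_rptr) & 0xF

def fifo_full_8(fifo_wptr: int, fifo_rptr: int) -> bool:
    """True once at least 8 words (16 bytes) are waiting in the FIFO."""
    return fifo_fill(fifo_wptr, fifo_rptr) >= 8

# The subtraction stays correct when the write pointer has wrapped around.
print(fifo_fill(2, 10))    # write pointer wrapped, 8 entries are valid
print(fifo_full_8(7, 0))   # only 7 entries, below the threshold
print(fifo_full_8(15, 0))  # 15 entries, above the threshold
```

One consequence of the 4-bit difference is that 16 stored words would read as a fill level of 0, which fits the remark above that the buffer holds at most 15 valid entries.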

gyurco commented 8 months ago

AFAIK this is how the original DMA chip works. It won't start a transfer unless there are at least 16 bytes in it. Look at this PDF, page 10, DMA Programming Tips and idiosyncrasies: http://info-coach.fr/atari/documents/_mydoc/FD-HD_Programming.pdf

harbaum commented 8 months ago

Yes. But imho the current version only starts if there are exactly 8 words (16 bytes) in it. The FIFO is 16 words deep, though, and it won't start if there are more than 8 words in it. Usually this means that at some point before there were 8 words in it. But if that happened while the DMA was still in progress, then the DMA misses the start. At least that's what I see when I run a simulation and fill the FIFO very fast.
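The missed start can be reproduced in a small cycle-based Python model. It is a sketch under stated assumptions, not the code in dma.v: `strict_rule` paraphrases the "write pointer is 8 or 0" description from this thread, `relaxed_rule` is the proposed `>= 8` check, and `simulate`, `fill_period` and `drain_period` are hypothetical names. Each burst drains 8 words, and the producer stops writing at 15 valid entries.

```python
def strict_rule(wptr, fill):
    # Paraphrase of the current behaviour: start only on a pointer value of 0 or 8.
    return wptr in (0, 8) and fill != 0

def relaxed_rule(wptr, fill):
    # Proposed behaviour: start whenever at least 8 words are available.
    return fill >= 8

def simulate(start_rule, total_words=64, fill_period=1, drain_period=4,
             max_cycles=2000):
    """Return the cycle count needed to drain all words, or None on a stall."""
    wptr = rptr = 0               # 4-bit FIFO pointers
    written = drained = 0
    in_progress = False
    burst_left = 0
    for cycle in range(max_cycles):
        fill = (wptr - rptr) & 0xF
        # The start condition is only looked at while no burst is running.
        if not in_progress and start_rule(wptr, fill):
            in_progress, burst_left = True, 8
        # Producer: writes every fill_period cycles, never beyond 15 entries.
        if written < total_words and fill < 15 and cycle % fill_period == 0:
            wptr = (wptr + 1) & 0xF
            written += 1
        # Consumer: one word every drain_period cycles during a burst.
        if in_progress and cycle % drain_period == 0:
            rptr = (rptr + 1) & 0xF
            drained += 1
            burst_left -= 1
            if burst_left == 0:
                in_progress = False
        if drained == total_words:
            return cycle + 1
    return None

print(simulate(strict_rule))                 # fast producer
print(simulate(relaxed_rule))                # fast producer, relaxed start
print(simulate(strict_rule, fill_period=8))  # producer slower than the drain
```

In this model the strict rule stalls with a fast producer, because the write pointer passes 0 while the first burst is still running and then stops at 15 valid entries without ever landing on 0 or 8 again. With a producer slower than the drain, the strict rule completes as well, which matches the observation that slow data sources have worked so far.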

gyurco commented 8 months ago

That makes sense. However, how fast do you fill the FIFO? The condition is latched at clk (32 MHz) rate:

if (fifo_read_start) fifo_read_in_progress <= 1'b1;

IMHO it's not possible to fill it faster than it acts upon the condition.

harbaum commented 8 months ago

I am not sure about this. I simply fill the FIFO at full speed (32 MHz) from an internal sector buffer and it's emptied by my testbench at 8 MHz. This is of course only what my simulation does. I was assuming that the real hardware is even slower. With RAM at 250 ns and video running I'd assume that 500 ns/2 MHz would be a reasonable real-life value.

The sector buffer is there since my SD card delivers at 8 MBytes/s and the DMA cannot keep up with that. So I fill the FIFO at full throttle but make sure I stop writing whenever it's full. I was hoping that this would give better performance than never filling the buffer more than 50%, which is what the current start/stop implementation is compatible with.

gyurco commented 8 months ago

Yepp, 8 MB/sec is definitely too much. I read somewhere the maximum transfer rate of ACSI is about 2 MB/s (which is just half of the RAM speed?).
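The rates mentioned so far can be put side by side with a little arithmetic. This sketch assumes 16-bit FIFO words (8 words = 16 bytes, as stated above) and a 512-byte sector, which is an assumption since no sector size is given in this thread. The 2 MHz word rate and the 2 MB/s and 8 MB/s figures are the estimates quoted in the comments, not measurements.

```python
BYTES_PER_WORD = 2   # 16-bit FIFO words
SECTOR_BYTES = 512   # assumed sector size

def bytes_per_second(word_rate_hz: int) -> int:
    """Throughput of a path that moves one 16-bit word per tick."""
    return word_rate_hz * BYTES_PER_WORD

def sector_time_us(rate_bytes_per_s: int) -> float:
    """Time in microseconds to move one sector at the given byte rate."""
    return SECTOR_BYTES * 1_000_000 / rate_bytes_per_s

ram_rate = bytes_per_second(2_000_000)   # one word per 500 ns
print(ram_rate)                          # bytes/s of the estimated RAM side
print(sector_time_us(8_000_000))         # SD card side, microseconds per sector
print(sector_time_us(ram_rate))          # estimated RAM side
print(sector_time_us(2_000_000))         # quoted ACSI maximum
```

With these numbers one word per 500 ns works out to 4 MB/s, so the quoted 2 MB/s ACSI figure is indeed half of that, and the SD card delivers a sector two to four times faster than the DMA can drain it.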

harbaum commented 8 months ago

The thing is that on real ACSI there is the DRQ/ACK handshake making sure the device doesn't overflow the FIFO. I don't think we have something like this, so I just fill the FIFO until it's full.

I don't think there's a guaranteed bandwidth for the DMA as e.g. the blitter or other bus masters may throttle it.

So what is "slow enough"? ACSI may sometimes be able to cope with 2 MBytes/s, but what's the minimum data rate it can guarantee? It seems that so far the data sources were just slow enough. But that may have been pure luck.

gyurco commented 8 months ago

On MiST, I had to decrease the SPI clock to make it reliable, otherwise data corruption occurred at 24 MHz. There's no ACSI DRQ, as the ARM's SPI hardware couldn't handle it, but it's not hard to add, similarly to the FDC DRQ. I just need to investigate the original DMA chip for its rate. Is it asserted as soon as there is room in the FIFO? Or just at a defined rate, like 2 MHz? I saw a signal capture for ACSI DRQ/ACK somewhere and need to find it again.

harbaum commented 8 months ago

Polling some handshake via SPI will surely slow things down extremely. Having a separate sector buffer inside the DMA is imho the simplest solution.

I opened this issue for documentation and for clarification only. You or someone else might have run into similar issues as me.

gyurco commented 8 months ago

Or just add an external FIFO, and do the handshake with that with a DRQ signal (that's more original HW-like, which I prefer).
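A request-driven handshake of this kind can be sketched in Python. This is a minimal model of one of the two possibilities raised above, namely a request that is active whenever the FIFO has room. The class `HandshakeFifo`, its `drq` property and the `transfer` helper are hypothetical names, and the signal naming follows the usage in this thread rather than the ACSI bus specification.

```python
from collections import deque

class HandshakeFifo:
    """External FIFO that requests data only while it has room."""

    def __init__(self, depth=16):
        self.depth = depth
        self.words = deque()

    @property
    def drq(self):
        # Assumption: the request is active as soon as there is free space.
        return len(self.words) < self.depth

    def push(self, word):
        if not self.drq:
            raise OverflowError("word offered while no request was active")
        self.words.append(word)

    def pop(self):
        return self.words.popleft()

def transfer(data, drain_period=4):
    """Move data through the FIFO; the source only sends while drq is set."""
    fifo = HandshakeFifo()
    pending = deque(data)
    received = []
    cycle = 0
    while pending or fifo.words:
        if pending and fifo.drq:                      # source honours the request
            fifo.push(pending.popleft())
        if fifo.words and cycle % drain_period == 0:  # slower memory side
            received.append(fifo.pop())
        cycle += 1
    return received

print(transfer(list(range(100))) == list(range(100)))
```

Because the source never sends without an active request, the FIFO cannot overflow regardless of how slowly the memory side drains it. The cost is that the source has to observe the request, which is the part that is hard to do over SPI.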

harbaum commented 8 months ago

That's true. If I am in the mood I may do that. It'd imho then also be nice to move the FDC behind the DMA. I'll think about that once everything works stably.

gyurco commented 8 months ago

The FDC is in the right place, I think. It's like on the original ST(e). But it wouldn't make a big difference if it was instantiated in the DMA module.

harbaum commented 8 months ago

IMHO, the FDC is "behind" the DMA. At least my copy of "Das Profibuch" suggests so.

gyurco commented 8 months ago

Logically you can say that. But in HW, the DMA chip is independent of the FDC chip, using a separate external bus.

harbaum commented 8 months ago

Uhm ... there it is, right behind the DMA sharing the data bus with the ACSI

image

gyurco commented 8 months ago

Yepp, but if you instantiate it in the DMA module, then it'll look like the DMA chip includes the FDC chip (like they put both in a single ASIC)...while in real HW, you can remove them independently of each other. However it's just a logical thing, doesn't really matter.

harbaum commented 8 months ago

Ok. That's probably a matter of taste. For me a submodule is something that only has interfaces to the parent module and which would not work at all without the parent module. Basically something you could implement on a physical submodule. You could basically build an ACSI + Floppy + DMA sub-board and plug that into the DMA socket. It would imho be elegant if none of the connections between FDC, ACSI and DMA showed up in any of the parent modules. Of course we are dealing with external connections here, so some of them need to be routed to the top. But that's just because it's impossible to do otherwise.

I would e.g. also like to implement a Mega STE compatible 16 MHz solution as a module that replaces the current 8 MHz CPU.

But as I said: That's probably just a matter of taste.

gyurco commented 8 months ago

Then I would do a two level top-{dma, fdc} module, but yes, it's a matter of taste.

So you want to implement a CPU with a cache module? That would be interesting. Also to see how it's performing against the 'turbo' bus option.

harbaum commented 8 months ago

> So you want to implement a CPU with a cache module? That would be interesting. Also to see how it's performing against the 'turbo' bus option.

Yes. But that's pretty far down on the to-do list. And this would require some 32 kBytes of RAM for data and tag cache, which I am not sure the Tang Nano 20k still has left for me.

Plus there aren't many uses for it in my setup, as it's mostly meant for gaming and simple programs. But Cubase would surely benefit from it. That alone may be reason enough ...