stnolting / neorv32

:desktop_computer: A small, customizable and extensible MCU-class 32-bit RISC-V soft-core CPU and microcontroller-like SoC written in platform-independent VHDL.
https://neorv32.org
BSD 3-Clause "New" or "Revised" License

[feature request] Combining SPI and DMA #861

Open jpf91 opened 3 months ago

jpf91 commented 3 months ago

Is your feature request related to a problem? Please describe.

I'm currently trying to find the best-performing SPI SD card implementation for the NEORV32 controller. At the moment, using the FIFO together with interrupts seems to be the way to go. However, SD card accesses always transfer complete sectors, i.e. at least 512 bytes, so it would be nice to use DMA for that.

However:

1) Half-duplex sending via DMA should already be possible: configure the source, set the destination to the TX FIFO and use SPI_CTRL_TX_NHALF as the trigger.
2) Half-duplex receiving is not possible: the DMA could be configured with the SPI DATA register as source, memory (with address increment) as destination and SPI_CTRL_IRQ_RX_AVAIL as the trigger. However, read transfers still have to be initiated by writing to the data register, so that register still has to be written without the DMA.
3) Full-duplex read/write is not possible either and probably won't work with fewer than two DMA channels.
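Case 1) can be sanity-checked with a small software model. This sketch is not NEORV32 code: the FIFO depth of 8, the DMA moving one byte per tick and the 8-tick SPI byte time are assumptions, and the trigger condition only mimics what SPI_CTRL_TX_NHALF is assumed to signal (TX FIFO less than half full). It shows that a FIFO-level trigger keeps the TX FIFO topped up without ever overflowing it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only: depth and timing are assumptions, not NEORV32 values. */
#define TX_FIFO_DEPTH 8u

typedef struct {
    uint8_t buf[TX_FIFO_DEPTH];
    size_t  head, count;
} tx_fifo_t;

/* Mimics the assumed SPI_CTRL_TX_NHALF condition: FIFO less than half full. */
static bool tx_fifo_not_half_full(const tx_fifo_t *f) {
    return f->count < TX_FIFO_DEPTH / 2u;
}

static void tx_fifo_push(tx_fifo_t *f, uint8_t b) {
    f->buf[(f->head + f->count) % TX_FIFO_DEPTH] = b;
    f->count++;
}

static uint8_t tx_fifo_pop(tx_fifo_t *f) {
    uint8_t b = f->buf[f->head];
    f->head = (f->head + 1u) % TX_FIFO_DEPTH;
    f->count--;
    return b;
}

/* Sends len bytes from src; the bytes that leave the SPI are stored in wire.
   *max_fill receives the highest FIFO fill level seen. Returns bytes sent. */
static size_t simulate_dma_tx(const uint8_t *src, size_t len,
                              uint8_t *wire, size_t *max_fill) {
    tx_fifo_t fifo = {0};
    size_t sent = 0, dma_idx = 0;
    *max_fill = 0;
    for (unsigned tick = 0; sent < len; tick++) {
        /* DMA: move one byte per tick while the trigger condition holds. */
        if (dma_idx < len && tx_fifo_not_half_full(&fifo)) {
            tx_fifo_push(&fifo, src[dma_idx++]);
            if (fifo.count > *max_fill) *max_fill = fifo.count;
        }
        /* SPI: one byte leaves the FIFO every 8 ticks (8 SCK cycles, assumed). */
        if (tick % 8u == 7u && fifo.count > 0u) {
            wire[sent++] = tx_fifo_pop(&fifo);
        }
    }
    return sent;
}
```

Feeding a 16-byte buffer through simulate_dma_tx() reproduces it on the wire in order, and the peak fill level stays at half the FIFO depth.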

Describe the solution you'd like
Describe alternatives you've considered

I'm not sure what the best solution is. For 2) alone, an SPI mode modifier that simply triggers another transfer whenever the RX FIFO is read could be used. You would then read one element less via DMA, disable this mode and read the final element manually... Or does the Wishbone bus have some way to signal that an access is the last one of a larger transfer? That information could then be used to avoid triggering another transfer for the last element.
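The receive-only idea for 2) can be modelled in a few lines to check the byte accounting. Everything here is hypothetical: spi_model_t, the rx_poll bit and the instant-completion timing do not exist in NEORV32, and the loop marked "done by the DMA" stands in for a DMA channel triggered by SPI_CTRL_IRQ_RX_AVAIL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical SPI model: the rx_poll bit does not exist in NEORV32 today. */
typedef struct {
    bool           rx_poll;   /* proposed mode bit */
    uint8_t        last_tx;   /* last value written to the data register */
    size_t         transfers; /* number of SPI transfers started so far */
    const uint8_t *card;      /* bytes the slave returns, one per transfer */
    uint8_t        rx;        /* single-entry RX buffer for simplicity */
} spi_model_t;

static void spi_write(spi_model_t *s, uint8_t b) {
    s->last_tx = b;
    s->rx = s->card[s->transfers++]; /* transfer completes instantly here */
}

static uint8_t spi_read(spi_model_t *s) {
    uint8_t b = s->rx;
    if (s->rx_poll) {
        spi_write(s, s->last_tx);    /* reading re-submits the last TX value */
    }
    return b;
}

/* Receives n >= 1 bytes: n-1 via the (modelled) DMA, the last one by hand. */
static void receive_block(spi_model_t *s, uint8_t *dst, size_t n) {
    s->rx_poll = true;
    spi_write(s, 0xFF);              /* CPU kicks off the first transfer */
    for (size_t i = 0; i < n - 1u; i++) {
        dst[i] = spi_read(s);        /* done by the DMA in real hardware */
    }
    s->rx_poll = false;              /* de-configure the mode... */
    dst[n - 1u] = spi_read(s);       /* ...and read the final byte manually */
}
```

Because the mode is cleared before the last read, exactly n transfers are clocked and the card is never asked for an extra byte.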

For 3), I think 2 DMA channels will be necessary. With 2 DMA channels, use case 2) is covered as well. However, if only reading is required, the additional DMA channel causes more bus transfers. A partial solution would be to implement 1:N transfers from a register inside the DMA controller instead of from memory, so that at least the read request is not required. Still, the receive-only SPI mode described above would need fewer bus transfers than any solution with 2 DMA channels.
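The bus-transfer argument can be made concrete with a rough count for receiving n bytes. This is back-of-the-envelope arithmetic under stated assumptions (one bus read plus one bus write per DMA element, a 1:N constant source that needs no read, configuration writes ignored), not a measurement; the scheme names are hypothetical labels.

```c
#include <stddef.h>

/* Rough bus-transaction counts for receiving n bytes over SPI.
   Assumptions: each DMA element = one bus read + one bus write, a
   constant-source (1:N) DMA skips the read, config writes are ignored. */
typedef enum {
    RX_TWO_CH_MEM_SRC,   /* TX channel copies dummy 0xFF bytes from memory */
    RX_TWO_CH_CONST_SRC, /* TX channel writes a constant held in the DMA */
    RX_POLL_MODE         /* hypothetical receive-only SPI mode, one channel */
} rx_scheme_t;

static size_t rx_bus_transactions(rx_scheme_t scheme, size_t n) {
    size_t rx_side = 2u * n;  /* read SPI data register + write to memory */
    switch (scheme) {
    case RX_TWO_CH_MEM_SRC:   return rx_side + 2u * n; /* read dummy + write SPI */
    case RX_TWO_CH_CONST_SRC: return rx_side + n;      /* write SPI only */
    case RX_POLL_MODE:        return rx_side + 1u;     /* one initial CPU write */
    }
    return 0u;
}
```

For a 512-byte sector this gives 2048, 1536 and 1025 transactions respectively, which matches the ordering argued above.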

I wonder how other microcontrollers handle this.

Additional context
For the SD card use case, half-duplex support is fine: after a command has been sent, the SD card keeps returning 0xFF for an unspecified number of bytes. Once it has prepared the response, it returns 0xFE, followed by all the data. As the total transfer length is variable, this can't be done in one single SPI request anyway.
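A sketch of how a driver might split such a read: poll byte by byte until the 0xFE start token arrives, then read the fixed 512-byte payload (the part a DMA could take over) and the 16-bit CRC that follows every data block in SPI mode. The spi_xfer_fn hook, the mock card and SD_POLL_LIMIT are hypothetical illustrations, not NEORV32 API.

```c
#include <stddef.h>
#include <stdint.h>

#define SD_BLOCK_SIZE  512u
#define SD_START_TOKEN 0xFEu
#define SD_POLL_LIMIT  10000u /* hypothetical timeout, not from the SD spec */

/* Hypothetical hook: clocks one byte out on SPI and returns the byte received. */
typedef uint8_t (*spi_xfer_fn)(void *ctx, uint8_t tx);

/* Returns 0 on success, -1 if the start token never arrived,
   -2 if the card sent something other than 0xFF or 0xFE. */
static int sd_read_data_block(spi_xfer_fn xfer, void *ctx,
                              uint8_t *dst, uint16_t *crc) {
    uint8_t b;
    unsigned polls = 0;
    /* Phase 1: variable-length wait, the card answers 0xFF until ready. */
    do {
        b = xfer(ctx, 0xFFu);
    } while (b == 0xFFu && ++polls < SD_POLL_LIMIT);
    if (b == 0xFFu) return -1;
    if (b != SD_START_TOKEN) return -2;  /* e.g. a data error token */
    /* Phase 2: fixed-length payload, the part a DMA could take over. */
    for (size_t i = 0; i < SD_BLOCK_SIZE; i++) {
        dst[i] = xfer(ctx, 0xFFu);
    }
    /* The data block is followed by a 16-bit CRC, MSB first. */
    *crc = (uint16_t)(xfer(ctx, 0xFFu) << 8);
    *crc |= xfer(ctx, 0xFFu);
    return 0;
}

/* Off-target mock: replays a canned response stream, then idles at 0xFF. */
typedef struct { const uint8_t *bytes; size_t len, pos; } mock_card_t;

static uint8_t mock_xfer(void *ctx, uint8_t tx) {
    mock_card_t *c = ctx;
    (void)tx;
    return (c->pos < c->len) ? c->bytes[c->pos++] : 0xFFu;
}
```

Only phase 1 has an unknown length; once the token has been seen, the remaining 514 bytes are a fixed-size transfer.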

stnolting commented 3 months ago

So you want to read a whole sector from the SD card with minimal CPU interaction, right?

With the current set of features this is what I would (quite naively) do:

I'm not sure what the best solution is. For 2) alone, an SPI mode modifier that simply triggers another transfer whenever the RX FIFO is read could be used. You would then read one element less via DMA, disable this mode and read the final element manually

Hmm, interesting idea... But I am not sure how this could be implemented (with some reasonable amount of logic 😅). Any ideas?

Or does the Wishbone bus have some way to signal that an access is the last one of a larger transfer? That information could then be used to avoid triggering another transfer for the last element.

I'm not sure; maybe there are special burst modes available for this. However, please keep in mind that the internal bus is not Wishbone (even if we have stolen... er, used a lot of it).

For 3), I think 2 DMA channels will be necessary.

Having several channels would be nice. Basically, you could just instantiate another DMA controller and use that for the second channel.

Another option I was thinking about when writing the DMA was a "descriptor-based" DMA: you put all the configurations as "structs" into RAM, chain them via pointers, and the DMA happily executes them one after another until it reaches the end of the chain. 🤔
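A descriptor chain of this kind might look like the following. The descriptor layout and dma_run_chain() are purely illustrative assumptions, a software model of what the hardware would do after fetching each descriptor from RAM. A non-incrementing source also covers the 1:N "constant source" transfer mentioned earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout; field names and widths are illustrative. */
typedef struct dma_desc {
    const uint8_t         *src;
    uint8_t               *dst;
    uint32_t               len;     /* number of byte elements */
    uint8_t                src_inc; /* 1: increment source address */
    uint8_t                dst_inc; /* 1: increment destination address */
    const struct dma_desc *next;    /* NULL terminates the chain */
} dma_desc_t;

/* Software model of the engine: execute descriptors until the end of the
   chain is reached. Returns the number of descriptors executed. */
static size_t dma_run_chain(const dma_desc_t *d) {
    size_t executed = 0;
    for (; d != NULL; d = d->next, executed++) {
        const uint8_t *s = d->src;
        uint8_t *t = d->dst;
        for (uint32_t i = 0; i < d->len; i++) {
            *t = *s;
            if (d->src_inc) s++;
            if (d->dst_inc) t++;
        }
    }
    return executed;
}
```

Chaining a memory-to-memory copy with a constant fill, for example, executes both descriptors with a single start of the engine.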

For the SD card use case, half-duplex support is fine: after a command has been sent, the SD card keeps returning 0xFF for an unspecified number of bytes. Once it has prepared the response, it returns 0xFE, followed by all the data. As the total transfer length is variable, this can't be done in one single SPI request anyway.

Oh, I was assuming a fixed latency... OK, this makes interfacing a bit more complex.

How about a dedicated SD-card controller? 🤔

jpf91 commented 3 months ago

With the current set of features this is what I would (quite naively) do:

Good idea, thanks! I'd like to keep the SPI FIFO a bit smaller, though. Right now I keep all the transfer logic in the interrupt handler (this is a FreeRTOS project, so switching to a task might add some overhead). If this turns out to be a performance issue, I will try your suggestion. (Unfortunately, I don't have the hardware here at the moment, so I can't test it yet.)

Hmm, interesting idea... But I am not sure how this could be implemented (with some reasonable amount of logic 😅). Any ideas?

I guess a single configuration bit for an "RX poll mode" could be used. It would indicate that the last value in the TX FIFO is resubmitted whenever the RX register is read. However, thinking about this some more, this might be over-engineering.

I'm not sure; maybe there are special burst modes available for this. However, please keep in mind that the internal bus is not Wishbone (even if we have stolen... er, used a lot of it).

I think I should do some research into how other controllers handle this situation :thinking:

Another option I was thinking about when writing the DMA was a "descriptor-based" DMA: you put all the configurations as "structs" into RAM, chain them via pointers, and the DMA happily executes them one after another until it reaches the end of the chain. 🤔

That sounds nice, although I guess independent triggers are sometimes nice, too.

Some context: We're hosting an FPGA development lab course at my university. In this course, students design an I2S peripheral to interface external DACs for audio playback, integrated with the NEORV32 SoC. So far, our demo software only generates a sine wave, which is quite boring :wink: As I finally have some time for this, I'm currently implementing the real music player firmware. In that context, I could make use of another (manually triggered) DMA channel to copy data from the audio buffer to the I2S master :see_no_evil:

How about a dedicated SD-card controller? 🤔

That's probably the best solution, but I'll try to make it work with the SPI interface for now. Maybe one of our students next year could implement such a controller :thinking:

So to summarize, I guess the most general solution would be to just have more (maybe even a configurable number of) DMA channels. With 2 channels the SPI use cases would also be covered.

stnolting commented 3 months ago

So to summarize, I guess the most general solution would be to just have more (maybe even a configurable number of) DMA channels. With 2 channels the SPI use cases would also be covered.

I see the problem, but do we need two independent channels (i.e. two individual bus access engines) or just two individual descriptors (i.e. two sets of configuration bits to describe an actual DMA transfer)? 🤔
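To make the distinction concrete, here is a hypothetical model of the second option: one bus access engine with several descriptor register sets. Each set has its own trigger, but the engine moves at most one element per step, chosen round-robin, so the bus still sees a single master. dma_set_t, dma_step() and the round-robin policy are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_NUM_DESC 2u

/* Hypothetical "one engine, several descriptor register sets" model. */
typedef struct {
    const uint8_t *src;
    uint8_t       *dst;
    uint32_t       remaining;
    bool           src_inc, dst_inc;
    bool           trigger;   /* set by the peripheral event tied to this set */
} dma_set_t;

typedef struct {
    dma_set_t set[DMA_NUM_DESC];
    unsigned  last;           /* index served last (round-robin pointer) */
} dma_engine_t;

/* One engine step: serve at most ONE element from ONE triggered set.
   Returns the index of the served set, or -1 if nothing was pending. */
static int dma_step(dma_engine_t *e) {
    for (unsigned k = 1; k <= DMA_NUM_DESC; k++) {
        unsigned i = (e->last + k) % DMA_NUM_DESC;
        dma_set_t *s = &e->set[i];
        if (s->trigger && s->remaining > 0u) {
            *s->dst = *s->src;
            if (s->src_inc) s->src++;
            if (s->dst_inc) s->dst++;
            s->remaining--;
            s->trigger = false;   /* one element per trigger event */
            e->last = i;
            return (int)i;
        }
    }
    return -1;
}
```

With one set pointing at the SPI TX register and another at the RX register, both directions of a full-duplex transfer progress through a single engine; only the register sets are duplicated.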

andkae commented 2 months ago

just two individual descriptors

An approach like Altera's could also be an idea: you define a stack of descriptors in RAM, or in a dedicated descriptor area of the DMA.


stnolting commented 2 months ago

Right, I really like this concept. You could add another field for a "pointer" and then put as many descriptors into RAM as you like and "chain" them to execute several transfers in a row. 🤔

Obviously, the DMA would require additional hardware to load a descriptor from memory.

Do you think this would be a handy feature?

andkae commented 2 months ago

Hi Stephan,

Obviously, the DMA would require additional hardware to load a descriptor from memory.

Yes, at least the capacity for one complete descriptor, to avoid occupying the bus too often. But when the system designer decides to use a DMA, I think they have to keep in mind that it is, in general, a second processor. Therefore, some LEs are necessary.

With such a mechanism, the IRQs could also be forwarded to the DMA directly, i.e.:
DMA-IRQ0 = SPI-TX --> execute descriptor base pointer A
DMA-IRQ1 = SPI-RX --> execute descriptor base pointer B
...

When the IRQ fires, the DMA uses the base pointer and executes the transfer. An SPI transfer would then run without any CPU interaction.
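The trigger-to-descriptor mapping could be modelled as a small table of chain base pointers indexed by trigger input. The trigger IDs, desc_t and dma_on_trigger() are hypothetical; for simplicity, one trigger event runs the whole chain here, whereas a real SPI RX trigger would more likely fire once per received element.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trigger IDs, following the DMA-IRQ0/DMA-IRQ1 example above. */
enum { DMA_TRIG_SPI_TX = 0, DMA_TRIG_SPI_RX = 1, DMA_NUM_TRIGGERS = 2 };

typedef struct desc {
    const uint8_t     *src;
    uint8_t           *dst;
    uint32_t           len;
    int                src_inc, dst_inc;
    const struct desc *next;   /* NULL terminates the chain */
} desc_t;

/* One chain base pointer per trigger input. In hardware, this table is the
   per-trigger storage the DMA itself would have to hold. */
typedef struct {
    const desc_t *base[DMA_NUM_TRIGGERS];
} dma_irq_map_t;

/* Called when trigger line `trig` fires: runs that trigger's chain without
   the CPU. Returns the number of bytes moved (0 for an unknown trigger). */
static uint32_t dma_on_trigger(const dma_irq_map_t *map, unsigned trig) {
    uint32_t moved = 0;
    if (trig >= DMA_NUM_TRIGGERS) return 0;
    for (const desc_t *d = map->base[trig]; d != NULL; d = d->next) {
        for (uint32_t i = 0; i < d->len; i++) {
            d->dst[d->dst_inc ? i : 0] = d->src[d->src_inc ? i : 0];
        }
        moved += d->len;
    }
    return moved;
}
```

With 32-bit addresses, each trigger input costs one 32-bit base-pointer register in such a table.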

Do you think this would be a handy feature?

Question to @jpf91: Would it be helpful for your use case?

BR, Andreas

stnolting commented 2 months ago

But when the system designer decides to use a DMA, I think they have to keep in mind that it is, in general, a second processor.

True, but let's keep it smaller than an actual second CPU core 😅

With such a mechanism, the IRQs could also be forwarded to the DMA directly, i.e.: DMA-IRQ0 = SPI-TX --> execute descriptor base pointer A

That would mean you need to store the base addresses of ALL descriptors somewhere inside the DMA, right?

andkae commented 2 weeks ago

True, but let's keep it smaller than an actual second CPU core

For sure :D :D :D

That would mean you need to store the base addresses of ALL descriptors somewhere inside the DMA, right?

Yes, I would expect so. If the descriptors were placed in main memory, loading them would block the main memory interface, and you would have one additional device in the memory arbitration.