Closed msperl closed 8 years ago
Are they changes that affect other modules, or strictly internal implementation details?
In general, I wouldn't worry about it. We frequently have to merge upstream changes with downstream patches, and the fact that you have contributed both (and thank you for that, Martin) doesn't really change that.
Having said that, if you think the merge may not be trivial then you could send us a patch that gets us from the top of our tree to what you think the final code should look like, and then we can check post-merge that we end up in the right place.
I would take care to upstream patches the same way you did - with the exception that I may move the dma pool in earlier (as I need to merge some of the cyclic and slave code into a single shared piece of code) - anyways: I would do that slightly differently...
My biggest problem is: I can not test the cyclic dma code - no experience how that works and what HW I would need
An external audio HAT is probably the best way to test cyclic DMA. Several purveyors of fine audio cards frequent these parts - @hifiberry and @iqaudio come to mind.
@msperl Contact me at support@hifiberry.com if you need a board for testing purposes.
@HiassofT may also be able to help with testing.
The biggest thing is the learning-curve of i2s (which obviously requires some of my time)... @hifiberry: contacting you - I hope it does not require that many jumper-cables, as my dev board is a CM... @HiassofT may do some independent testing to see that it works...
I will send you the DAC+ Light, this one needs only 3 data lines + 5V + GND, no additional I2C connection is needed for this one.
thanks
@msperl You can also ping me with patches if you want any testing done in this specific area. I have a test group comprising a Zero, CM, A+, B+ and 2B, which each have I2S DAC attached, outputting audio 24/7..... Currently running downstream 4.3 plus backport of the dma pool patch from downstream 4.4rc, but can be changed to whatever you want tested.
Who needs a board? Happy to ship worldwide :-)
Seasons greetings,
Gordon@iqaudio.com
On 18 Dec 2015, at 08:44, Phil Elwell notifications@github.com wrote:
An external audio HAT is probably the best way to test cyclic DMA. Several purveyors of fine audio cards frequent these parts - @hifiberry and @iqaudio come to mind.
— Reply to this email directly or view it on GitHub.
@msperl just drop me a line if you need some help with testing or if you have questions about the I2S code
Note that there's an unfixed bug in the cyclic DMA code, it would be good to resolve this before upstreaming the code.See #1193
As for testing with upstream: The hifiberry DAC+ light seems to be a good candidate as it has very few external code dependencies. I guess you should get it working if you add the hw_params function plus the .ops definition from the downstream driver to the test code I wrote for testing the dmapool patch http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/387442.html
FYI: here's the bcm2835-dma code I'm currently testing with kernel 4.3: https://gist.github.com/HiassofT/95f073242f19299b4da0
It includes the dmapool patch and I've removed the incorrect period-splitting/rearrangement code from bcm2835_dma_prep_dma_cyclic.
If a period larger than the maximum supported length is requested it errors out. Likewise if the buffer length is not a multiple of period length. This matches the implementation of most other DMA drivers, eg dma-axi-dmac, imx-sdma, mxs-dma - and simplifies the code quite a bit.
So far I haven't run into any issues with these changes, the only requirement is that period_bytes_max in bcm2835-i2s is set up correctly (to 64k-4 bytes).
@HiassofT Now running your latest bcm2835-dma code on 3 of my I2S test group. All looks good so far. I'll switch the others to it tomorrow.
@HiassofT: Reviewing existing patches so that I can argue them (playing a bit of devils advocate):
@HiassofT: Reviewing existing patches so that I can argue them (playing a bit of devils advocate):
That's a good idea, let's take the code apart and see if we can find more skeletons in the closet :)
why do we switch to dmapool implementation? What is the advantage/arguments/measurements supporting those arguments?
Since kernel 4.3 dma_free_coherent complains loudly when it's beeing called from interrupt context - which happens when an audio device is closed. Switching to dmapool is one solution to that problem. See discussion here: https://github.com/raspberrypi/linux/pull/1178#issuecomment-154164418
why is there a different cyclic limit for cyclic DMA channels - I see nothing obvious? Would not: "#define MAX_LITE_TRANSFER (SZ_64K - SZ_K)" work more efficiently? (I can guess about the - 4 byte case - 2 transfers - one of 65532 bytes and one 4 bytes long with the corresponding CB overhead of 64 bytes to load for the control block.)
That was a temporary quick-fix to get rid of clicks during audio playback starting with kernel 4.2. See here: https://github.com/raspberrypi/linux/pull/1153#issuecomment-147090013
This quick-fix just masked the underlying bugs in the code, but it took a while until I found out what exactly was going wrong.
The cyclic DMA code really shouldn't mess around with the parameters. The userspace application chooses buffer and period sizes within the limits reported by the PCM driver (bcm2835-i2s) so that it can meet it's latency etc constraints - and it relies on the PCM/DMA driver to operate as requested. If it fails to do so you may get buffer over/undderruns, manifesting in clicks or dropouts during recording and other nasty stuff - IOW it may break audio.
I've just PRed my proposed fixes for the 4.4 branch, the bcm2835-dma code is identical to the 4.3 one I uploaded to gist (modulo some whitespace/comment diffs). See here: #1233
Other comments on commits in foundation 4.4.y:
So my approach would be:
I think https://github.com/raspberrypi/linux/commit/f02f5b8d103f39629e5e27c11c8616614e1bba7f can be dropped. We were using it for debugging sdcard errors as previously increased dma wait states had worked around an sdcard driver bug. However it is not known to help with current driver.
In the meantime I found that i2s does no longer work upstream because of the new clockmanager by @anholt - I wonder when this does filter down into the foundation kernels...
Does rpi-4.4.y have the i2s problem? It builds clk-bcm2835 for downstream kernel. I believe i2s audio is working in Milhouse OpenELEC builds which use rpi-4.4.y.
Yes, we are likely to get any upstream patches eventually, so identifying any upstream commits that break things we care about and reporting them would be good.
Does rpi-4.4.y have the i2s problem? It builds clk-bcm2835 for downstream kernel.
i2s cards currently work fine in rpi-4.4.y.
I ran into the same problem when testing the dmapool patch with upstream 4.4-rc2 and had to disable the clk driver in DT. See http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/387442.html
I know, but as this effort is to upstream the dma code we need also to fix that portion - otherwise we can not test it...
as @popcornmix said: we need to fix those anyway to avoid the roadblocks...
b59b3cb4d1cbe396d439f76abfcfbee786fcce97 is reverted in my proposed fix #1233 (I didn't add an explicit revert commit).
Instead of looking at the single diffs it's easier to compare the resulting file. I did this for the cyclic setup code and compared upstream+dmapool patch to downstream plus my proposed fix. The end result is basically just added sanity checks, most of the other downstream changes are reverted. https://gist.github.com/HiassofT/639a2069d56b2874fce9
I think we may drop b12e830e429106a3457cf285c7071dfb51096e87, that was only needed because the dma channels hadn't been defined in DT. I've created #1235 to clarify this.
2bc2902a664d66f4eeb83bb94d55e45d2e00e1fa changes the slave_sg code to use dmapool as well.
Regarding the calltrace: IRQs are disabled in the alsa code, snd_pcm_period_elapsed (in sound/core/pcm_lib.c) does a snd_pcm_stream_lock_irqsave
While patching the i2s driver to use the clock framework I found that the clock-framework is also "unhappy" about disabled interrupts and complains with WARN_ON.
At least I found that there is a patch for the new clock-framework that also supports PWM - I need to check that when I get back and find a solution for the clock_framework clk_disable - the easiest solution is to keep the clock running when the driver is loaded or we need to defer the disable...
Alternatively we need to clk_disable in a tasklet and at that time we can also abort the dma transfers. That would remove the need for the dmapool.
Actually I would guess this also would show up with other SOC implementations as well...
I'm wondering if moving clock management to DAPM might be a solution. dapm_clock_event could be what we are looking for, and it'll be called from a different context. http://lxr.free-electrons.com/source/sound/soc/soc-dapm.c#L1211
I'm totally unsure though if this is correct at all (the only user of SND_SOC_DAPM_CLOCK_SUPPLY in the kernel is a codec, not a CPU DAI) or if it would open a can of worms - DAPM can be tricky.
Let me see how far I get next week when I am back - right now I get an exception when I use clk_set_rate - it may also be the clock driver patch I created or something else...
Ok - after some reading and some coding one insight into why there are audiable "gaps" during transfers (b59b3cb)...
The reason is that we produce 2 control-blocks with:
note that I got an initial version that (hopefully) is more acceptable to upstream by sharing code between cyclic and slave_sg...
the essential portion (leaving out some definitions) of slave_sg now looks like this:
static struct dma_async_tx_descriptor *bcm2835_dma_prep_slave_sg(
struct dma_chan *chan,
struct scatterlist *sgl, unsigned int sg_len,
enum dma_transfer_direction direction,
unsigned long flags, void *context)
{
struct bcm2835_chan *c = to_bcm2835_dma_chan(chan);
struct bcm2835_desc *d;
dma_addr_t src = 0, dst = 0;
u32 info = BCM2835_DMA_WAIT_RESP;
u32 extra = BCM2835_DMA_INT_EN;
size_t frames;
if (!is_slave_direction(direction)) {
dev_err(chan->device->dev,
"%s: bad direction?\n", __func__);
return NULL;
}
if (c->dreq != 0)
info |= BCM2835_DMA_PER_MAP(c->dreq);
if (direction == DMA_DEV_TO_MEM) {
if (c->cfg.src_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES)
return NULL;
src = c->cfg.src_addr;
info |= BCM2835_DMA_S_DREQ | BCM2835_DMA_D_INC;
} else {
if (c->cfg.dst_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES)
return NULL;
dst = c->cfg.dst_addr;
info |= BCM2835_DMA_D_DREQ | BCM2835_DMA_S_INC;
}
/* count frames in sg list */
for_each_sg(sgl, sgent, sg_len, i)
frames += bcm2835_dma_frames_for_length(c,
sg_dma_len(sgent));
/* allocate the CB chain */
d = bcm2835_dma_create_cb_chain(chan, direction, false,
info, extra,
frames, src, dst, 0, GFP_KERNEL);
if (!d)
return NULL;
/* fill in frames with scatterlist pointers */
bcm2835_dma_fill_cb_chain(chan, direction, d->cb_list,
sgl, sg_len);
return vchan_tx_prep(&c->vc, &d->vd, flags);
}
and the corresponding cyclic code:
static struct dma_async_tx_descriptor *bcm2835_dma_prep_dma_cyclic(
struct dma_chan *chan, dma_addr_t buf_addr, size_t buf_len,
size_t period_len, enum dma_transfer_direction direction,
unsigned long flags)
{
struct bcm2835_chan *c = to_bcm2835_dma_chan(chan);
struct bcm2835_desc *d;
dma_addr_t src, dst;
u32 info = BCM2835_DMA_WAIT_RESP;
u32 extra = BCM2835_DMA_INT_EN;
if (!is_slave_direction(direction)) {
dev_err(chan->device->dev,
"%s: bad direction?\n", __func__);
return NULL;
}
if (!buf_len) {
dev_err(chan->device->dev,
"%s: bad buffer length (= 0)\n", __func__);
return NULL;
}
/* Setup DREQ channel */
if (c->dreq != 0)
info |= BCM2835_DMA_PER_MAP(c->dreq);
if (direction == DMA_DEV_TO_MEM) {
if (c->cfg.src_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES)
return NULL;
src = c->cfg.src_addr;
dst = buf_addr;
info |= BCM2835_DMA_S_DREQ | BCM2835_DMA_D_INC;
} else {
if (c->cfg.dst_addr_width != DMA_SLAVE_BUSWIDTH_4_BYTES)
return NULL;
dst = c->cfg.dst_addr;
src = buf_addr;
info |= BCM2835_DMA_D_DREQ | BCM2835_DMA_S_INC;
}
/* allocate the CB chain */
d = bcm2835_dma_create_cb_chain(chan, direction,
info, extra, true,
0, src, dst, buf_len, GFP_KERNEL);
if (!d)
return NULL;
/* wrap around into a loop */
d->cb_list[d->frames - 1].cb->next = d->cb_list[0].paddr;
return vchan_tx_prep(&c->vc, &d->vd, flags);
}
In addition the dma_memcpy operations come for almost free:
struct dma_async_tx_descriptor *bcm2835_dma_prep_dma_memcpy(
struct dma_chan *chan, dma_addr_t dst, dma_addr_t src,
size_t len, unsigned long flags)
{
struct bcm2835_chan *c = to_bcm2835_dma_chan(chan);
struct bcm2835_desc *d;
u32 info = BCM2835_DMA_D_INC | BCM2835_DMA_S_INC;
u32 extra = BCM2835_DMA_INT_EN | BCM2835_DMA_WAIT_RESP;
/* if src, dst or len is not given return with an error */
if (!src | !dst | !len)
return NULL;
/* allocate the CB chain */
d = bcm2835_dma_create_cb_chain(chan, DMA_MEM_TO_MEM, false,
info, extra, 0,
src, dst, len, GFP_KERNEL);
if (!d)
return NULL;
return vchan_tx_prep(&c->vc, &d->vd, flags);
}
Similarly memset/memset_sg would also be easy to implement and probably interleaved as well. The question is if there are any "consumer" for those...
In the hope that this may be acceptable for upstream...
I still got I2S issues with the I2S clock so I can not verify if the cyclic code is working as expected (yet)...
bcm2835_dma_prep_dma_cyclic doesn't look correct, it doesn't seem to honor the period_len parameter.
See include/linux/dmaengine.h:
@device_prep_dma_cyclic: prepare a cyclic dma operation suitable for audio.
The function takes a buffer of size buf_len. The callback function will
be called after period_len bytes have been transferred.
Having a common function for CB allocation and basic setup makes sense, but we'll need an address/len setup function for cyclic DMA as well.
well, that is something I have missed... Thanks for pointing it out - I found it strange that I could do away with period_len...
still that "split I was talking about - especially about interrupts when you have one transfer that is only 4 bytes in size - seems relevant...
still that "split I was talking about - especially about interrupts when you have one transfer that is only 4 bytes in size - seems relevant...
Is this an issue with SG DMA as well?
For cyclic / audio DMA period-splitting isn't necessary IMO.
In my Cirrus Audio card tree I've set period_bytes_max in the I2S driver to 64k-4 (thus avoiding period splitting) more than a month ago and haven't heard any complaints - and it's working fine here in my tests as well.
https://github.com/HiassofT/rpi-linux/tree/cirrus-4.1.y http://www.element14.com/community/message/167412/l/re-driver-fixes-and-updates-to-kernel-31816-and-405#167412
I got a first working version, but for that I had to patch up the sound driver itself (plus a pr_info in the clock driver) so that It gets the address of the clock registers to control via a module parameter.
Obviously this is not acceptable, but fixing the driver to use the (new) clock-framework is not straight-forward...
My tests playing BigBunny on a SPI display plus I2S sound on the HifiBerry (both using DMA) seems to work as expected.
Here the interrupt counters showing that the interrupts are working fine:
root@raspcm:~# cat /proc/interrupts
CPU0
27: 43141 ARMCTRL-level 35 Edge timer
33: 4724186 ARMCTRL-level 41 Edge 20980000.usb, dwc2_hsotg:usb1
44: 15376 ARMCTRL-level 52 Edge DMA IRQ
45: 0 ARMCTRL-level 53 Edge DMA IRQ
48: 10617 ARMCTRL-level 56 Edge DMA IRQ
49: 10617 ARMCTRL-level 57 Edge DMA IRQ
73: 0 ARMCTRL-level 81 Edge 20200000.gpio:bank0
74: 0 ARMCTRL-level 82 Edge 20200000.gpio:bank1
78: 0 ARMCTRL-level 86 Edge 20204000.spi
81: 292 ARMCTRL-level 89 Edge uart-pl011
86: 229575 ARMCTRL-level 94 Edge mmc0
Err: 0
I will share the patches a bit later, as I need to start separating the patches out into separate commits to become acceptable.
Note here is the cumulative patch against upstream 4.4.0-rc6: https://github.com/msperl/linux-upstream/commit/2f67369d4368539483af43b3b89d264c6029ec09 I will start to split it into the corresponding chunks now...
Here the split version that comprises 5 patches: https://github.com/msperl/linux-upstream/commits/4.4.0-rc6-dmaengine-split maybe some of you find some time to test with other devices as well...
If I get positive feedback from you I will (try to) upstream this patchset...
Note that I have not ported the patchset back to the foundation kernel - I guess there will be conflicts...
@msperl Any chance you can provide a patch set on top of downstream rpi-4.4.y? Please.
it is not as easy as that because of all those extra patches applied to downstream...
@msperl Understood. Unfortunately, until 2836 is usable with upstream kernel, (which if I understand, based on recent messages from Eric on LKML, should happen in 4.5 timeframe), .........
@HiassofT I'm not sure where things are going to go from here, with the work you have done, compared to what Martin is doing for upstream, but just wanted to let you know that I haven't had any issues with your latest dma code. Over the Christmas period it was run (heavily tested) on various, (17 at last count), 2835/2836 hardware.... Zero, B, B+ and 2B's.... All with various different I2S devices attached, ranging from the run-of-the-mill PCM5102/512x "usual suspects" to the more exotic.....
Tested at all supported sample rates (depending on device) 44k1 through 192k, with 16/24/32 bit word lengths, and various ALSA buffer/period sizes......
I'm currently applying these 5 patches on top of rpi-4.4.y..... raspberrypi-kernel-bcm2835-dma_supports_a_maximum_transfer_length_of_64k-4_bytes.txt raspberrypi-kernel-Revertbcm2835-dmaLimit_cyclic_transfers_on_lite_channels.txt raspberrypi-kernel-bcm2835-dma__fix_calculation_of_cyclic_DMA_frames.txt raspberrypi-kernel-bcm2835-dma_HiassofT.txt raspberrypi-kernel-Revertbcm2835-dmaFix_dreq_not_set_for_slave_transfers.txt
@pelwell, @popcornmix: how would we want to run this "merge" in the end - there will definitely be quite a few conflicts, so I guess we may want to come up with a process right now before I try to create a patchset...
I guess it means essentially rolling back to the upstream version and create patches we will also upstream (I guess besides that legacy support thingy of 2D DMA inside X11 - until Erics KMS patches go in)
I guess it applies cleanly against 4.1 (I have not tested it yet), but 4.4 seems more work as there are lots of custom patches...
Here the patches:
So most of them can get removed without any impact (besides the early/legacy support patch), which should minimize the issue...
As for the out of tree patches by @HiassofT: I have not reviewed those, but I guess that lots of things would be replicated in the new patchset for upstream...
Note that I did use the latest 4.4.x and reverted all of the above patches (the ones that come with 4.4.X) and the patchset applies without any issues...
It compiles as is, but I have not done any functional tests.
The only thing lost is early initialization of DMAengine and legacy DMA support for the framebuffer, But - as said these can get added again (but I would recommend as separate patches)...
One note: with this patch it should be possible to set the period_len to any value - 64K (not 64K-4 or similar) or 128K should work as well - the big advantage here is reducing the number of interrupts...
I will provide a branch on github when the push has finished...
Assuming these patches will appear upstream, then the pi patches will eventually be rebased on top of upstream. So reverting any conflicting patches, then applying your patches, then applying any of the reverted patches we still need(*) would be the right sequence.
Once patches appear upstream, we'll just be left with the (*) patches in our tree.
As it seems the reverted patches will need some modifications to make them work - especially we will need to split 169eb5a - one patch has been added to the above branch to enable compilation - the others will need to follow and I am unsure if we have a good argument for any of those to get upstream.
We could try "early initialization" but people would ask about which driver would be using this early - unless we use dmaengine with the SD card driver and get that upstream...
The legacy support patch (169eb5a) can now get replaced by:
(not tested yet, but it compiles)
The one interesting fact is that the original patch allows the use of DMA channel 2 and 3.
At least it is not masked out explicitly - I wonder why this is the case, as this may mean we can remove these limitations upstream as a whole (because there is no bcm2708_fb and thus no need for legacy-DMA support), but for that we need to understand why we needed to mask those out at all, and what has changed since then...
Maybe someone has got any idea?
The gpu informs kernel of the unused (by gpu) dma channels. Currently that is dma.dmachans=0x7f35 (binary 0111_1111_0011_0101) So kernel can use channels 0,2,4,5,8,9,10,11,12,13,14.
So kernel shouldn't be using 3. It can use 2 (typically I think sdcard uses 2). The fb acceleration typically uses 0 (as that has larger buffers for faster memcpy).
That's correct, but the driver can't cope with the fact that channels 12, 13 & 14 share an interrupt, so the list of usable DMA channels is 0, 2, 4, 5, 8, 9, 10, 11 (notice that brcm,dma-channel-mask is 0x0f35 and that there are only 12 entries in the interrupts list - channels 0-11).
I don't know why @koalo put that restriction in the mainline driver and left it out in his downstream version: https://github.com/raspberrypi/linux/blob/rpi-3.10.y/drivers/dma/bcm2708-dmaengine.c Maybe because the downstream version used the legacy API to allocate channels, which got the channel mask from the firmware through the kernel commandline: dma.dmachans.
@popcornmix dma.dmachans isn't used by the bcm2835 driver only the DT property. I don't know if the firmware/bootloader sets that. We talked about it, but I don't know if it was implemented.
So I wonder why we still have this additional mask for upstream kernels. Not that the DT already contains:
mmc@7e300000 {
dmas = <&dma 11>,
<&dma 11>;
dma-names = "tx", "rx";
};
sdhost: sdhost@7e202000 {
dmas = <&dma 13>,
<&dma 13>;
dma-names = "tx", "rx";
};
so the drivers are using the dma-engine correctly and we do not need to mask them out...
as for the shared IRQ for 12-14 - the definitions in the code are:
#define BCM2835_DMA_BULK_MASK BIT(0)
#define BCM2835_DMA_FIQ_MASK (BIT(2) | BIT(3))
so these do not apply to 12-14.
The whole discussion shows that we do not need to map out anything UNLESS we use the legacy-API, in which case we assign it DMA-channel 0 via the BULK_MASK. So this means we can remove this restriction from upstream.
The remaining open point is still the "shared" interrupts, which - as of now - we need to mask out instead - so we need a different mask unless we can get the firmware to use one of those instead...
I guess we could use one of 12 to 14, but not all - or at least we may not use interrupts on those - a project for later (SPI could be using one of those channels for TX, as we only need the callback trigger on RX)...
So I create a patch for that to go upstream...
The firmware does put a value in the DT, but in the wrong property (/axi/dma/broadcom,channels). Unfortunately the driver isn't even smart enough to know that channels 12-14 are special, and adding that knowledge to the firmware is just wrong, so I ended up with a hard-coded restricted value in the DT.
Thanks for all the explanations and reminder of how the workarounds were implemented.
In summary:
So for legacy support we need this:
#if defined(CONFIG_DMA_BCM2708) || defined(CONFIG_DMA_BCM2708_MODULE)
/* Do not use the BULK channel (chan0) - it is used by legacy dma */
chans_available &= ~BCM2835_DMA_BULK_MASK;
#endif
and we forget about the FIQ MASK.
@HiassofT, @clivem: The branch mentioned above contains now the proposed upstream patchset - this can get used for some testing - I did check some of those additional patches and they look as if the things they fix are already patched in the new patchset. See also the note about the cyclic DMA and period_len that can be bigger than 65532, so you may want to test that as well.
I am trying to upstream drivers/dma/bcm2835-dma.c - especially the slave-portion.
One of the thing that turns up is that upstream wants changes to the code, which I can do.
The question is: how would it filter back to this branched code how can I help making that work smoothly?