mailbox read not using epoll can not be cancelled

peteasa commented 8 years ago

@olajep: More a comment than an issue that I have observed.. but I would appreciate your comment on this..

I noticed in epiphany.c the while loop that waits for an interrupt:

while (!atomic_read(&elink->mailbox_maybe_not_empty) || elink_mailbox_empty_p(elink)) { atomic_set(&elink->mailbox_maybe_not_empty, 0); ... elink_mailbox_irq_enable(elink); ... } Once the while loop has started if no mailbox message arrives then the loop will never terminate and can not be cancelled. The mailbox irq will remain enabled because every 100 msec it will be re-enabled.

In my prototype driver I used epoll and I suspect if you did the same then it would be possible to allow the mailbox read to be stopped gracefully as well as avoid a while loop.

If the cores are separately reset before the mailbox message is received then the epoll wait could be cancelled.

Peter.

olajep commented 8 years ago

Hi Peter,

while (!atomic_read(&elink->mailbox_maybe_not_empty) || elink_mailbox_empty_p(elink)) { atomic_set(&elink->mailbox_maybe_not_empty, 0); ... elink_mailbox_irq_enable(elink); ... } Once the while loop has started if no mailbox message arrives then the loop will never terminate and can not be cancelled. The mailbox irq will remain enabled because every 100 msec it will be re-enabled.

Well, it can be cancelled with Ctrl-C though so it's not THAT terrible.

In my prototype driver I used epoll and I suspect if you did the same then it would be possible to allow the mailbox read to be stopped gracefully as well as avoid a while loop.

Agreed, poll() support is needed here. Usually you have read() (w/ O_NONBLOCK, another program might steal the data) and poll() Look at the mailbox_read ioctl as a specialized read().

I will happily accept any patch that implements poll() (then you get poll() / epoll() / select()) in struct file_operations. Unfortunately I don't know when I'll have time to work on this again...

Thanks, Ola

peteasa commented 8 years ago

Hi Ola,

I might have time to help out with this.. but first I need to get a reliable system running. Right now my mailbox test fails with

[  427.200795] Unhandled fault: imprecise external abort (0x1406) at 0x00117038
[  427.207768] pgd = c0004000
[  427.210456] [00117038] *pgd=00000000
[  427.214021] Internal error: : 1406 [#1] PREEMPT SMP ARM
[  427.219223] Modules linked in:
[  427.222267] CPU: 0 PID: 297 Comm: kworker/0:1 Not tainted 4.4.0-parallella #1
[  427.229378] Hardware name: Xilinx Zynq Platform
[  427.233915] Workqueue: events elink_mailbox_irq_handler_bh
[  427.239362] task: ee80a540 ti: ee82a000 task.ti: ee82a000
[  427.244747] PC is at elink_mailbox_irq_handler_bh+0x2c/0x9c
[  427.250301] LR is at elink_mailbox_irq_handler_bh+0x14/0x9c
[  427.255856] pc : []    lr : []    psr: 60010013
[  427.255856] sp : ee82bef0  ip : 00000000  fp : ef814b00
[  427.267310] r10: ef390300  r9 : 00000000  r8 : 00000000
[  427.272520] r7 : ef81a600  r6 : ef814b00  r5 : ef390300  r4 : ef328244
[  427.279029] r3 : deadbeef  r2 : f1280000  r1 : 00000000  r0 : c085a8ec
[  427.285542] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  427.292657] Control: 18c5387d  Table: 2eb7404a  DAC: 00000051
[  427.298386] Process kworker/0:1 (pid: 297, stack limit = 0xee82a210)
[  427.304722] Stack: (0xee82bef0 to 0xee82c000)
[  427.309066] bee0:                                     ef814b00 ef390300 ef328244 ef390300
[  427.317228] bf00: ef814b00 c0038868 ef814b00 ef814b14 ee82a000 ef814b00 ef390318 ef814b14
[  427.325387] bf20: ee82a000 00000008 c0831b3c ef390300 ef814b00 c0038b80 c0800100 ef390300
[  427.333545] bf40: c0038b34 00000000 ef35ad40 ef390300 c0038b34 00000000 00000000 00000000
[  427.341705] bf60: 00000000 c003dfc8 f7fee773 00000000 ffbfeefa ef390300 00000000 00000000
[  427.349864] bf80: ee82bf80 ee82bf80 00000000 00000000 ee82bf90 ee82bf90 ee82bfac ef35ad40
[  427.358022] bfa0: c003deec 00000000 00000000 c000f738 00000000 00000000 00000000 00000000
[  427.366181] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  427.374341] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 fdff4b97 7ea7a83e
[  427.382520] [] (elink_mailbox_irq_handler_bh) from [] (process_one_work+0x118/0x3e4)
[  427.391967] [] (process_one_work) from [] (worker_thread+0x4c/0x4f4)
[  427.400039] [] (worker_thread) from [] (kthread+0xdc/0xf4)
[  427.407248] [] (kthread) from [] (ret_from_fork+0x14/0x3c)
[  427.414444] Code: e340300f e0823003 e5933000 f57ff04f (e3130001)
[  427.420517] ---[ end trace b91f20ace805ac4b ]---
Jul 29 12:26:28 parallella-hdmi user.alert kernel: [  427.200795] U[  427.426466] Unable to handle kernel paging request at virtual address ffffffec
[  427.439062] pgd = c0004000
[  427.441752] [ffffffec] *pgd=2fffd861, *pte=00000000, *ppte=00000000
[  427.448005] Internal error: Oops: 37 [#2] PREEMPT SMP ARM

I am wondering if there might be another issue perhaps with using mutex_lock rather than spin_lock_irqsave() in elink_mailbox_irq_handler_bh. The failure occurs at random times, sometimes after 4 interrupts sometimes after as may as 20 interrupts. Also deadbeef in r3 is a bit of a give away as that definitely comes from a bad access over the elink.

I made a massive difference and almost removed this problem by caching the rxcfg register for the elink

struct elink_device {
...
    union elink_rxcfg rxcfg;
...
}
...
static void elink_mailbox_irq_enable(struct elink_device *elink)
{
    elink->rxcfg.mailbox_irq_en = 1;
    reg_write(elink->rxcfg.reg, elink->regs, ELINK_RXCFG);
}

See https://github.com/peteasa/meta-parallella/blob/a6e8e77582c3000d07331ac950ea6ace2e02565b/recipes-kernel/linux/linux-parallella/4.4/0011-Cache-elink-rxcfg-reg.patch for the full patch and https://github.com/peteasa/examples/tree/903828157c98a4982b1baee450f99a5043087a03/epiphany/mailbox for the test.

Using a cache for rxcfg this avoids having to read the rxcfg register. So deadbeef never corrupts the rxcfg.

Now the test runs almost without failure..

Peter.

parallella / parallella-linux

mailbox read not using epoll can not be cancelled #12