Fun with flags, presented by bgrt_mtx_free()

shkolnick-kun / bugurtos

Breaking backward compatibility since 2010!!!

GNU General Public License v3.0

132 stars 9 forks source link

Fun with flags, presented by bgrt_mtx_free() #43

Closed beefdeadbeef closed 4 months ago

beefdeadbeef commented 6 months ago

Ok, not flags, but return values. Consider two processes, of equal lowest priority, one constantly looping over lock/free, and another, trying to interact with common shared resource -- classics. Both lock/free calls followed by asserts on BGRT_ST_OK return values -- which fails for me regularly on mtx_free(). First, BGRT_ST_EEMPTY, often. What is the practical meaning ? No other processes were blocked lately by this lock ? Why not return BGRT_ST_OK in that case and be done with it ? Otherwise, is it safe to treat EEMPTY as sort of OK ? Rare and scary one, BGRT_ST_SCALL -- same questions.

shkolnick-kun commented 6 months ago

Hi, @beefdeadbeef!

First, BGRT_ST_EEMPTY, often. What is the practical meaning ?

The meaning is we are trying to wake a process from a ditry sync object with empty queue . Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

Rare and scary one, BGRT_ST_SCALL -- same questions.

Something went wrong!

Could you post your test code here?

beefdeadbeef commented 6 months ago

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that.

https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

shkolnick-kun commented 6 months ago

OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:

BGRT_ST_EEMPTY is returned when there are no processes waiting to lock the mutex;
and BGRT_ST_OK is returned when there is at least one waiting process which eventually locks the mutex.

So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that.

https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:

one ,
two ,
three .

The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

beefdeadbeef commented 6 months ago

OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:
* `BGRT_ST_EEMPTY` is returned when there are no processes waiting to lock the mutex;

* and `BGRT_ST_OK` is returned when there is at least one waiting process which eventually locks the mutex.

So it seems my guess was close enough. Then again, why on earth it is nessesary to distinguish between these two ? How calling code can benefit from this ? What will be lost if bgrt_mtx_free() just returns BGRT_ST_OK in both cases, behaving like any other mutex out there ? Not that it is a great deal to check for both OK and EEMPTY in assert, but still.

So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that. https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:
* [one](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/sync.c#L839) ,

* [two](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/syscall.c#L101) ,

* [three](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/syscall.c#L171) .
The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

Sure, I could try with first one, as it happened quite often. As for 2/3, do you still have f4-disco around ? It seems it will be more productive if i just port this thing on f407, which i have too (not disco, but close)

beefdeadbeef commented 6 months ago

here you go: https://github.com/beefdeadbeef/ecmit/tree/wip/disco

make DEBUG=1 RTT=1, if you want verbose asserts (your probe must support RTT ofc), otherwise just make

beefdeadbeef commented 6 months ago

fw.lst.txt

Breakpoint 12, _do_int_scall (kblock=) at bgrt/kernel/kernel.c:93 93 uspd->scret = scret; => 0x0800ad38 <main.isra.0+1048>: 2308 movs r3, #8 0x0800ad3a <main.isra.0+1050>: f885 30c5 strb.w r3, [r5, #197] @ 0xc5

(gdb) p *uspd $1 = { scarg = 0x20002797 <memp_memory_POOL_1024_base+1999>, scnum = 21 '\025', scret = 0 '\000' } (gdb) x/4wx uspd->scarg 0x20002797 <memp_memory_POOL_1024_base+1999>: 0x007d7400 0x00006108 0x0001b000 0x0010fc20

shkolnick-kun commented 6 months ago

here you go: https://github.com/beefdeadbeef/ecmit/tree/wip/disco

make DEBUG=1 RTT=1, if you want verbose asserts (your probe must support RTT ofc), otherwise just make

Can't build:

libopencm3/mk/genlink-config.mk:65: libopencm3/lib/libopencm3_at32f40x.a library variant for the selected device does not exist.
  GENHDR  at32/f40x
  BUILD   lib/at32/f40x
  CC      adc.c
  CC      adc_common_v1.c
  CC      can.c
  CC      crs.c
  CC      dac_common_all.c
  CC      dac_common_v1.c
  CC      desig_common_all.c
  CC      desig_common_v1.c
  CC      dma_common_at32.c
  CC      dma_common_l1f013.c
  CC      exti_common_all.c
  CC      gpio.c
  CC      gpio_common_all.c
  CC      i2c_common_v1.c
  CC      iwdg_common_all.c
  CC      pwr_common_v1.c
  CC      rcc.c
  CC      rcc_common_at32.c
  CC      rcc_common_all.c
  CC      rtc.c
  CC      spi_common_all.c
  CC      spi_common_v1.c
  CC      timer_common_at32.c
  CC      timer_common_all.c
  CC      usart_common_all.c
  CC      usart_common_f124.c
  CC      usb.c
  CC      usb_control.c
  CC      usb_standard.c
  CC      usb_msc.c
  CC      usb_hid.c
  CC      usb_audio.c
  CC      usb_cdc.c
  CC      usb_midi.c
  CC      st_usbfs_core.c
  CC      st_usbfs_v1.c
  CC      vector.c
  CC      systick.c
  CC      scb.c
  CC      nvic.c
  CC      assert.c
  CC      sync.c
  CC      dwt.c
  AR      libopencm3_at32f40x.a
make: *** No rule to make target «fw.elf», needed by «all».  Stop.

shkolnick-kun commented 6 months ago

fw.lst.txt

Breakpoint 12, _do_int_scall (kblock=) at bgrt/kernel/kernel.c:93 93 uspd->scret = scret; => 0x0800ad38 <main.isra.0+1048>: 2308 movs r3, #8 0x0800ad3a <main.isra.0+1050>: f885 30c5 strb.w r3, [r5, #197] @ 0xc5

(gdb) p *uspd $1 = { scarg = 0x20002797 <memp_memory_POOL_1024_base+1999>, scnum = 21 '\025', scret = 0 '\000' } (gdb) x/4wx uspd->scarg 0x20002797 <memp_memory_POOL_1024_base+1999>: 0x007d7400 0x00006108 0x0001b000 0x0010fc20

In case of default syscall_handler table 20 is the last valid number.

Looks like memory corruption in memp_memory_POOL_1024_base.

That's why I don't like using dynamic memory: there is no reliable method to clarify what is allocated near the corrupted object and who has access to it...

beefdeadbeef commented 6 months ago

libopencm3/mk/genlink-config.mk:65: libopencm3/lib/libopencm3_at32f40x.a library variant for the selected device does > not exist. GENHDR at32/f40x it seems you're building master branch $ git clone --recursive https://github.com/beefdeadbeef/ecmit $ git checkout -b wip/disco origin/wip/disco $ make DEBUG=1 RTT=1 (first time twice, for notorious libopencm3 idiosyncrasy)

beefdeadbeef commented 6 months ago

fw.lst.txt Breakpoint 12, _do_int_scall (kblock=) at bgrt/kernel/kernel.c:93 93 uspd->scret = scret; => 0x0800ad38 <main.isra.0+1048>: 2308 movs r3, #8 0x0800ad3a <main.isra.0+1050>: f885 30c5 strb.w r3, [r5, #197] @ 0xc5 (gdb) p *uspd $1 = { scarg = 0x20002797 <memp_memory_POOL_1024_base+1999>, scnum = 21 '\025', scret = 0 '\000' } (gdb) x/4wx uspd->scarg 0x20002797 <memp_memory_POOL_1024_base+1999>: 0x007d7400 0x00006108 0x0001b000 0x0010fc20

In case of default syscall_handler table 20 is the last valid number.

Looks like memory corruption in memp_memory_POOL_1024_base.

That's why I don't like using dynamic memory: there is no reliable method to clarify what is allocated near the corrupted object and who has access to it...

Well, not much dynamic, just few pools of fixed size chunks -- but yes, that's possible. Still, memp_memory_POOL_1024 contains two 1k-sized chunks, upper one used for tcpip_thread process stack, and i see nothing incriminating so far: memp_memory_POOL_1024_base+1999 offset is within that range. As for scarg contents -- i'm not sure

shkolnick-kun commented 6 months ago

Well, not much dynamic, just few pools of fixed size chunks -- but yes, that's possible. Still, memp_memory_POOL_1024 contains two 1k-sized chunks, upper one used for tcpip_thread process stack, and i see nothing incriminating so far: memp_memory_POOL_1024_base+1999 offset is within that range. As for scarg contents -- i'm not sure

Yes, but the interesting thing is memory allocation for the calling process: https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L288

The uspd structure seats at the end of the struct bgrt_priv_proc_t. So if there is some stack like object next to the thread.proc then something can smash scret and then scnum...

beefdeadbeef commented 6 months ago

I do not think this is the case: see lines 23-25 of same file. sync, queue and proc objects are allocated in own private pools. (gdb) p memp_bgrt_proc.base $10 = (u8_t ) 0x2000767c "" (gdb) p &memp_memory_POOL_1024_base $12 = (u8_t ()[2059]) 0x20001fc8

beefdeadbeef commented 6 months ago

Yes, but the interesting thing is memory allocation for the calling process: https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L288

The uspd structure seats at the end of the struct bgrt_priv_proc_t. So if there is some stack like object next to the thread.proc then something can smash scret and then scnum...

Funny though, i've had in one of previous iterations just like that: combined 'struct bgrt_priv_proc_t + stack', and had smashed bottom of said struct with insufficient stack space. leftover of that apparently bad design is still on line 27 :]

shkolnick-kun commented 5 months ago

I do not think this is the case: see lines 23-25 of same file. sync, queue and proc objects are allocated in own private pools. (gdb) p memp_bgrt_proc.base $10 = (u8t ) 0x2000767c "" (gdb) p &memp_memory_POOL_1024_base $12 = (u8t ()[2059]) 0x20001fc8

Well... Then the only possible issue I can think of might be a system call leakage...

I've added one more status value to trace this

So it seems my guess was close enough. Then again, why on earth it is nessesary to distinguish between these two ? How calling code can benefit from this ?

And I also have addressed this issue.

Could you test the master branch?

beefdeadbeef commented 5 months ago

straight to bull's eye. not sure whose, though.

[00000000] mbox=20008064 size=0
[00000000] mutex=20007CC8
[00000000] thread[tcpip_thread]=200078DC fn=08009339 prio=31 stack=256
[00000000] idle(): init done
[00000001] mutex=20007CC8 st=0
[00000001] thread[ecm_input]=20007814 fn=08002939 prio=31 stack=128
[00000064] mutex=20007CC8 st=0
[000000c8] mutex=20007CC8 st=0
[0000012c] mutex=20007CC8 st=0
[00000190] mutex=20007CC8 st=0
[000001e0] altset_cb: wIndex: 1 wValue: 1
[000001e0] mutex=20007CC8 st=11
[000001e0] Assertion "st == BGRT_ST_OK" failed at line 88 in mk/lwip/bgrt/sys_arch.c

with debug line

LWIP_DEBUGF(SYS_ARCH_DEBUG, ("mutex=%p st=%d\n", mutex->mtx, st));

added just after bgrt_mtx_lock() here

first five are locks from same process, this one:

[000001e0] mutex=20007CC8 st=11

from another:

(gdb) bt
#0  0x0800929a in sys_mutex_lock.constprop.0 (mutex=<optimized out>) at mk/lwip/bgrt/sys_arch.c:88
#1  0x08001964 in tcpip_api_call (fn=<optimized out>, call=0x20017c98) at lwip/src/api/tcpip.c:478
#2  netifapi_netif_common (netif=0x20000000 <ecmif>, errtfunc=0x0, voidfunc=<optimized out>) at lwip/src/api/netifapi.c:313
#3  ecm_altset_cb (usbd_dev=<optimized out>, wIndex=1, wValue=1) at ecm.c:288
#4  0x08000314 in usb_standard_set_interface (usbd_dev=<optimized out>, req=0x200000b4 <st_usbfs_dev+80>, buf=<optimized out>, len=0x200000c0 <st_usbfs_dev+92>) at libopencm3/lib/at32/f40x/../../usb/usb_standard.c:315
#5  0x08000ff6 in usb_control_setup_read (usbd_dev=0x20000064 <st_usbfs_dev>, req=0x200000b4 <st_usbfs_dev+80>) at libopencm3/lib/at32/f40x/../../usb/usb_control.c:179
#6  0x08000de0 in st_usbfs_poll (dev=0x20000064 <st_usbfs_dev>) at libopencm3/lib/at32/f40x/../../stm32/common/st_usbfs_core.c:317
#7  0x08000202 in usbd_poll (usbd_dev=<optimized out>) at libopencm3/lib/at32/f40x/../../usb/usb.c:128
#8  usbd_intr (udev=<optimized out>) at ecm.c:358
#9  0x0800af50 in bgrt_vic_iterator (vic=<optimized out>) at bgrt/kernel/vint.c:181
#10 bgrt_kblock_do_work (kblock=<optimized out>) at bgrt/kernel/kernel.c:177
#11 bgrt_kblock_main (kblock=<optimized out>) at bgrt/kernel/kernel.c:215
#12 bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#13 main.isra.0 () at main.c:84
#14 0x08001402 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

shkolnick-kun commented 5 months ago

straight to bull's eye. not sure whose, though.

[00000000] mbox=20008064 size=0
[00000000] mutex=20007CC8
[00000000] thread[tcpip_thread]=200078DC fn=08009339 prio=31 stack=256
[00000000] idle(): init done
[00000001] mutex=20007CC8 st=0
[00000001] thread[ecm_input]=20007814 fn=08002939 prio=31 stack=128
[00000064] mutex=20007CC8 st=0
[000000c8] mutex=20007CC8 st=0
[0000012c] mutex=20007CC8 st=0
[00000190] mutex=20007CC8 st=0
[000001e0] altset_cb: wIndex: 1 wValue: 1
[000001e0] mutex=20007CC8 st=11
[000001e0] Assertion "st == BGRT_ST_OK" failed at line 88 in mk/lwip/bgrt/sys_arch.c

with debug line

LWIP_DEBUGF(SYS_ARCH_DEBUG, ("mutex=%p st=%d\n", mutex->mtx, st));

added just after bgrt_mtx_lock() here

first five are locks from same process, this one:

[000001e0] mutex=20007CC8 st=11

from another:

(gdb) bt
#0  0x0800929a in sys_mutex_lock.constprop.0 (mutex=<optimized out>) at mk/lwip/bgrt/sys_arch.c:88
#1  0x08001964 in tcpip_api_call (fn=<optimized out>, call=0x20017c98) at lwip/src/api/tcpip.c:478
#2  netifapi_netif_common (netif=0x20000000 <ecmif>, errtfunc=0x0, voidfunc=<optimized out>) at lwip/src/api/netifapi.c:313
#3  ecm_altset_cb (usbd_dev=<optimized out>, wIndex=1, wValue=1) at ecm.c:288
#4  0x08000314 in usb_standard_set_interface (usbd_dev=<optimized out>, req=0x200000b4 <st_usbfs_dev+80>, buf=<optimized out>, len=0x200000c0 <st_usbfs_dev+92>) at libopencm3/lib/at32/f40x/../../usb/usb_standard.c:315
#5  0x08000ff6 in usb_control_setup_read (usbd_dev=0x20000064 <st_usbfs_dev>, req=0x200000b4 <st_usbfs_dev+80>) at libopencm3/lib/at32/f40x/../../usb/usb_control.c:179
#6  0x08000de0 in st_usbfs_poll (dev=0x20000064 <st_usbfs_dev>) at libopencm3/lib/at32/f40x/../../stm32/common/st_usbfs_core.c:317
#7  0x08000202 in usbd_poll (usbd_dev=<optimized out>) at libopencm3/lib/at32/f40x/../../usb/usb.c:128
#8  usbd_intr (udev=<optimized out>) at ecm.c:358
#9  0x0800af50 in bgrt_vic_iterator (vic=<optimized out>) at bgrt/kernel/vint.c:181
#10 bgrt_kblock_do_work (kblock=<optimized out>) at bgrt/kernel/kernel.c:177
#11 bgrt_kblock_main (kblock=<optimized out>) at bgrt/kernel/kernel.c:215
#12 bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#13 main.isra.0 () at main.c:84
#14 0x08001402 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

According to back trace sys_mutex_lock was called from bgrt_kblock_main i.e. from the kernel thread and not from some process.

This led to undefined behaviour...

shkolnick-kun commented 5 months ago

I would recommend you to make a process for usbd_poll and to communicate with it by a counting semaphore.

Anyway thank you for the interesting puzzle!

beefdeadbeef commented 5 months ago

do not get cocky yet :) this was just startup code path, and we're not in main loop yet, stay tuned :]

shkolnick-kun commented 5 months ago

do not get cocky yet :) this was just startup code path, and we're not in main loop yet, stay tuned :]

This is the root of the problem:

BGRT_ISR(__usb_isr)
{
    nvic_disable_irq(__usb_irq);
    bgrt_vint_push_isr(&usbd_vint, &bgrt_kernel.kblock.vic);
}

Every NVIC_USB_LP_IRQ generates a virtual interrupt which leads to usbd_poll call from the kernel thread which leads to sys_mutex_lock and sys_mutex_unlock (and may be some more) calls each of which leads to a bgrt_syscall call which leads to UB.

beefdeadbeef commented 5 months ago

do not get cocky yet :) this was just startup code path, and we're not in main loop yet, stay tuned :]

This is the root of the problem:
BGRT_ISR(__usb_isr)
{
  nvic_disable_irq(__usb_irq);
  bgrt_vint_push_isr(&usbd_vint, &bgrt_kernel.kblock.vic);
}
Every NVIC_USB_LP_IRQ generates a virtual interrupt which leads to usbd_poll call from the kernel thread which leads to sys_mutex_lock and sys_mutex_unlock (and may be some more) calls each of which leads to a bgrt_syscall call which leads to UB.

Yes, that is clear. But: mutex ops in vint context were (mistakenly) added few days ago, and i've seen SCALLs earlier too. Now i've changed that to bgrt_map_t flag set from there and actual work in ecm_input proc. No other bgrt API except bgrt_atm* and bgrt_sem_free_cs is used in usb vint context. I'm currently stress-testing, but this might take hours, let 's see. On making usb_poll() ordinary proc -- no, that isn't option, for performance and timing reasons.

shkolnick-kun commented 5 months ago

There is definitely something wrong with the stack. The normal stack should have "bgrt_proc_terminate" on the bottom.

сб, 8 июн. 2024 г. в 23:16, beefdeadbeef @.***>:

(gdb) r Starting program: /home/lioka/src/mk/at32/f403/ecmit/fw.elf

Program received signal SIGTRAP, Trace/breakpoint trap. 0x08009366 in sys_mutex_lock.constprop.0 (mutex=) at mk/lwip/bgrt/sys_arch.c:88 88 LWIP_ASSERT("st == BGRT_ST_OK", st == BGRT_ST_OK); 0x0800935a <sys_mutex_lock.constprop.0+106>: 4a0b ldr r2, [pc, #44] @ (0x8009388 <sys_mutex_lock.constprop.0+152>) 0x0800935c <sys_mutex_lock.constprop.0+108>: 9300 str r3, [sp, #0] 0x0800935e <sys_mutex_lock.constprop.0+110>: 4808 ldr r0, [pc, #32] @ (0x8009380 <sys_mutex_lock.constprop.0+144>) 0x08009360 <sys_mutex_lock.constprop.0+112>: 2358 movs r3, #88 @ 0x58 0x08009362 <sys_mutexlock.constprop.0+114>: f7f8 f8c1 bl 0x80014e8 <printf> => 0x08009366 <sys_mutex_lock.constprop.0+118>: be00 bkpt 0x0000 (gdb) bt

0 0x08009366 in sys_mutex_lock.constprop.0 (mutex=) at mk/lwip/bgrt/sys_arch.c:88

1 0x0800947e in tcpip_timeouts_mbox_fetch (mbox=, msg=) at lwip/src/api/tcpip.c:105

2 tcpip_thread (arg=) at lwip/src/api/tcpip.c:142

3 0x08007fbc in bgrt_syscall (num=, arg=0x4f080006) at bgrt/arch/cm4f/gcc/bugurt_port.c:252

Backtrace stopped: previous frame identical to this frame (corrupt stack?)

[00000000] mbox=20008064 size=0 [00000000] mutex=20007CC8 [00000000] thread[tcpip_thread]=200078DC fn=08009415 prio=31 stack=256 [00000000] idle(): init done [00000001] thread[ecm_input]=20007814 fn=08002911 prio=31 stack=128 [000001e1] altset_cb: wIndex: 1 wValue: 1 [000001e2] cs_cb: bRequest: 43 wValue: 000c wIndex: 0000 len: 0 data=14 [00000203] cs_cb: bRequest: 43 wValue: 000e wIndex: 0000 len: 0 data=1a [00000214] cs_cb: bRequest: 43 wValue: 000e wIndex: 0000 len: 0 data=1a [00000219] iface is up [00003264] address: 172.22.22.133 [0000c670] mutex=20007CC8 st=8 [0000c670] Assertion "st == BGRT_ST_OK" failed at line 88 in mk/lwip/bgrt/sys_arch.c

sigh. where to look ?

— Reply to this email directly, view it on GitHub https://github.com/shkolnick-kun/bugurtos/issues/43#issuecomment-2156128217, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4JAXTTAPOOOEVKRPFL4JDZGNDBTAVCNFSM6AAAAABIZCNJJCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJWGEZDQMRRG4 . You are receiving this because you commented.Message ID: @.***>

shkolnick-kun commented 5 months ago

Hi, @beefdeadbeef!

Are there any news on the issue?

beefdeadbeef commented 5 months ago

Hi, @beefdeadbeef!

Are there any news on the issue?

I've got several stack traces like this, and no matter what i tried to do (raised stack spaces, different locations, etc), they all amost identical:

(gdb) bt
#0  0x08009362 in sys_mutex_lock.constprop.0 (mutex=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:88
#1  0x0800947a in tcpip_timeouts_mbox_fetch (mbox=<optimized out>,
    msg=<optimized out>) at lwip/src/api/tcpip.c:105
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x08007fb8 in bgrt_syscall (num=<optimized out>, arg=0x4f080006)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) bt
#0  0x0800932c in sys_mutex_lock.constprop.0 (mutex=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:86
#1  0x08009436 in tcpip_timeouts_mbox_fetch (mbox=<optimized out>,
    msg=<optimized out>) at lwip/src/api/tcpip.c:105
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x08007ff4 in bgrt_syscall (num=<optimized out>, arg=0x4f080006)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) bt
#0  0x080093a2 in sys_mutex_unlock.constprop.0 (mutex=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:97
#1  0x08009428 in tcpip_timeouts_mbox_fetch (mbox=<optimized out>,
    msg=<optimized out>) at lwip/src/api/tcpip.c:103
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x08007ff4 in bgrt_syscall (num=<optimized out>, arg=0x4f080006)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) bt
#0  0x0800931a in sys_sem_signal.constprop.0 (sem=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:85
#1  0x0800992c in tcpip_timeouts_mbox_fetch (mbox=<optimized out>, 
    msg=<optimized out>) at lwip/src/api/tcpip.c:103
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x08007ff4 in bgrt_syscall (num=<optimized out>, arg=0x4f080006)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) bt
#0  0x080084d2 in sys_mutex_unlock (mutex=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:120
#1  sys_unlock_tcpip_core () at mk/lwip/bgrt/sys_arch.c:47
#2  0x08008914 in tcpip_timeouts_mbox_fetch (mbox=<optimized out>,
    msg=<optimized out>) at lwip/src/api/tcpip.c:103
#3  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#4  0x08008098 in bgrt_syscall (num=<optimized out>, arg=0x64f0800)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) bt
#0  0x0800931e in sys_mutex_lock.constprop.0 (mutex=<optimized out>)
    at mk/lwip/bgrt/sys_arch.c:87
#1  0x0800944e in tcpip_timeouts_mbox_fetch (mbox=<optimized out>,
    msg=<optimized out>) at lwip/src/api/tcpip.c:105
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x08007fb4 in bgrt_syscall (num=<optimized out>, arg=0x4f080006)
    at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

especially arg= in bgrt_syscall (even rotated over once). I 'm out of ideas. And as LWIP has basically two concurrency models: (one with global mutex, another with callback queue), i'm going for latter, which works reliably for me.

beefdeadbeef commented 5 months ago

btw (gdb) x 0x0800064f 0x800064f : 0x034770e7

from libopencm3/lib/cm3/vector.c

maybe having more clewer handler than just while(1) there could be beneficial for bugurtos, no ?

shkolnick-kun commented 5 months ago

btw (gdb) x 0x0800064f 0x800064f : 0x034770e7

from libopencm3/lib/cm3/vector.c

maybe having more clewer handler than just while(1) there could be beneficial for bugurtos, no ?

This is intended to be used for fault debugging. You can parse fault status registers in this while(1).

shkolnick-kun commented 5 months ago

btw (gdb) x 0x0800064f 0x800064f : 0x034770e7

from libopencm3/lib/cm3/vector.c

maybe having more clewer handler than just while(1) there could be beneficial for bugurtos, no ?

BTW, could you use -Og instead of -Os while debugging? In such case you can get rid of <optimized out> thing.

beefdeadbeef commented 5 months ago

Here's with -Og and without -flto after some half hour running:

(gdb) r
Starting program: /home/lioka/src/mk/at32/f403/ecmit/fw.elf 

Program received signal SIGSEGV, Segmentation fault.
bgrt_priv_sync_wake (sync=0x4f85db0, proc=0x70b004eb, chown=71 'G') at bgrt/kernel/sync.c:801
801     owner = sync->owner;
=> 0x08011340 <bgrt_priv_sync_wake+16>: f8d4 6084   ldr.w   r6, [r4, #132]  @ 0x84
(gdb) bt
#0  bgrt_priv_sync_wake (sync=0x4f85db0, proc=0x70b004eb, chown=71 'G') at bgrt/kernel/sync.c:801
#1  0x0800fb14 in BGRT_SC_SYNC_WAKE_SR (va=<optimized out>) at bgrt/kernel/default/syscall_routines.h:256
#2  0x0800fbb2 in bgrt_priv_do_syscall (syscall_num=<optimized out>, syscall_arg=<optimized out>) at bgrt/kernel/syscall.c:176
#3  0x0800ef58 in _do_int_scall (kblock=kblock@entry=0x200080f0 <bgrt_kernel>) at bgrt/kernel/kernel.c:92
#4  0x0800f0f2 in bgrt_kblock_do_work (kblock=kblock@entry=0x200080f0 <bgrt_kernel>) at bgrt/kernel/kernel.c:185
#5  0x0800f11a in bgrt_kblock_main (kblock=0x200080f0 <bgrt_kernel>) at bgrt/kernel/kernel.c:215
#6  0x0801019e in bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#7  0x08000956 in main () at main.c:84
(gdb) f 4
#4  0x0800f0f2 in bgrt_kblock_do_work (kblock=kblock@entry=0x200080f0 <bgrt_kernel>) at bgrt/kernel/kernel.c:185
185             _do_int_scall(kblock);
   0x0800f0ec <bgrt_kblock_do_work+112>:    4620        mov r0, r4
   0x0800f0ee <bgrt_kblock_do_work+114>:    f7ff ff25   bl  0x800ef3c <_do_int_scall>
(gdb) p *kblock
$10 = {
  vic = {
    list = {
      item = {0x0 <repeats 32 times>},
      map = 0
    },
    prio = 32 ' '
  },
  hpmap = 0,
  lpmap = 0
}
(gdb) f 3
#3  0x0800ef58 in _do_int_scall (kblock=kblock@entry=0x200080f0 <bgrt_kernel>) at bgrt/kernel/kernel.c:92
92      scret = bgrt_priv_do_syscall(uspd->scnum, uspd->scarg);
   0x0800ef50 <_do_int_scall+20>:   f8d4 10c0   ldr.w   r1, [r4, #192]  @ 0xc0
   0x0800ef54 <_do_int_scall+24>:   f000 fe1d   bl  0x800fb92 <bgrt_priv_do_syscall>
(gdb) p *uspd
$11 = {
  scarg = 0x20002478 <memp_memory_POOL_1024_base+1940>,
  scnum = 17 '\021',
  scret = 0 '\000'
}
(gdb) x/4wx uspd->scarg
0x20002478 <memp_memory_POOL_1024_base+1940>:   0x0800fbd5  0x00000063  0x20002490  0x0000000e
(gdb)

it seems it went from here:

0800fbc0 <bgrt_syscall_var>:
 800fbc0:       b40f            push    {r0, r1, r2, r3}
 800fbc2:       b500            push    {lr}
 800fbc4:       b083            sub     sp, #12
 800fbc6:       ab05            add     r3, sp, #20
 800fbc8:       9301            str     r3, [sp, #4]
 800fbca:       f89d 0010       ldrb.w  r0, [sp, #16]
 800fbce:       a901            add     r1, sp, #4
 800fbd0:       f000 fb0e       bl      80101f0 <bgrt_syscall>
 800fbd4:       b003            add     sp, #12
 800fbd6:       f85d eb04       ldr.w   lr, [sp], #4
 800fbda:       b004            add     sp, #16
 800fbdc:       4770            bx      lr
 800fbde:       bf00            nop

shkolnick-kun commented 5 months ago

bgrt_priv_sync_wake (sync=0x4f85db0, proc=0x70b004eb, chown=71 'G') it seems it went from here: 0800fbc0 <bgrt_syscall_var>:

OK.

//We use variadic function to handle multiple arguments... 
//It is iplementation defined whether to use stack only or registers and stack...
bgrt_st_t bgrt_syscall_var(bgrt_syscall_t num, ...)
{
    bgrt_va_wr_t varg; //This variable is located on the stack...
    bgrt_st_t ret;

    va_start(varg.list, num); //Now all arguments are on the stack!
    ret = bgrt_syscall(num, (void *)&varg);
    va_end(varg.list);

    return ret;
}

//If the stack gets corrupted before calling the service routine...
BGRT_SC_SR(SYNC_WAKE,  bgrt_va_wr_t * va) /* ADLINT:SL:[W0031] not used*/
{
    va_list param;
    bgrt_sync_t * sync;
    BGRT_PID_T pid;
    bgrt_flag_t chown;

    //Then we shall retrieve garbage from the process stack here
    va_copy(param, va->list);
    sync  = (bgrt_sync_t *)va_arg(param, void *);
    pid   = (BGRT_PID_T  )va_arg(param, void *);
    chown = (bgrt_flag_t )va_arg(param, int  );
    va_end(param);

    //And then we shall call
    //bgrt_priv_sync_wake (sync=0x4f85db0, proc=0x70b004eb, chown=71 'G')
    return bgrt_priv_sync_wake(sync, BGRT_PID_TO_PROC(pid), chown);
}

Seems like this is a stack corruption...

beefdeadbeef commented 5 months ago

bgrt_st_t bgrt_syscall_var(bgrt_syscall_t num, ...)

bgrt_syscall_t in all bugurt_config_example.h is defined to unsigned char. afair last named argument in variadic function is subject to default promotion, to int in this case, and having there char is asking for troubles just like that, no ? let me check this ...

beefdeadbeef commented 5 months ago

bgrt_st_t bgrt_syscall_var(bgrt_syscall_t num, ...)
bgrt_syscall_t in all bugurt_config_example.h is defined to unsigned char. afair last named argument in variadic function is subject to default promotion, to int in this case, and having there char is asking for troubles just like that, no ? let me check this ...

in case you wondering -- nope, that's not it :] back to square one.

shkolnick-kun commented 5 months ago

bgrt_st_t bgrt_syscall_var(bgrt_syscall_t num, ...)
bgrt_syscall_t in all bugurt_config_example.h is defined to unsigned char. afair last named argument in variadic function is subject to default promotion, to int in this case, and having there char is asking for troubles just like that, no ? let me check this ...
in case you wondering -- nope, that's not it :] back to square one.

I know this for sure. Have you managed to find out what it is?

beefdeadbeef commented 5 months ago

OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:
* `BGRT_ST_EEMPTY` is returned when there are no processes waiting to lock the mutex;

* and `BGRT_ST_OK` is returned when there is at least one waiting process which eventually locks the mutex.
So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that. https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:
* [one](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/sync.c#L839) ,

* [two](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/syscall.c#L101) ,

* [three](https://github.com/shkolnick-kun/bugurtos/blob/master/kernel/syscall.c#L171) .
The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

third breakpoint, just before return BGRT_ST_SCALL

Program received signal SIGTRAP, Trace/breakpoint trap.
bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21) at bgrt/kernel/syscall.c:172
172         __asm__ __volatile__ ("bkpt #2");
=> 0x0800d074 <main.isra.0+1004>:   be02        bkpt    0x0002
(gdb) bt
#0  bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21)
    at bgrt/kernel/syscall.c:172
#1  _do_int_scall (kblock=<optimized out>) at bgrt/kernel/kernel.c:92
#2  bgrt_kblock_do_work (kblock=<optimized out>) at bgrt/kernel/kernel.c:185
#3  bgrt_kblock_main (kblock=<optimized out>) at bgrt/kernel/kernel.c:215
#4  bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#5  main.isra.0 () at main.c:92
#6  0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

(gdb) p *bgrt_kernel.sched.current_proc
$13 = {
  parent = {
    parent = {
      next = 0x20007f50 <memp_memory_bgrt_proc_base+208>,
      prev = 0x20017f10
    },
    list = 0x2000038c <bgrt_kernel+156>,
    prio = 31
  },
  flags = 99,
  base_prio = 31,
  lres = {
    counter = {0 <repeats 32 times>},
    map = 0
  },
  time_quant = 1,
  timer = 1,
  sync = 0x0,
  cnt_lock = 28151,
  pmain = 0x800b6bd <tcpip_thread>,
  sv_hook = 0x80089eb <sys_arch_dummy_cb>,
  rs_hook = 0x80089eb <sys_arch_dummy_cb>,
  arg = 0x20008afc <sys_arch_dummy_arg>,
  sstart = 0x20002a74 <memp_memory_POOL_1024_base+2052>,
  spointer = 0x200029e4 <memp_memory_POOL_1024_base+1908>,
  udata = {
    scarg = 0x20002a3c <memp_memory_POOL_1024_base+1996>,
    scnum = 21,
    scret = 0
  }
}

while most of proc fileds seem ok for me, proc.cnt_lock isn't -- and i have no bgrt_priv_proc_lock() at all. do you have any idea where this might come from ? no mutexes this time, just bunch of sem_lock/sem_free

shkolnick-kun commented 5 months ago

Hi!

Can you define some small BGRT_CONFIG_CNT_MAX and define BGRT_CONFIG_TEST?

This can help to track the counter leak...

пн, 24 июн. 2024 г., 00:51 beefdeadbeef @.***>:

OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:

BGRT_ST_EEMPTY is returned when there are no processes waiting to lock the mutex;

and BGRT_ST_OK is returned when there is at least one waiting process which eventually locks the mutex.

So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that.

https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:

one ,

two ,

three .

The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

third breakpoint, just before return BGRT_ST_SCALL

Program received signal SIGTRAP, Trace/breakpoint trap. bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21) at bgrt/kernel/syscall.c:172 172 asm volatile ("bkpt #2"); => 0x0800d074 <main.isra.0+1004>: be02 bkpt 0x0002 (gdb) bt

0 bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21)
at bgrt/kernel/syscall.c:172
1 _do_int_scall (kblock=) at bgrt/kernel/kernel.c:92

2 bgrt_kblock_do_work (kblock=) at bgrt/kernel/kernel.c:185

3 bgrt_kblock_main (kblock=) at bgrt/kernel/kernel.c:215

4 bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221

5 main.isra.0 () at main.c:92

6 0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

(gdb) p *bgrt_kernel.sched.current_proc $13 = { parent = { parent = { next = 0x20007f50 <memp_memory_bgrt_proc_base+208>, prev = 0x20017f10 }, list = 0x2000038c <bgrt_kernel+156>, prio = 31 }, flags = 99, base_prio = 31, lres = { counter = {0 <repeats 32 times>}, map = 0 }, time_quant = 1, timer = 1, sync = 0x0, cnt_lock = 28151, pmain = 0x800b6bd , sv_hook = 0x80089eb , rs_hook = 0x80089eb , arg = 0x20008afc , sstart = 0x20002a74 <memp_memory_POOL_1024_base+2052>, spointer = 0x200029e4 <memp_memory_POOL_1024_base+1908>, udata = { scarg = 0x20002a3c <memp_memory_POOL_1024_base+1996>, scnum = 21, scret = 0 } }

while most of proc fileds seem ok for me, proc.cnt_lock isn't -- and i have no bgrt_priv_proc_lock() at all. do you have any idea where this might come from ?

— Reply to this email directly, view it on GitHub https://github.com/shkolnick-kun/bugurtos/issues/43#issuecomment-2185305566, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4JAXXH7PVJGQIHIPPPYHLZI4RNHAVCNFSM6AAAAABIZCNJJCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBVGMYDKNJWGY . You are receiving this because you commented.Message ID: @.***>

shkolnick-kun commented 5 months ago

And yes, scnum of 21 is not with zero scret is another problem...

пн, 24 июн. 2024 г., 14:47 Noname Nothing @.***>:

Hi!

Can you define some small BGRT_CONFIG_CNT_MAX and define BGRT_CONFIG_TEST?

This can help to track the counter leak...

пн, 24 июн. 2024 г., 00:51 beefdeadbeef @.***>:
OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:

BGRT_ST_EEMPTY is returned when there are no processes waiting to lock the mutex;

and BGRT_ST_OK is returned when there is at least one waiting process which eventually locks the mutex.

So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that.

https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:

one ,

two ,

three .

The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

third breakpoint, just before return BGRT_ST_SCALL

Program received signal SIGTRAP, Trace/breakpoint trap. bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21) at bgrt/kernel/syscall.c:172 172 asm volatile ("bkpt #2"); => 0x0800d074 <main.isra.0+1004>: be02 bkpt 0x0002 (gdb) bt

0 bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21)
at bgrt/kernel/syscall.c:172
1 _do_int_scall (kblock=) at bgrt/kernel/kernel.c:92

2 bgrt_kblock_do_work (kblock=) at bgrt/kernel/kernel.c:185

3 bgrt_kblock_main (kblock=) at bgrt/kernel/kernel.c:215

4 bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221

5 main.isra.0 () at main.c:92

6 0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

(gdb) p *bgrt_kernel.sched.current_proc $13 = { parent = { parent = { next = 0x20007f50 <memp_memory_bgrt_proc_base+208>, prev = 0x20017f10 }, list = 0x2000038c <bgrt_kernel+156>, prio = 31 }, flags = 99, base_prio = 31, lres = { counter = {0 <repeats 32 times>}, map = 0 }, time_quant = 1, timer = 1, sync = 0x0, cnt_lock = 28151, pmain = 0x800b6bd , sv_hook = 0x80089eb , rs_hook = 0x80089eb , arg = 0x20008afc , sstart = 0x20002a74 <memp_memory_POOL_1024_base+2052>, spointer = 0x200029e4 <memp_memory_POOL_1024_base+1908>, udata = { scarg = 0x20002a3c <memp_memory_POOL_1024_base+1996>, scnum = 21, scret = 0 } }

while most of proc fileds seem ok for me, proc.cnt_lock isn't -- and i have no bgrt_priv_proc_lock() at all. do you have any idea where this might come from ?

— Reply to this email directly, view it on GitHub https://github.com/shkolnick-kun/bugurtos/issues/43#issuecomment-2185305566, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4JAXXH7PVJGQIHIPPPYHLZI4RNHAVCNFSM6AAAAABIZCNJJCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBVGMYDKNJWGY . You are receiving this because you commented.Message ID: @.***>

shkolnick-kun commented 5 months ago

I mean scnum of 21 is error and zero scret in this case too, as we initiate scret with BGRT_ST_B4SC before process to kernel context switch...

пн, 24 июн. 2024 г., 14:50 Noname Nothing @.***>:

And yes, scnum of 21 is not with zero scret is another problem...

пн, 24 июн. 2024 г., 14:47 Noname Nothing @.***>:
Hi!

Can you define some small BGRT_CONFIG_CNT_MAX and define BGRT_CONFIG_TEST?

This can help to track the counter leak...

пн, 24 июн. 2024 г., 00:51 beefdeadbeef @.***>:
OK! I've studied the sync code. The code was written long time ago so I haven't read it for several years. The following quote of my previous post is incorrect:

Dirty means some process is going to sleep but it did not yet. So that's a bug in the current mutex implementation. I'm fixing it now.

In fact when some process calls bgrt_mtx_free the mutex is owned by the calling process so:

BGRT_ST_EEMPTY is returned when there are no processes waiting to lock the mutex;

and BGRT_ST_OK is returned when there is at least one waiting process which eventually locks the mutex.

So now I think that returning BGRT_ST_EEMPTY is normal...

For BGRT_ST_SCALL, it usually takes several hours of work under load until fail, not sure if i could prepare reasonably small test for that.

https://github.com/beefdeadbeef/ecmit/blob/master/mk/lwip/bgrt/sys_arch.c#L97

Can you help me with further investigation? I don't have the board which can run ecmit so I'll be glad if you start a gdb session on your target board with the following breakpoints set:

one ,

two ,

three .

The thirst one means some mutex related bug others mean something else. A call stack and some object states may be needed to trace the issue.

third breakpoint, just before return BGRT_ST_SCALL

Program received signal SIGTRAP, Trace/breakpoint trap. bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21) at bgrt/kernel/syscall.c:172 172 asm volatile ("bkpt #2"); => 0x0800d074 <main.isra.0+1004>: be02 bkpt 0x0002 (gdb) bt

0 bgrt_priv_do_syscall (syscall_arg=0x20002a3c <memp_memory_POOL_1024_base+1996>, syscall_num=21)
at bgrt/kernel/syscall.c:172
1 _do_int_scall (kblock=) at bgrt/kernel/kernel.c:92

2 bgrt_kblock_do_work (kblock=) at bgrt/kernel/kernel.c:185

3 bgrt_kblock_main (kblock=) at bgrt/kernel/kernel.c:215

4 bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221

5 main.isra.0 () at main.c:92

6 0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:93

(gdb) p *bgrt_kernel.sched.current_proc $13 = { parent = { parent = { next = 0x20007f50 <memp_memory_bgrt_proc_base+208>, prev = 0x20017f10 }, list = 0x2000038c <bgrt_kernel+156>, prio = 31 }, flags = 99, base_prio = 31, lres = { counter = {0 <repeats 32 times>}, map = 0 }, time_quant = 1, timer = 1, sync = 0x0, cnt_lock = 28151, pmain = 0x800b6bd , sv_hook = 0x80089eb , rs_hook = 0x80089eb , arg = 0x20008afc , sstart = 0x20002a74 <memp_memory_POOL_1024_base+2052>, spointer = 0x200029e4 <memp_memory_POOL_1024_base+1908>, udata = { scarg = 0x20002a3c <memp_memory_POOL_1024_base+1996>, scnum = 21, scret = 0 } }

while most of proc fileds seem ok for me, proc.cnt_lock isn't -- and i have no bgrt_priv_proc_lock() at all. do you have any idea where this might come from ?

— Reply to this email directly, view it on GitHub https://github.com/shkolnick-kun/bugurtos/issues/43#issuecomment-2185305566, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4JAXXH7PVJGQIHIPPPYHLZI4RNHAVCNFSM6AAAAABIZCNJJCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBVGMYDKNJWGY . You are receiving this because you commented.Message ID: @.***>

beefdeadbeef commented 5 months ago

with BGRT_CONFIG_CNT_MAX defined to 32:

(gdb) bt
#0  0x08008936 in bgrt_priv_proc_lock () at bgrt/kernel/proc.c:290
#1  0x0800a4dc in _sem_lock_fsm (va=<optimized out>) at bgrt/libs/native/sem.c:149
#2  0x08008f46 in BGRT_SC_USER_SR (va=<optimized out>) at bgrt/kernel/default/syscall_routines.h:311
#3  0x0800cfa4 in bgrt_priv_do_syscall (syscall_arg=<optimized out>, syscall_num=<optimized out>) at bgrt/kernel/syscall.c:176
#4  _do_int_scall (kblock=<optimized out>) at bgrt/kernel/kernel.c:92
#5  bgrt_kblock_do_work (kblock=<optimized out>) at bgrt/kernel/kernel.c:185
#6  bgrt_kblock_main (kblock=<optimized out>) at bgrt/kernel/kernel.c:215
#7  bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#8  main.isra.0 () at main.c:89
#9  0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:9

and BGRT_CNT_INC got inlined:

08008918 <bgrt_priv_proc_lock>:
 8008918:           4b07        ldr     r3, [pc, #28]   @ (8008938 <bgrt_priv_proc_lock+0x20>)
 800891a:           f8d3 3090   ldr.w   r3, [r3, #144]  @ 0x90
 800891e:           691a        ldr     r2, [r3, #16]
 8008920:           f042 0220   orr.w   r2, r2, #32
 8008924:           611a        str     r2, [r3, #16]
 8008926:           f8d3 20a8   ldr.w   r2, [r3, #168]  @ 0xa8
 800892a:           2a1f        cmp     r2, #31
 800892c:       /-- d803        bhi.n   8008936 <bgrt_priv_proc_lock+0x1e>
 800892e:       |   3201        adds    r2, #1
 8008930:       |   f8c3 20a8   str.w   r2, [r3, #168]  @ 0xa8
 8008934:       |   4770        bx      lr
 8008936:       +-- e7fe        b.n     8008936 <bgrt_priv_proc_lock+0x1e>
 8008938:           200007a8    .word   0x200007a8

shkolnick-kun commented 5 months ago

with BGRT_CONFIG_CNT_MAX defined to 32:

(gdb) bt
#0  0x08008936 in bgrt_priv_proc_lock () at bgrt/kernel/proc.c:290
#1  0x0800a4dc in _sem_lock_fsm (va=<optimized out>) at bgrt/libs/native/sem.c:149
#2  0x08008f46 in BGRT_SC_USER_SR (va=<optimized out>) at bgrt/kernel/default/syscall_routines.h:311
#3  0x0800cfa4 in bgrt_priv_do_syscall (syscall_arg=<optimized out>, syscall_num=<optimized out>) at bgrt/kernel/syscall.c:176
#4  _do_int_scall (kblock=<optimized out>) at bgrt/kernel/kernel.c:92
#5  bgrt_kblock_do_work (kblock=<optimized out>) at bgrt/kernel/kernel.c:185
#6  bgrt_kblock_main (kblock=<optimized out>) at bgrt/kernel/kernel.c:215
#7  bgrt_start () at bgrt/arch/cm4f/gcc/bugurt_port.c:221
#8  main.isra.0 () at main.c:89
#9  0x080019e0 in reset_handler () at libopencm3/lib/at32/f40x/../../cm3/vector.c:9

and BGRT_CNT_INC got inlined:

08008918 <bgrt_priv_proc_lock>:
 8008918:           4b07        ldr     r3, [pc, #28]   @ (8008938 <bgrt_priv_proc_lock+0x20>)
 800891a:           f8d3 3090   ldr.w   r3, [r3, #144]  @ 0x90
 800891e:           691a        ldr     r2, [r3, #16]
 8008920:           f042 0220   orr.w   r2, r2, #32
 8008924:           611a        str     r2, [r3, #16]
 8008926:           f8d3 20a8   ldr.w   r2, [r3, #168]  @ 0xa8
 800892a:           2a1f        cmp     r2, #31
 800892c:       /-- d803        bhi.n   8008936 <bgrt_priv_proc_lock+0x1e>
 800892e:       |   3201        adds    r2, #1
 8008930:       |   f8c3 20a8   str.w   r2, [r3, #168]  @ 0xa8
 8008934:       |   4770        bx      lr
 8008936:       +-- e7fe        b.n     8008936 <bgrt_priv_proc_lock+0x1e>
 8008938:           200007a8    .word   0x200007a8

Thank you!

I've fixed _sem_lock_fsm.

beefdeadbeef commented 5 months ago

(gdb) r
Starting program: /home/lioka/src/mk/at32/f403/ecmit/fw.elf 

Program received signal SIGTRAP, Trace/breakpoint trap.
0x08009948 in sys_arch_mbox_fetch (msg=msg@entry=0x20002a68 <memp_memory_POOL_1024_base+2044>, timeout=39, mbox=<optimized out>) at mk/lwip/bgrt/sys_arch.c:218
218         LWIP_ASSERT("st == BGRT_ST_ROLL", st == BGRT_ST_ROLL);
   0x0800993c <sys_arch_mbox_fetch+80>: 4817        ldr r0, [pc, #92]   @ (0x800999c <sys_arch_mbox_fetch+176>)
   0x0800993e <sys_arch_mbox_fetch+82>: f8cd 9000   str.w   r9, [sp]
   0x08009942 <sys_arch_mbox_fetch+86>: 23da        movs    r3, #218    @ 0xda
   0x08009944 <sys_arch_mbox_fetch+88>: f7f8 f8fc   bl  0x8001b40 <printf_>
=> 0x08009948 <sys_arch_mbox_fetch+92>: be00        bkpt    0x0000
(gdb) bt
#0  0x08009948 in sys_arch_mbox_fetch (msg=msg@entry=0x20002a68 <memp_memory_POOL_1024_base+2044>, timeout=39, mbox=<optimized out>) at mk/lwip/bgrt/sys_arch.c:218
#1  0x0800b630 in tcpip_timeouts_mbox_fetch (mbox=<optimized out>, msg=<optimized out>) at lwip/src/api/tcpip.c:104
#2  tcpip_thread (arg=<optimized out>) at lwip/src/api/tcpip.c:142
#3  0x080097ac in bgrt_syscall (num=<optimized out>, arg=0x8000c2f <blocking_handler>) at bgrt/arch/cm4f/gcc/bugurt_port.c:252
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)  p *bgrt_kernel.sched.current_proc
$1 = {
  parent = {
    parent = {
      next = 0x20007f1c <memp_memory_bgrt_proc_base+208>,
      prev = 0x20007fec <memp_memory_bgrt_proc_base+416>
    },
    list = 0x20000844 <bgrt_kernel+156>,
    prio = 31
  },
  flags = 67,
  base_prio = 31,
  lres = {
    counter = {0 <repeats 32 times>},
    map = 0
  },
  time_quant = 1,
  timer = 1,
  sync = 0x0,
  cnt_lock = 0,
  pmain = 0x800b5e1 <tcpip_thread>,
  sv_hook = 0x800891f <sys_arch_dummy_cb>,
  rs_hook = 0x800891f <sys_arch_dummy_cb>,
  arg = 0x200088f4 <sys_arch_dummy_arg>,
  sstart = 0x20002a70 <memp_memory_POOL_1024_base+2052>,
  spointer = 0x200029bc <memp_memory_POOL_1024_base+1872>,
  udata = {
    scarg = 0x20002a10 <memp_memory_POOL_1024_base+1956>,
    scnum = 21,
    scret = 8
  }
}

so, cnt_lock issue is gone indeed. as for the rest, i'll try to catch something useful later.

beefdeadbeef commented 4 months ago

Good news, everyone! I've managed to fix bug in my code (tldr i've f***d up systick/pendsv priorities) and now it works as intended. Thanks for your patience and ofc for bugurtos. --> two cups of tea here <--

shkolnick-kun commented 4 months ago

Good news, everyone! I've managed to fix bug in my code (tldr i've f***d up systick/pendsv priorities) and now it works as intended. Thanks for your patience and ofc for bugurtos. --> two cups of tea here <--

Good! You are welcome!