hathach / tinyusb

An open source cross-platform USB stack for embedded system
https://www.tinyusb.org
MIT License
4.91k stars 1.03k forks source link

Stack overflow with the new NCM driver #2711

Open showier-drastic opened 2 months ago

showier-drastic commented 2 months ago

Operating System

Linux

Board

Custom ESP32 S3

Firmware

My own, but this is not relevant - will explain below

What happened ?

The new USB NCM driver - introduced in #2227 - is causing a stack overflow.

It causes the problem like this:

tud_network_recv_renew will call into recv_transfer_datagram_to_glue_logic. recv_transfer_datagram_to_glue_logic will in turn call tud_network_recv_cb if there's data - and when the cb called, the recv_glue_ntb is not cleared yet.

The problem is that some tud_network_recv_cb, like the one in esp_tinyusb, will unconditionally call tud_network_recv_renew. So the whole thing happens again and again, causing infinite recursion.

So, it's either Espressif will need to change their code ( @ESP-Coco ) and documentation added in TinyUSB to prohibit calling tud_network_recv_renew directly within tud_network_recv_cb, or the logic within recv_transfer_datagram_to_glue_logic need to be changed to prevent this infinite recursion.

@HiFiPhile @rgrr could I ask for your opinions on this?

How to reproduce ?

If you are not using ESP32, call tud_network_recv_renew directly within tud_network_recv_cb like the ESP code mentioned above.

If you are using ESP32 with ESP-IDF, you can try to replace the src/class/net folder of the tinyusb component to the latest master branch of tinyusb.

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

1.txt

Screenshots

No response

I have checked existing issues, dicussion and documentation

showier-drastic commented 2 months ago

I think the current situation on reception handling is confusing and probably wrong. tud_network_recv_renew seems to be called twice in a single packet reception, the first time by tud_network_recv_renew_r which is in turn called by netd_xfer_cb, and another time by user logic when it finished processing the current packet. This also means recv_try_to_start_new_reception is called twice in a single reception.

showier-drastic commented 2 months ago

OK, I understand that we may process multiple NCM "subpackets" in one callback, and that's why we need to repeatedly call tud_network_recv_renew.

So, I propose a solution to this issue to add re-entrance detection logic to tud_network_recv_renew function.

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool processing = false;
  static bool should_process = false;

  should_process = true;
  if (processing) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }

  while (should_process) {
    should_process = false;
    processing = true;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
    processing = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew
rgrr commented 2 months ago

Hmmm... not getting the point. Do you think the problem is in the NCM driver or in the glue code?

Perhaps my own glue code could be of help?: https://github.com/rgrr/yapicoprobe/blob/master/src/net/net_glue.c

showier-drastic commented 2 months ago

I think it's in the NCM driver, which cause compatibility issues with older codes that calls tud_network_recv_renew directly within tud_network_recv_cb. And this can be fixed by my function above.

rgrr commented 2 months ago

ok... understood. Good point. Loop above seems to be too complicated for my eyes. Is something like this working:

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool should_process = false;    // should go into ncm_interface_t

  if (should_process) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }
  should_process = true;

  while (should_process) {
    should_process = false;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

Or will it end up again in an endless loop?

showier-drastic commented 2 months ago

No, this will not work, because when you call recv_transfer_datagram_to_glue_logic(), should_process is false, so the if (should_process) block won't be executed. Keep in mind that the second call to tud_network_recv_renew is inside recv_transfer_datagram_to_glue_logic.

rgrr commented 2 months ago

Hmmm... you are right (perhaps I need some time to rethink it ;-))

Perhaps a simple semaphore could do?:

void tud_network_recv_renew(void) {
  static bool sema = false;

  if ( !sema) {
    sema = true;

    TU_LOG_DRV("tud_network_recv_renew()\n");

    recv_transfer_datagram_to_glue_logic();
    sema = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew
showier-drastic commented 2 months ago

This will not work when there are 3 ntbs handled in one call to netd_xfer_cb.

I think either a loop or a recursion is needed in case tud_network_recv_renew is called directly within tud_network_recv_cb. One call to recv_transfer_datagram_to_glue_logic only process on ntb. As we need to process multiple ntbs, it's obvious that multiple calls to recv_transfer_datagram_to_glue_logic is needed. We don't want recursions (because that might cause stack overflow on MCUs), so this means we need loops.

rgrr commented 2 months ago

you are right (again)... perhaps it could be solved by letting recv_transfer_datagram_to_glue_logic() return an indication that some work has been done (or that there is some work still waiting)?

showier-drastic commented 2 months ago

Yes, but I think should_process in my code is this indication.

rgrr commented 2 months ago

got that, but to be honest I'm not happy with two variables for this logic.

rgrr commented 2 months ago

Could you please check PR #2713