linux-can / can-utils

Linux-CAN / SocketCAN user space applications
2.36k stars 708 forks source link

canfdtest inject millisecond delay - high-speed are not tested at full bitrate #132

Open LucileQ opened 5 years ago

LucileQ commented 5 years ago

Hello,

When setting high speed bitrate (such as 500000 or 1Mbps), the actual bitrate between the dut and the generator caps around 208000 bps.

I don't understand why the canfdtest utility inject a millisecond both the generator and the dut. Could I remove this hardcoded behavior ?

On the dut run: canfdtest -v can0 On the Workstation run: time -p canfdtest -l 20000 -g -v can0

# canfdtest exchanges each message twice (in both direction) # messages are 8-bytes long (120 bits per CAN message) Then the actual bitrate is = 20000 120 2 / ${time}

marckleinebudde commented 5 years ago

The program is about testing full duplex, with sending and receiving frames interleaved. The program uses sleeps for this. Use cangen instead.

vladimiroltean commented 4 years ago

What part of CAN exactly is full duplex?

marckleinebudde commented 4 years ago

Hey @vladimiroltean,

the CAN bus itself is half duplex, only one CAN controller can send at a time. However the driver can pass a CAN frame for sending into the controller at any time (even if the bus is currently used by another controller). The CAN frame is delayed and tried to send when the bus is free automatically by the controller. This means from the drivers perspective and thus from the user space point of view the CAN bus is as it were full duplex.

The mentioned test (canfdtest) sends and receives CAN frames interleaved with various delays between rx and tx, testing that the controller and driver doesn't reorder frames.

vladimiroltean commented 4 years ago

The CAN frame is delayed and tried to send when the bus is free automatically by the controller.

All other half-duplex buses (including Ethernet) work in the same way. That doesn't make them full duplex though.

vladimiroltean commented 4 years ago

The mentioned test (canfdtest) sends and receives CAN frames interleaved with various delays between rx and tx, testing that the controller and driver doesn't reorder frames.

By the way, if canfdtest fails like this (on the flexcan driver), where should I start looking to debug this issue?

interface = can1, family = 29, type = 3, proto = 1
Databyte 0 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 1 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 2 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 3 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 4 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 5 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 6 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 7 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53

Test messages sent and received: 3147

When this happens, the driver statistics look to me like they're pretty "clean":

Statistics for run 1365:
9: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0 
    can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 
      bitrate 1000000 sample-point 0.750 
      tq 10 prop-seg 42 phase-seg1 32 phase-seg2 25 sjw 1
      flexcan: tseg1 2..96 tseg2 2..32 sjw 1..16 brp 1..1024 brp-inc 1
      flexcan: dtseg1 2..39 dtseg2 2..8 dsjw 1..4 dbrp 1..1024 dbrp-inc 1
      clock 200000000
      re-started bus-errors arbit-lost error-warn error-pass bus-off
      0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    RX: bytes  packets  errors  dropped overrun mcast   
    635354784  79419348 0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    317677392  39709674 0       0       0       0       
10: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
    link/can  promiscuity 0 
    can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0 
      bitrate 1000000 sample-point 0.750 
      tq 10 prop-seg 42 phase-seg1 32 phase-seg2 25 sjw 1
      flexcan: tseg1 2..96 tseg2 2..32 sjw 1..16 brp 1..1024 brp-inc 1
      flexcan: dtseg1 2..39 dtseg2 2..8 dsjw 1..4 dbrp 1..1024 dbrp-inc 1
      clock 200000000
      re-started bus-errors arbit-lost error-warn error-pass bus-off
      0          0          0          0          0          0         numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    RX: bytes  packets  errors  dropped overrun mcast   
    635354784  79419348 0       0       0       0       
    TX: bytes  packets  errors  dropped carrier collsns 
    317677392  39709674 0       0       0       0       
marckleinebudde commented 4 years ago

The CAN frame is delayed and tried to send when the bus is free automatically by the controller.

All other half-duplex buses (including Ethernet) work in the same way. That doesn't make them full duplex though.

Not quite, modern Ethernet (using twisted pair and switches or direct connections) is usually full duplex. Old Ethernet with (with BNC cables) is half duplex and the CSMA/CD while the CAN bus is CSMA/CA.

marckleinebudde commented 4 years ago

By the way, if canfdtest fails like this (on the flexcan driver), where should I start looking to debug this issue?

One CAN controller is flexcan, what is the other controller?

interface = can1, family = 29, type = 3, proto = 1
Databyte 0 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 1 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 2 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 3 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 4 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 5 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 6 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53
Databyte 7 mismatch !
expected: 0078: [8] 4b 4c 4d 4e 4f 50 51 52
received: 0078: [8] 4c 4d 4e 4f 50 51 52 53

Test messages sent and received: 3147

This means one CAN frame was lost or the order was changed. You have to figure out if it happened on the generator or the reflector.

vladimiroltean commented 4 years ago

The CAN frame is delayed and tried to send when the bus is free automatically by the controller.

All other half-duplex buses (including Ethernet) work in the same way. That doesn't make them full duplex though.

Not quite, modern Ethernet (using twisted pair and switches or direct connections) is usually full duplex. Old Ethernet with (with BNC cables) is half duplex and the CSMA/CD while the CAN bus is CSMA/CA.

Of course I was comparing CAN with half-duplex Ethernet and not with full-duplex... The point about CSMA/CD vs CSMA/CA is valid though.

vladimiroltean commented 4 years ago

One CAN controller is flexcan, what is the other controller?

FlexCAN on both sides.

This means one CAN frame was lost or the order was changed. You have to figure out if it happened on the generator or the reflector.

How can I do that?

marckleinebudde commented 4 years ago

FlexCAN on both sides.

Which Soc? Which kernel?

How can I do that?

Try running candump any,0:0,#FFFFFFFF -extA on both sides and have a look at the order.

vladimiroltean commented 4 years ago

Which Soc?

NXP LS1028A

Which kernel?

Currently something called LSDK 20.04: https://source.codeaurora.org/external/qoriq/qoriq-components/linux/tree/?h=LSDK-20.04-V5.4

Try running candump any,0:0,#FFFFFFFF -extA on both sides and have a look at the order.

I'll test again and have that running in the background, thanks. It takes a while to reproduce though.

marckleinebudde commented 4 years ago

Currently something called LSDK 20.04:

Okay, some NXP vendor kernel.

As the flexcan driver only uses one mailbox for TX, there'll be no reordering on the TX side. For RX it uses timestamps on the mailboxes, which is also fine. What load do you have on the system, while running the test? Maybe a dropped frame on the socket.

vladimiroltean commented 4 years ago

What load do you have on the system, while running the test? Maybe a dropped frame on the socket.

A stress --cpu 2 --hdd 2, actually. Could you please expand a bit more about the ways in which CPU load affects this application?

marckleinebudde commented 4 years ago

If canfdtest isn't scheduled the RX-queue in the socket will grow, because it cannot receive CAN frames as fast as they are received by the flexcan core. The socket queue will grow and eventually the kernel will drop CAN frames from that queue. That might explain your problem here.

vladimiroltean commented 4 years ago

Thing is, stress --cpu only does this, if you look at the source code:

int
hogcpu (void)
{
  while (1)
    sqrt (rand ());

  return 0;
}

That shouldn't pose much of an issue to the scheduler, and neither should stress --hdd:

int
hoghdd (long long bytes)
{
  long long i, j;
  int fd;
  int chunk = (1024 * 1024) - 1;        /* Minimize slow writing.  */
  char buff[chunk];

  /* Initialize buffer with some random ASCII data.  */
  dbg (stdout, "seeding %d byte buffer with random data\n", chunk);
  for (i = 0; i < chunk - 1; i++)
    {
      j = rand ();
      j = (j < 0) ? -j : j;
      j %= 95;
      j += 32;
      buff[i] = j;
    }
  buff[i] = '\n';

  while (1)
    {
      char name[] = "./stress.XXXXXX";

      if ((fd = mkstemp (name)) == -1)
        {
          err (stderr, "mkstemp failed: %s\n", strerror (errno));
          return 1;
        }

      dbg (stdout, "opened %s for writing %lli bytes\n", name, bytes);

      dbg (stdout, "unlinking %s\n", name);
      if (unlink (name) == -1)
        {
          err (stderr, "unlink of %s failed: %s\n", name, strerror (errno));
          return 1;
        }

      dbg (stdout, "fast writing to %s\n", name);
      for (j = 0; bytes == 0 || j + chunk < bytes; j += chunk)
        {
          if (write (fd, buff, chunk) == -1)
            {
              err (stderr, "write failed: %s\n", strerror (errno));
              return 1;
            }
        }

      dbg (stdout, "slow writing to %s\n", name);
      for (; bytes == 0 || j < bytes - 1; j++)
        {
          if (write (fd, &buff[j % chunk], 1) == -1)
            {
              err (stderr, "write failed: %s\n", strerror (errno));
              return 1;
            }
        }
      if (write (fd, "\n", 1) == -1)
        {
          err (stderr, "write failed: %s\n", strerror (errno));
          return 1;
        }
      ++j;

      dbg (stdout, "closing %s after %lli bytes\n", name, j);
      close (fd);
    }

  return 0;
}

I would like to test your hypothesis though. What are some CAN socket statistics counters I can check?