mcb30 / dw1000

DecaWave DW1000 Linux kernel driver
16 stars 8 forks source link

TX data message failed: -22 #14

Open jonathanmuller opened 5 years ago

jonathanmuller commented 5 years ago

Hi,

I have cross-compiled your driver for a Yocto-linux embedded device (with no errors). However, when trying to send frames, I get an error (-22). Do you have an idea from where this could come ?

Bellow more details on the procedure and error :

1) I load the cross-compiled DW1000 module 2) Enable wpan0 interface :

ip link set wpan0 down ip link set wpan0 address 76:dd:3c:55:1c:43:9a:aa iwpan phy0 set channel 4 2 iwpan dev wpan0 set pan_id 0xCAFE iwpan dev wpan0 set short_addr 0x0001 ip link set wpan0 up iwpan dev

3) Send frames (ping) :

wpan-ping -a 11:22:33:44:55:66 -i wpan0

4) No frames are sent or received by my other device, and here is the result of dmesg :

[ 38.952695] dw1000_mod: loading out-of-tree module taints kernel. [ 38.958584] dw1000 spi0.0: device does not support GPIO RESET control [ 38.958643] dw1000 spi0.0: device does not support GPIO POWER control [ 38.977691] Initialization [ 38.977917] dw1000 spi0.0: found model 1.3.0 [ 38.981413] dw1000 spi0.0: set xtal trim 0x0f [ 38.984084] dw1000 spi0.0: set LDOTUNE 08:88:88:88:88 [ 38.986927] dw1000 spi0.0: has no permanent EUI-64 address [ 38.988047] dw1000 spi0.0: set antenna delays 0x0000,0x0000 [ 38.992475] dw1000 spi0.0: I read dev_id : 0xdeca [ 51.479224] dw1000 spi0.0: set channel 2 [ 51.501790] dw1000 spi0.0: promiscuous mode disabled [ 51.501989] dw1000 spi0.0: set PAN ID cafe [ 51.502180] dw1000 spi0.0: set EUI64 76dd3c551c439aaa [ 51.502366] dw1000 spi0.0: set short address 0001 [ 51.503086] dw1000 spi0.0: started [ 58.799961] random: crng init done [ 67.537920] dw1000 spi0.0: TX data message failed: -22 [ 68.046570] dw1000 spi0.0: TX data message failed: -22 [ 68.556604] dw1000 spi0.0: TX data message failed: -22 [ 69.066571] dw1000 spi0.0: TX data message failed: -22 [ 69.576597] dw1000 spi0.0: TX data message failed: -22 ...

Thank you and have a good day

mcb30 commented 5 years ago

On 13/11/2018 13:39, jonathanmuller wrote:

  1. No frames are sent or received by my other device, and here is the result of dmesg :

    [ 38.952695] dw1000_mod: loading out-of-tree module taints kernel. [ 38.958584] dw1000 spi0.0: device does not support GPIO RESET control [ 38.958643] dw1000 spi0.0: device does not support GPIO POWER control [ 38.977691] Initialization [ 38.977917] dw1000 spi0.0: found model 1.3.0 [ 38.981413] dw1000 spi0.0: set xtal trim 0x0f [ 38.984084] dw1000 spi0.0: set LDOTUNE 08:88:88:88:88 [ 38.986927] dw1000 spi0.0: has no permanent EUI-64 address [ 38.988047] dw1000 spi0.0: set antenna delays 0x0000,0x0000 [ 38.992475] dw1000 spi0.0: I read dev_id : 0xdeca [ 51.479224] dw1000 spi0.0: set channel 2 [ 51.501790] dw1000 spi0.0: promiscuous mode disabled [ 51.501989] dw1000 spi0.0: set PAN ID cafe [ 51.502180] dw1000 spi0.0: set EUI64 76dd3c551c439aaa [ 51.502366] dw1000 spi0.0: set short address 0001 [ 51.503086] dw1000 spi0.0: started [ 58.799961] random: crng init done [ 67.537920] dw1000 spi0.0: TX data message failed: -22 [ 68.046570] dw1000 spi0.0: TX data message failed: -22 [ 68.556604] dw1000 spi0.0: TX data message failed: -22 [ 69.066571] dw1000 spi0.0: TX data message failed: -22 [ 69.576597] dw1000 spi0.0: TX data message failed: -22 ...

I haven't seen that before. The only possibility I can think of is that wpan-ping is sending a zero-length packet and the SPI controller is complaining about being asked to transmit a zero-length data fragment.

Does the error go away if you specify a packet length (using the "-s" option for wpan-ping)?

Michael

jonathanmuller commented 5 years ago

I used different length with wpan-ping, all with the same result. Also tried raw socket with no more success.

I verified if spi was working at all, and basic commands (reading/setting pan_id for example) works fine. This is really strange.

I'll keep you updated if I find anything related to this.

mcb30 commented 5 years ago

Which kernel module is providing the SPI controller driver in your setup?

jonathanmuller commented 5 years ago

I'm not exactly sure which exact information you are expecting, I hope it is one of those :

The kernel used is the Linux 4.11 Fearless Coyote, driver spi-imx.

The driver name is "spi_imx" (spi-imx.c) It's compatible fields are

{ .compatible = "fsl,imx1-cspi", .data = &imx1_cspi_devtype_data, }, { .compatible = "fsl,imx21-cspi", .data = &imx21_cspi_devtype_data, }, { .compatible = "fsl,imx27-cspi", .data = &imx27_cspi_devtype_data, }, { .compatible = "fsl,imx31-cspi", .data = &imx31_cspi_devtype_data, }, { .compatible = "fsl,imx35-cspi", .data = &imx35_cspi_devtype_data, }, { .compatible = "fsl,imx51-ecspi", .data = &imx51_ecspi_devtype_data, }, { .compatible = "fsl,imx53-ecspi", .data = &imx53_ecspi_devtype_data, },

File avaible here : https://github.com/torvalds/linux/blob/v4.11/drivers/spi/spi-imx.c

mcb30 commented 5 years ago

It should take only an hour or so to narrow down the root cause using ftrace (or by adding printk() statements to the spi-imx.c driver), given access to the hardware. There aren't that many code paths in spi-imx.c that can generate EINVAL.

jonathanmuller commented 5 years ago

The moment it fails is in "dw1000_tx()" when calling "spi_async()" Exact line is : "

if ((rc = spi_async(dw->spi, &tx->data)) != 0) goto err_spi;

Right before the call to spi_async() :

Right after the call to spi_async():

Right after spin_unlock_irqrestore():

I also verified and the buffer correspond to what I'm sending from my application (wpan-ping / raw socket) with the right len and content.

But dw1000_tx_data_complete() is called before spin_unlock_irqrestore(), and the check catches the same error -22 :

if (unlikely(tx->data.status != 0)) {
        dev_err(dw->dev, "Write to SPI failed. Error: %s\n", strerror(errno));
        goto err_status;
    }

I also verified and both dw->spi and dw->dev are not null. What surprises me is that they are the same (same pointer value), is it expected ?

So, in "spi_async(dw->spi, &tx->data))", either "dw->spi" is incorrect, or "&tx->data" is.

Still investigating, I'm open for suggestions.

mcb30 commented 5 years ago

On 20/11/2018 09:53, jonathanmuller wrote:

But dw1000_tx_data_complete() is called before spin_unlock_irqrestore(), and the check catches the same error -22 :

|if (unlikely(tx->data.status != 0)) { dev_err(dw->dev, "Write to SPI failed. Error: %s\n", strerror(errno)); goto err_status; } |

I also verified and both dw->spi and dw->dev are not null. What surprises me is that they are the same (same pointer value), is it expected ?

Yes, that's expected. dw->dev is set to &dw->spi->dev. Since "dev" is the first field within struct spi_device, they will end up with the same pointer value.

Still investigating, I'm open for suggestions.

I would suggest either:

Michael

jonathanmuller commented 5 years ago

I followed the path up the the EINVAL and it originates from "spi_map_buf()" in spi.c

The responsible code is :

#ifdef CONFIG_HAS_DMA
static int spi_map_buf(struct spi_controller *ctlr, struct device *dev,
               struct sg_table *sgt, void *buf, size_t len,
               enum dma_data_direction dir)
{
    const bool vmalloced_buf = is_vmalloc_addr(buf);
    printk("Tried to alloc gave %d\n", vmalloced_buf);
    unsigned int max_seg_size = dma_get_max_seg_size(dev);
#ifdef CONFIG_HIGHMEM
    const bool kmap_buf = ((unsigned long)buf >= PKMAP_BASE &&
                (unsigned long)buf < (PKMAP_BASE +
                    (LAST_PKMAP * PAGE_SIZE)));
#else
    printk("Set kmap_buf to false\n");
    const bool kmap_buf = false;
#endif
    int desc_len;
    int sgs;
    struct page *vm_page;
    struct scatterlist *sg;
    void *sg_buf;
    size_t min;
    int i, ret;

    if (vmalloced_buf || kmap_buf) {
        desc_len = min_t(int, max_seg_size, PAGE_SIZE);
        sgs = DIV_ROUND_UP(len + offset_in_page(buf), desc_len);
    } else if (virt_addr_valid(buf)) {
        desc_len = min_t(int, max_seg_size, ctlr->max_dma_len);
        sgs = DIV_ROUND_UP(len, desc_len);
    } else {
        return -EINVAL;
    }

vmalloced_buf is 0 kmap_buf is 0 virt_addr_valid(buf) is 0

When checking the kmap_buf with the following code (which only expand the test : const bool kmap_buf = ((unsigned long)buf >= PKMAP_BASE && (unsigned long)buf < (PKMAP_BASE + (LAST_PKMAP * PAGE_SIZE))) )

    printk("kmap_buf was set to sth : %d because %lu >= %lu\n", 
    kmap_buf, (unsigned long)buf, PKMAP_BASE && (unsigned long)buf < (PKMAP_BASE +(LAST_PKMAP * PAGE_SIZE))     );
    printk("Individual : %lu %lu %lu %lu %lu\n", PKMAP_BASE, (unsigned long)buf, PKMAP_BASE, LAST_PKMAP, PAGE_SIZE);
    printk("Step 1 %lu\n", LAST_PKMAP);
    printk("Step 2 %lu\n", (LAST_PKMAP * PAGE_SIZE) );
    printk("Step 3 %lu\n", (PKMAP_BASE +(LAST_PKMAP * PAGE_SIZE))   );
    printk("Step 4 %lu\n", (unsigned long)buf < (PKMAP_BASE +(LAST_PKMAP * PAGE_SIZE))  );
    printk("Step 5 %lu\n", PKMAP_BASE && (unsigned long)buf < (PKMAP_BASE +(LAST_PKMAP * PAGE_SIZE))    );

I get :

[   42.564013] kmap_buf was set to sth : 0 because 3990826368 >= 0
[   42.564018] Individual : 3219128320 3990826368 3219128320 512 4096
[   42.564021] Step 1 512
[   42.564028] Step 2 2097152
[   42.564031] Step 3 3221225472
[   42.564034] Step 4 0
[   42.564037] Step 5 0

This is the reason why kmap_buf=0, but I suspect it should not

mcb30 commented 5 years ago

On 20/11/2018 15:30, jonathanmuller wrote:

vmalloced_buf is 0 kmap_buf is 0 virt_addr_valid(buf) is 0

The TX data messages comprises multiple SPI transfer sets, each of which has a header transfer and a data transfer. Which of these transfers is triggering the failure?

Something like this (completely untested) in dw1000_tx():

/* Initiate data SPI message */
{
    struct spi_transfer *xfer;
    list_for_each_entry(xfer, &tx->data.transfers,
                  transfer_list) {
        printk("tx_buf %p rx_buf %p len %x\n",
               xfer->tx_buf, xfer->rx_buf, xfer->len);
    }
}
if ((rc = spi_async(dw->spi, &tx->data)) != 0)
    goto err_spi;

One of the tx_buf or rx_buf addresses should match the "buf" value shown in your existing debug code.

Michael

jonathanmuller commented 5 years ago

When called from dw1000_tx() : [ 71.957979] tx_buf ecf2a638 rx_buf (null) len 1 [ 71.957988] tx_buf ecc88c27 rx_buf (null) len f [ 71.957996] tx_buf ecf2a6b4 rx_buf (null) len 1 [ 71.958004] tx_buf bf00424b rx_buf (null) len 1 [ 71.958012] tx_buf ecf2a730 rx_buf (null) len 1 [ 71.958021] tx_buf ecf2a5c4 rx_buf (null) len 1 [ 71.958029] tx_buf ecf2a7ac rx_buf (null) len 1 [ 71.958037] tx_buf bf00424c rx_buf (null) len 1 [ 71.958044] tx_buf ecf2a828 rx_buf (null) len 1

When called from __spi_map_msg() (which is the function calling spi_map_buf() ) : [ 71.958157] tx_buf ecf2a638 rx_buf ecae95c0 len 1 [ 71.958167] tx_buf ecc88c27 rx_buf ecae95c0 len f [ 71.958175] tx_buf ecf2a6b4 rx_buf ecae95c0 len 1 [ 71.958183] tx_buf bf00424b rx_buf ecae95c0 len 1 [ 71.958192] tx_buf ecf2a730 rx_buf ecae95c0 len 1 [ 71.958199] tx_buf ecf2a5c4 rx_buf ecae95c0 len 1 [ 71.958208] tx_buf ecf2a7ac rx_buf ecae95c0 len 1 [ 71.958215] tx_buf bf00424c rx_buf ecae95c0 len 1 [ 71.958224] tx_buf ecf2a828 rx_buf ecae95c0 len 1 [ 71.958232] tx_buf ecae91c0 rx_buf ecf2a5c5 len 1

So ecf2a5c5 is probably the address of the null pointer

The when printing in spi_map_buf() to see which message fails : tx_buf ecf2a638 rx_buf ecae95c0 len 1 : Tx works Rx works (virt_addr_valid(buf)=1) tx_buf ecc88c27 rx_buf ecae95c0 len f : Tx works Rx works (virt_addr_valid(buf)=1) tx_buf ecf2a6b4 rx_buf ecae95c0 len 1 : Tx works Rx works (virt_addr_valid(buf)=1) tx_buf bf00424b rx_buf ecae95c0 len 1 : Tx FAILS (virt_addr_valid(buf)=0)

So it fails because 4th message address is not a valid virtual address

mcb30 commented 5 years ago

Could you try the patch at #15 to see if this fixes the problem?

Thanks,

Michael

jonathanmuller commented 5 years ago

Patch #15 fixes the issue, thank you ! Frames are sent/received by the kernel without problems.

I will set the same configuration on an other device and verify they are effectively sent/received and then close the issue.

mcb30 commented 5 years ago

Thank you. Could you also check that the "sensors" command is able to read a power supply voltage and operating temperature from the dw1000 chip? (The same bug was present on the hwmon code path.)

jonathanmuller commented 5 years ago

Sure ! If I don't make mistakes, iwpan only have "list" avaible and not "ops" option. Do you have some code to test that ? (else I'm just going to fill dw1000_hwmon_read() function with the right arguments)

mcb30 commented 5 years ago

Just install the "lm_sensors" package and run "/usr/bin/sensors": that should be all you need to test it.

jonathanmuller commented 5 years ago

[ 414.665010] dw1000 spi0.0: temperature 0x8d (0x7d @ 23degC) is 37035mdegC [ 437.639843] dw1000 spi0.0: voltage 0xae (0xaa @ 3.3V) is 3323mV Seems to work fine.

However I was not able to wind all the parameters required to establish the connection with an other device (I have direct SPI access to the other) :