raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.01k stars 4.95k forks source link

Images returned over SPI (2) are corrupt #2160

Closed stephenstarkie closed 6 years ago

stephenstarkie commented 7 years ago

We have a camera module on our own custom PCB connected via SPI (and i2c) to Pi CM 1's. Previously, we were getting good images from the camera; good images. Since updating the kernel ( 4.4.13+ to 4.9.35+ ) we now seem to get noisy images; noisy images I am running the same C++ code side by side on two identical systems (apart from kernel update which worked before the update). I have seen mention of other issues with SPI and have tried the suggestions for zeroing the spi struct but it doesn't seem to help here. I have also tried python code to get the images and have the same problem. Also; we have an FPGA on the PCB from which I can get the version number using;

#setup SPI
s=spidev.SpiDev()
s.open(2,0)
s.mode=0
s.max_speed_hz=100000

s.writebytes([0xb0]) # get version
print s.readbytes(1)

Running this on the 4.4 kernel always produces 1 (the correct version), while on the updated kernel it sometimes returns 0 and sometimes 128.

pelwell commented 7 years ago

It looks like you are using SPI2 (it would have been nice of you to mention that). How have you enabled that interface?

The AUX SPI interfaces are inferior to SPI0, with smaller FIFOs and no DMA - is there a reason why you aren't using it?

One of the changes in 4.9 was the addition of an AUX IRQ driver, but it seems unlikely that would cause a catastrophic failure like this. However, that change could be reveted with a DT overlay if necessary. Do you have a logic analyser to see what is happening on the bus?

stephenstarkie commented 7 years ago

Apologies, I didn't realise it would make a difference (still relatively new to this); I have added something to the title to make it clearer. I have enabled spi2 using;

dtoverlay=spi2-1cs in \boot\config.txt

The chip select line is on GPIO pin 43 so no parameters on the overlay.

The AUX SPI interfaces are inferior to SPI0, with smaller FIFOs and no DMA - is there a reason why you aren't using it?

I must say I don't know why we aren't using SPI0 (a choice made by an electronics engineer; who is on holiday at the moment, unfortunately!); though we do have a touch screen on it.

However, that change could be reveted with a DT overlay if necessary.

OK, that might be worth trying, how would I go about doing that? I have read up on overlays and how to compile them but am not very familiar with writing them. Or is there a parameter that I can use for this?

Do you have a logic analyser to see what is happening on the bus?

I will see what we can do.

pelwell commented 7 years ago

An overlay like this might be enough:

/dts-v1/;
/plugin/;

/ {
    compatible = "brcm,bcm2835", "brcm,bcm2708", "brcm,bcm2709";

    fragment@0 {
        target = <&aux>;
        __overlay__ {
            interrupts;
        };
    };

    fragment@1 {
        target = <&spi1>;
        __overlay__ {
            interrupt-parent = <&intc>;
            interrupts = <1 29>;
        };
    };

    fragment@2 {
        target = <&spi2>;
        __overlay__ {
            interrupt-parent = <&intc>;
            interrupts = <1 29>;
        };
    };

    fragment@3 {
        target = <&uart1>;
        __overlay__ {
            interrupt-parent = <&intc>;
            interrupts = <1 29>;
        };
    };
};

Save it as aux-spi-revert-overlay.dts, compile and install with:

dtc -@ -I dts -O dtb -o aux-spi-revert.dtbo arch/arm/boot/dts/overlays/aux-spi-revert-overlay.dts
sudo cp aux-spi-revert.dtbo /boot/overlays

Don't worry about the warnings from dtc.

That ought to disable the new AUX interrupt controller driver and make the individual AUX block drivers share the AUX interrupt directly.

pelwell commented 7 years ago

You can check if the overlay applies correctly using sudo vcdbg log msg. Running:

pi@raspberrypi:~$ grep 59: /proc/interrupts

should return:

 59:        583          0          0          0  ARMCTRL-level  61 Edge      serial

if it works (ignore the second number which varies), and:

 59:       3569          0          0          0  ARMCTRL-level  61 Edge      bcm2835-auxirq

if it doesn't.

stephenstarkie commented 7 years ago

I tried what you suggested about interrupts; sudo vcdbg log msg shows that the new overlay is loaded, but when I run grep 59: /proc/interrupts I get nothing.

cat /proc/interrupts produces;

           CPU0
 17:         20  ARMCTRL-level   1 Edge      2000b880.mailbox
 18:          2  ARMCTRL-level   2 Edge      VCHIQ doorbell
 27:    3892622  ARMCTRL-level  35 Edge      timer
 40:          0  ARMCTRL-level  48 Edge      bcm2708_fb dma
 42:          0  ARMCTRL-level  50 Edge      DMA IRQ
 44:          0  ARMCTRL-level  52 Edge      DMA IRQ
 45:          0  ARMCTRL-level  53 Edge      DMA IRQ
 48:          0  ARMCTRL-level  56 Edge      DMA IRQ
 53:        120  ARMCTRL-level  61 Edge      202150c0.spi
 56:   15577561  ARMCTRL-level  64 Edge      dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1
 73:          0  ARMCTRL-level  81 Edge      20200000.gpio:bank0
 74:          0  ARMCTRL-level  82 Edge      20200000.gpio:bank1
 77:       1327  ARMCTRL-level  85 Edge      20205000.i2c, 20804000.i2c
 78:          6  ARMCTRL-level  86 Edge      20204000.spi
 81:        713  ARMCTRL-level  89 Edge      uart-pl011
 86:     269926  ARMCTRL-level  94 Edge      mmc0
195:          0  pinctrl-bcm2835  35 Edge      ads7846
FIQ:              usb_fiq
Err:          0

I have also looked at the signals using a logic analyser. There seems to be a problem with the camera FPGA sending back what should be a 1 but is being interpreted as a 128 or 0. However; I am seeing odd things coming from the FPGA, so before I add plots to this thread I would like to understand that a bit better as well. I will add plots once I understand them.

pelwell commented 7 years ago

OK - the assigned interrupt number (the first column) varies - it's the 4th column that matters. Try grep " 61 Edge" /proc/interrupts to see the difference.

stephenstarkie commented 7 years ago

So; yes it appears the overlay is being applied and stopping the interrupts. Without the overlay I get;

53: 0 ARMCTRL-level 61 Edge bcm2835-auxirq

With the overlay, I get;

53: 0 ARMCTRL-level 61 Edge 202150c0.spi

but both seem to give the same in terms of image quality coming back from the camera over SPI 2.

We tried a sanity check and moved everything to SPI0 on a development board with the same camera PCB. Initially this showed the problem as well. The logic analyser showed that we should be using rising edges on MOSI and falling edges on MISO, so I tried mode 1 on SPI0 and that fixed it. So, it appears there was a problem with our code (sorry about that!) However, we gleefully moved back to SPI2 with mode 1 and found the same problems (which stopped short our celebration!).

The following is a plot from trying to send a b0 to the FPGA (using python with mode 0) on 4.4, SPI 2.0 which should return the version number; 1, and does consistently (with any other mode the plots look very wrong, as expected);

spi2mode1-152

However, this is the plot doing the same on 4.9 (mode 0 or 1), which returns 0 or 128 (generally 128 when the FPGA leaves the MISO line high after transmission and 0 when it leaves it low (I'll talk to the FPGA code author about that, or learn some verilog and fix it!);

spi2mode1-152

So, it looks like there is something up in kernel 4.9, but slightly different from first thought. Of course, my assumption is that its my code

pelwell commented 7 years ago

Interesting. I'm not an SPI expert, but my interpretation of those traces is as follows:

  1. The clock line is normally low, so polarity/CPOL=0.
  2. MOSI changes on the trailing (falling for polarity 0) edge of SCL to be sampled on the leading (rising for polarity 0) edge, which corresponds to phase/CPHA=0.
  3. MISO changes on the leading edge of SCL, to be sampled on the trailing edge, i.e. phase/CPHA=1.

I don't know about other SPI controllers (and the SPI standard), but the controller in the Pi SoCs has a common CPHA for MOSI and MISO, so your trace is going to be problematic. Sampling (or changing) on the wrong edge would account for the 0x00/0x80 results. Are you using the same CPOL/CPHA for host and device?

And an observation - the CE/CS signal is being deasserted between the address write and the data read. Have you have set the cs_change flag to 1? This is rarely what you want - an older version of SPI-Py did this and caused problems with an RFID reader. See https://github.com/raspberrypi/linux/issues/1547#issuecomment-230170202 for an explanation.

None of this really accounts for the kernel-related differences, but it is questionable.

stephenstarkie commented 7 years ago

Thanks for all the help so far. Andy (electronics) is back (I think you might know him!) and is going to pick up the FPGA code; so we could probably change it to work with the same CPOL/CPHA on MOSI and MISO. I also changed the code to clock out and back chunks of data which means we don't de-assert CE/CS between write and read. I'll report the findings here.

JamesH65 commented 6 years ago

@stephenstarkie Do you have any results? Is this issue now able to be closed?

stephenstarkie commented 6 years ago

We are still trying to sort this one out. We think changing the FPGA code will sort it and we think that probably the raspberry pi SPI behaviour is correct. However there are a couple of things we are still checking. I hope we can bottom out and close this in the next day or so

JamesH65 commented 6 years ago

Thanks for the update, please close or update once you are ready.

JamesH65 commented 6 years ago

@stephenstarkie Any progress on this?

stephenstarkie commented 6 years ago

In the end we moved to SPI0, because we also moved the screen controller (new screen) to i2c, so our problem is resolved