Closed sonelu closed 2 years ago
Can you confirm my understanding that a self-built rpi-5.10.y (not upstream 5.10), the only change being a forward-port of the 4.19 sc16is7xx driver, behaves much better than an equivalent "official" rpi-5.10.y build?
And is that the sc16is7xx driver as found in rpi-4.19.y or the unmodified upstream version?
A fork of rpi-5.10.y with your patches would be ideal for testing.
Yes, it is the rpi-4.19.y version with only a few line updates to account for the changes related to struct sched_param
as implemented here in the rpi-5.10.y version.
The actual code is here.
A fork of rpi-5.10.y with your patches would be ideal for testing.
If the patch is between the forward-ported version (in the link above) and the current rpi-5.10.y then all the updates done in the last year to the driver in 5.10.y will be gone. I can do that but is that really useful?
Thanks - there are 15 commits that separate rpi-5.10.y from rpi-4.19.y:
911edccf93d0 sc16is7xx: Defer probe if device read fails
94a6e1d70b58 sc16is7xx: Fix for hardware flow control
5da6b1c079e6 sc16is7xx: Set iobase to device index
#28d2f209cd16 sched,serial: Convert to sched_set_fifo()
9eb90d57b55a sc16is7xx: Add flag to activate IrDA mode
2aa916e67db3 sc16is7xx: Read the LSR register for basic device presence check
2d12fc792cdd sc16is7xx: Allow sharing the IRQ line
6393ff1c4435 sc16is7xx: Use threaded IRQ
37f3965d74d5 sc16is7xx: Always use falling edge IRQ
4a37c0fcf5d4 serial: sc16is7xx: Replace zero-length array with flexible-array member
c53051128bb0 sc16is7xx: put err_spi and err_i2c into correct #ifdef
6b77908fea62 serial: sc16is7xx: Drop of_match_ptr() use
00d9f0c65401 serial: sc16is7xx: Switch to use device_get_match_data()
#24bc6e68efa0 serial: sc16is7xx: Respect clock-frequency property
84f1c5c0174a sc16is7xx: Use struct_size() in devm_kzalloc()
The commented out lines cause merge conflicts when reverted and are unlikely to be the cause of the problem. so I propose we ignore them for now.
If you don't mind doing some tests yourself - you are set up for it, whereas I'll struggle to scrape together the hardware - try this, after first getting the latest rpi-5.10.y - I'm on b657cd2f27d9:
diffs
- in the root of your rpi-5.10.y tree would be sensible.cut -c1-12 diffs | grep -v '#' | xargs git revert
git diff --name-only HEAD~13
. You can see the remaining differences with git diff origin/rpi-4.19.y -- drivers/tty/serial/sc16is7xx.c
.git bisect good
(answer 'y' or hit return).grep -v '#' diffs | tac | cut -c1-12 | xargs git cherry-pick
git diff origin/rpi-5.10.y
- should be none.git bisect bad
git bisect good
or git bisect bad
as appropriate. You should only need to run 3-4 tests to find the culprit.In case you haven't started yet, you get a slightly more comprehensible git log if you change step 6 from:
git log --oneline | head -13 | cut -c1-12 | xargs git revert
to:
grep -v '#' diffs | tac | cut -c1-12 | xargs git cherry-pick
I've edited the instructions above to avoid confusion.
Sorry for taking so long: I had trashed the kernel yesterday because of a mistake and had to rebuild from a backup from last week, plus I had some issues with the SD card and needed to change it.
Anyway, I will be doing the analysis above. Just that I noticed that if I try to replace the driver hot with:
$ sudo modprobe -r sc16is7xx
$ sudo cp <local> /lib/modules/.../serial/sc16is7xx.ko
$ sudo modprobe sc16is7xx
I get this in the dmesg
:
[ 2381.687987] irq: type mismatch, failed to map hwirq-23 for gpio@7e200000!
[ 2381.691676] sc16is7xx: probe of spi0.1 failed with error -22
Seems like the driver is not releasing the IRQ and when reinserting and it fails to reacquire it.
Should I raise this as a separate issue?
Should I raise this as a separate issue?
Yes please - just a brief issue. It will end up being a separate patch, ideally upstreamed, and having a separate issue makes it easier to track/harder to forget.
This is where the bisect analysis pointing to:
commit ee0be50a73e0fd49100583c9b0b2307c59db2005
Author: Daniel Mack <daniel@zonque.org>
Date: Thu May 21 11:11:50 2020 +0200
sc16is7xx: Use threaded IRQ
Use a threaded IRQ handler to get rid of the irq_work kthread.
This also allows for the driver to use interrupts generated by
a threaded controller.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Link: https://lore.kernel.org/r/20200521091152.404404-5-daniel@zonque.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
drivers/tty/serial/sc16is7xx.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
Thanks - that's a great help, although it suggests the problem is not going to be so easy to solve properly.
Can you try just with that one commit reverted? It isn't a clean revert, so revert all those local patches (git reset --hard origin/rpi-5.10.y
), download and apply this patch (git am <filename>
):
From 069db9c083b5eeed1773ab83510158766af1c431 Mon Sep 17 00:00:00 2001
From: Phil Elwell <phil@raspberrypi.com>
Date: Wed, 19 May 2021 17:15:38 +0100
Subject: [PATCH] Revert "sc16is7xx: Use threaded IRQ"
This reverts commit 6393ff1c4435acc343b1481f5b834b918cb42b12.
---
drivers/tty/serial/sc16is7xx.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/drivers/tty/serial/sc16is7xx.c b/drivers/tty/serial/sc16is7xx.c
index 8e3b83fac33c..c5a65b89d26b 100644
--- a/drivers/tty/serial/sc16is7xx.c
+++ b/drivers/tty/serial/sc16is7xx.c
@@ -328,6 +328,7 @@ struct sc16is7xx_port {
unsigned char buf[SC16IS7XX_FIFO_SIZE];
struct kthread_worker kworker;
struct task_struct *kworker_task;
+ struct kthread_work irq_work;
struct mutex efr_lock;
struct sc16is7xx_one p[];
};
@@ -713,9 +714,9 @@ static bool sc16is7xx_port_irq(struct sc16is7xx_port *s, int portno)
return true;
}
-static irqreturn_t sc16is7xx_irq(int irq, void *dev_id)
+static void sc16is7xx_ist(struct kthread_work *ws)
{
- struct sc16is7xx_port *s = (struct sc16is7xx_port *)dev_id;
+ struct sc16is7xx_port *s = to_sc16is7xx_port(ws, irq_work);
mutex_lock(&s->efr_lock);
@@ -730,6 +731,13 @@ static irqreturn_t sc16is7xx_irq(int irq, void *dev_id)
}
mutex_unlock(&s->efr_lock);
+}
+
+static irqreturn_t sc16is7xx_irq(int irq, void *dev_id)
+{
+ struct sc16is7xx_port *s = (struct sc16is7xx_port *)dev_id;
+
+ kthread_queue_work(&s->kworker, &s->irq_work);
return IRQ_HANDLED;
}
@@ -1240,6 +1248,7 @@ static int sc16is7xx_probe(struct device *dev,
mutex_init(&s->efr_lock);
kthread_init_worker(&s->kworker);
+ kthread_init_work(&s->irq_work, sc16is7xx_ist);
s->kworker_task = kthread_run(kthread_worker_fn, &s->kworker,
"sc16is7xx");
if (IS_ERR(s->kworker_task)) {
@@ -1338,16 +1347,14 @@ static int sc16is7xx_probe(struct device *dev,
* In case the interrupt controller doesn't support that, we fall
* back to a non-shared falling-edge trigger.
*/
- ret = devm_request_threaded_irq(dev, irq, NULL, sc16is7xx_irq,
- IRQF_TRIGGER_LOW | IRQF_SHARED |
- IRQF_ONESHOT,
- dev_name(dev), s);
+ ret = devm_request_irq(dev, irq, sc16is7xx_irq,
+ IRQF_TRIGGER_LOW | IRQF_SHARED | IRQF_ONESHOT,
+ dev_name(dev), s);
if (!ret)
return 0;
- ret = devm_request_threaded_irq(dev, irq, NULL, sc16is7xx_irq,
- IRQF_TRIGGER_FALLING | IRQF_ONESHOT,
- dev_name(dev), s);
+ ret = devm_request_irq(dev, irq, sc16is7xx_irq,
+ IRQF_TRIGGER_FALLING, dev_name(dev), s);
if (!ret)
return 0;
--
2.25.1
If you find that doesn't work, the next step is to delete/comment out the first call to devm_request_irq
(and the following if (!ret) return 0;
).
It's interesting:
putting the patch in full over the revered head will freeze the Raspberry Pi when I start the program that uses the UARTs. I had to physically turn the Pi off and on.
if I comment out the first call to devm_request_thereaded_irq
the driver runs but it's somewhere in between the "good" running and the "bad" running. Here is what I mean: the screenshoot shows the communication statistics (thousands of packets and %errors) when the driver works fine (ex. rpi-4.19.y):
"lower_body" are connected to ttySC1
and "upper_body" are connected to ttySC0
.
When the performance is bad (ex. rpi-5.10.y) it looks like this:
After applying the patch and commenting that lines the stats are like this:
Not sure if this is useful, but at the moment all communication is performed in a single thread that is sequentially executing first the actions on the "upper_body", then on "lower_body".
Do we have any news on this topic? I'm having some very similar problem on RPi3+ (aarch64), compared to RPi1 (arm).
Working: RPi1 (CM):
Linux rpi1 5.10.63+ #1457 Tue Sep 28 11:24:51 BST 2021 armv6l GNU/Linux
Not-working: RPi3 (CM): (missing bytes)
Linux rpi3 5.10.63-v8+ #1488 SMP PREEMPT Thu Nov 18 16:16:16 GMT 2021 aarch64 GNU/Linux
ping
@pelwell the is also #4711. br Paul
Investigating problems for random devices not used by Pi products is undertaken when time allows and the spirit moves us.
The sc16is7xx driver in rpi-5.10.y is very close to the upstream driver, the only differences being:
94a6e1d70b58 sc16is7xx: Fix for hardware flow control
b59e97d71a29 sc16is7xx: Don't spin if no data received
You can try reverting them and see if the matter improves. If not, it's an upstream issue.
P.S. The recent improvements in the upstream kernel have come from @mcdermj, who may have some insight on the problem.
Investigating problems for random devices not used by Pi products is undertaken when time allows and the spirit moves us.
The sc16is7xx driver in rpi-5.10.y is very close to the upstream driver, the only differences being:
94a6e1d70b58 sc16is7xx: Fix for hardware flow control b59e97d71a29 sc16is7xx: Don't spin if no data received
You can try reverting them and see if the matter improves. If not, it's an upstream issue.
I've tried driver approach without getting a solution, all my hints are to tty and kernel which is kind above what I can dive/try.
The only change I made upstream is to fix the probing of the driver. In the initial Raspberry Pi 5.10 release, the probe wasn't happening correctly on our board. This was because upstream added a check to see if the device was actually out there, where it never did before. But, instead of deferring the probe, they permanently failed it. This mean that on the Pi, where the i2c driver is a module, if the sc16is7xx driver loaded before the i2c driver, you'd never be able to probe the serial port. My change fixed that, and is present in upstream after the initial 5.10 release.
I'm not sure about performance issues. We probably wouldn't notice them in our board because we primarily use it to talk to a GPS unit. It runs on i2c and is at 4800/9600 baud, not the megabits/sec on SPI other folks here are talking about.
That being said, I'm encountering an issue with the December release where it appears that my serial port isn't working at all. I haven't had a chance to narrow this down to software/hardware yet, but hope to do so in the next couple of weeks.
@mcdermj: Thanks for the quick heads-up! Yes, I personally use it via SPI, and with at least 1MBps (hence the preference of SPI over I2C). 1MBps is only the current goal, we would go up to even 2 or 3, and we have 2 of these devices attached onto the PI, which would raise the bandwidth to even higher levels. I could definitely experience performance issues during firmware flashing onto a 3rd party device (which worked great on RPi1). However one thing is confusing me here: on my RPi1 it was 95% working with 5.10.63+. Is 5.10.y any newer than 5.10.63+?
See #4885 for a related issue, and a fixing patch (currently at https://github.com/raspberrypi/linux/commit/de9c45566dc4a14ac712a51015d1ee2f4b675f0a, but will move at the next rebase).
@pelwell: I absolutely love how great are you not just in fixes, but also in terms of information sharing. I would possibly never found that other thread, but now I also have the info. Thank you so much!
Earliest time for me to verify this will be around end of next week.
I will surely report my results here.
Hi folks! Well I am sure glad to see I am not the only one experiencing this issue. I have done a fair bit of debugging and I have some information to share. I have not solved it but I do think I have located a break in SPI communications that may be causing the problem.
Using Debian 10/Buster on a CM4 I went back through commits to the kernel and using rpi-update I found the point at which things broke on my system. It was the transition from 5.4 to 5.10 Specifically the last working version was: 5.4.83-v7l+ #1379 SMP Mon Dec 14 13:11:54 GMT 2020 armv7l GNU/Linux (via sudo rpi-update 43022f5adbbf5d6e05ea0e022a3090c2c9feff7c)
And the very next commit in the raspberrypi/rpi-firmware repo was: 5.10.1-v7l+ #1383 SMP Fri Dec 18 19:05:58 GMT 2020 armv7l GNU/Linux (via sudo rpi-update 367c84201db35a85979f1482ebd78e1eef910efc) This and all subsequent versions seem broken for me. I located the hashes on this page: https://github.com/raspberrypi/rpi-firmware/commits/master?after=9e99d81910722b6f2e086f6d14e592f292bcbb8b+139&branch=master
I may not have been reading the git history right, but it did not seem like there were any commits to the sc16is7xx driver in this time. But I cannot exactly tell because the breaking commit has thousands of new changes since it is a new kernel. However this makes me wonder if there is a broader problem with SPI and not just this driver.
I set up my test station so I can rapidly swap between good and bad boot directories and confirmed that is where things break.
I have been using a logic analyzer to look at transfers. Today I noticed that the problems begin when there is an apparent gap in SPI data transfer. I have arranged my bench test to always send the same data to the serial device such that it should always send the same data back. You can see in the following screenshots of the capture how at the beginning of the stream, things look well behaved. There is a pattern to the transfers as things are working properly. And then there is an error in the SPI data stream, an unexpected gap in SPI clocks, and when it resumes it does not pick up where it should, but picks up as if it had been clocking some data, and so the SPI transfer is garbled. In my case this seems to cause my sc16is752 to spew data continuously out the TX line for a while, and this ultimately breaks my communication protocol.
EDIT: Actually my reading of this transfer may be incorrect, since it is MISO not MOSI that has a change in bytes, and this may be due to the fact that RX data came in during the pause. Also the pause is not mid transfer as I was thinking, but in between transfers. I am still thinking this over but original message continues below.
Here you can see a zoomed out view, showing the regular pattern of data transfer until about 3792ms as indicated in the top middle of the screen.
This close up view shows the good block of eight SPI control bytes (in the middle) that would be expected just after the serial TX.
This close up view shoes the corrupted SPI transfer and bad control bytes just after a good TX.
Here is the full PulseView logic capture shown in the screenshots. PulseView is open source so you can download it to view the capture: spi_error_capture.zip It seems the protocol decoder does not get saved to the zip. Click this icon in PulseView to add SPI and UART protocol decoders:
We are using this chip for motor control on our open source farming robot and we have laid out a custom board with three of these chips controlled by SPI. It worked great before, so we are pretty motivated to help fix it. But I am not entirely sure where to look from here.
I did notice that there are other pauses in SPI, but shorter pauses do not seem to cause a problem. Is it expected that SPI would have pauses in communication like this? If so, it may not be buffering the data properly. For some pauses, the correct bits are clocked after the pause, but for the corrupting pause the bits are incorrect.
I originally ran it at 15MHz SPI clock speed, but I lowered it to 4MHz, then 1MHz, and finally 400kHz to see if that helped and to make it easier to capture with the logic analyzer. The problems persisted at all clock speeds. Running a CPU stress test in another terminal seems to have no affect on the problem.
Any advice would be appreciated!
I just noticed #4100, which affected SPI on 5.10. However I tried the kernel with that fix (43998c82a7e88df284b7aa30221b2d0d21b2b86a) and I still had the same issues. Someone may want to look more in to what was being fixed in that patch.
I looked more in to the 5.4 and 5.10 branches and discovered that the sc16is7xx driver in 5.10 was actually behind that of 5.4 until Jan 27 2021, the day before the above fix, so I believe I have now tested identical driver versions across the two different kernels. See: https://github.com/raspberrypi/linux/commits/rpi-5.4.y/drivers/tty/serial/sc16is7xx.c https://github.com/raspberrypi/linux/commits/rpi-5.10.y/drivers/tty/serial/sc16is7xx.c
But I did realize that the way I wrote my test program contributed to the irregular transfers. The "garbled" transfers in my above post were due to extra data building up in the serial buffer during the short SPI pauses. I have made some adjustments it seems that small pauses no longer mess it up. It does still sometimes read the wrong bytes however which is certainly a problem, so I will continue to debug and try to get a capture of that.
If anyone wants to write a protocol decoder for PulseView for this chip, it would be helpful. Especially if you are already familiar with the SPI protocol, as I am not. The decoder is python and seems relatively approachable. https://sigrok.org/wiki/Protocol_decoder_HOWTO
I was able to capture an error with my updated program, though the behavior looks similar to before my updates. I think I need to sleep on this. I will call it a night and see if anyone has any thoughts. updated_spi_capture.zip
Have you seen #4885? There's a fix for that in the latest rpi-update release that also fixed #4711.
Oh great! I saw #4885 but I did not realize that would fix it for SPI. I hadn't seen #4711. Thank you! I will try those and report back.
They latest rpi-update fixes it for me! I have a board with three sc16is752 chips and my code uses four of the six ports at once. With the new kernel it all works with no errors. Thanks everyone who made this happen!
I'm going to mark this as closed. For any further discussion, please use #4885.
Describe the bug The sc16is7xx in kernel 5.10 exhibits poor performance compared with the driver in 4.19.
To reproduce A board with an SC16IS762 is connected over SPI0. The board uses a 32Mhz crystal and is configured in the overlay at 15Mbps SPI communication speed. See the section Additional context below for more details. The communication is produced by a ROS node that commands 10 Dynamixel servos on one bus and 12 on the other at 2Mbps communication speed.
Expected behaviour No significant difference in performance between 5.10 and 4.19 versions.
Actual behaviour 5.10 version of the driver exhibits almost 10x more communication errors limiting severely the frequency of update for servos.
The comparison is performed by compiling the 4.19 driver on a Raspberry Pi running 5.10 kernel and replacing the
.ko
.System Copy and paste the results of the raspinfo command in to this section. Alternatively, copy and paste a pastebin link, or add answers to the following questions:
cat /etc/rpi-issue
)?vcgencmd version
)?uname -a
)?Logs There are messages issues in
dmesg
by both versions of the driver like this:But this seems to be significantly more happening in the 5.10 kernel driver.
Additional context The driver is used with a board that uses sc16is762 chip and connects over SPI0 with CE1. On the same SPI0 there is a TFT screen using ST7789 using CE0. Because the stock
dtb
for sc16is7xx does not support override for CE pin and the SPI communication baud I use a customdts
as follows:The focus is to handle high-speed communication over the 2 UART ports. The experiments performed with the 2 drivers use 2Mbps communication speeds for each port.
Also both ports are configured in RS485 mode as the board leverages the hardware flow control to enable the communication direction over a semi-duplex TTL bus. The setup is done in the software using
ioctl
:Do let me know in case I need to post additional information.