linux-surface / kernel

Linux kernel with modifications for Microsoft Surface devices.
Other
118 stars 33 forks source link

platform/surface: aggregator: Defer probing when serdev is not ready #152

Closed phreer closed 4 months ago

phreer commented 4 months ago

Failure of probing of SSAM serial hub was observed (https://github.com/linux-surface/linux-surface/issues/1271) from time to time due to race condition where the underlying uart device could be not ready at the point calling serdev_device_open(). Specifically, the failing path is

ssam_serial_hub_probe()
    => serdev_device_open()
        => ctrl->ops->open() // this is ttyport_open()
            => tty->ops->open() // this is uart_open()
                => tty_port_open()
                    => port->ops->activate() // this is uart_port_activate()
                        Find bit UPF_DEAD is set in uport->flags and fail with errno -ENXIO.

Retrying upon failure in opening serdev device would greatly alleviate this situation. Tested with my Surface Pro 5 rebooting for 10+ times and didn't see the issue anymore.

I admit this solution might be suboptimal, but it does help cure the issue that users lost their battery indicator or touchpad functionality randomly.

qzed commented 4 months ago

Thanks for the PR (and especially the breakdown of what goes wrong). But as you said, I think we can do better: Generally, in cases like this the driver should return EPROBE_DEFER. This tells the device driver core that the driver is not yet ready and missing some dependency, and that it should try again later. Could you try something like this?

if (status == -ENXIO)  // or whatever is returned by serdev_device_open() when it fails due to not being ready
    return -EPROBE_DEFER;
else if (status)
    return status;

Apart from me not knowing the specific error code, the reason I haven't added this yet is that I don't know where specifically this should be addressed. I.e., should serdev_device_open() return EPROBE_DEFER or should we check the error code and do it ourselves in the SAM driver?

Due to that, my current suggestion for a workaround would be to build-in the required modules (8250_dw and pinctrl ones) and avoid the race condition in that way (like Fedora and Ubuntu/Debian already do). This seems to have worked here and I was planning on doing that today. Right now, I'm also not sure whether this could also fail due to pinctrl dependencies not being present. Some users have reported that the buttons (soc-button-array) on Arch do not work due to that. And if pinctrl isn't there, the gpiod_count() or ssam_irq_setup() calls might fail too. So we probably want to build in at least those anyways.

If I remember correctly Hans de Goede mentioned that (on x86?) the pinctrl stuff should be always built in (or at least present in the initramfs). And since I'm already mentioning him:

@jwrdegoede: Maybe you know what the proper way of returning EPROBE_DEFER above is?

Edit: Looks like I've been mistaken and Fedora doesn't include all the pinctrl modules. But it looks like they are all present in the initramfs.

qzed commented 4 months ago

@phreer: I've opened a PR for building in the modules at https://github.com/linux-surface/linux-surface/pull/1426. In case we merge this before you get around to testing with EPROBE_DEFER, please make sure that you do not build in those modules.

phreer commented 4 months ago

So serdev_device_open() returning error is only part of the story, there might be somewhere else we get an error. OK, looks like https://github.com/linux-surface/linux-surface/pull/1426 is a more reasonable workaround for the time being. I'd love to investigate how to return EPROBE_DEFER thing properly and figure out a long-term solution for this issue :).

qzed commented 4 months ago

I think the proper way is probably a mix of both: Build in the pinctrl modules or have them in the initramfs (as a lot of stuff relies on them) but don't do that for the serial ones as they are a bit more specific. Then return EPROBE_DEFER (probably in the SAM driver) if that works to make it re-try if the serial driver isn't there yet.

phreer commented 4 months ago

I just got basic idea how deferred driver probing works in kernel and will play around with it soon, hopefully enable this functionality for SAM driver.

But I don't get the point why we still require modules like pinctrl to be built in even if deferred probing is supposed.

qzed commented 4 months ago

But I don't get the point why we still require modules like pinctrl to be built in even if deferred probing is supposed.

Returning EPROBE_DEFER for serdev_device_open() only helps for the serial driver. So anything that relies on the pinctrl drivers being present might still fail.

phreer commented 4 months ago

Yeah, of course. But what if we also return EPROBE_DEFER when resources like GPIO are not ready during ssam_serial_hub_probe()? IIUC, your point is that we can hardly get an exhaustive list of cases where we need defer probing instead of returning real error. Maybe we could just include all these cases gradually, starting from the most obvious ones.

jwrdegoede commented 4 months ago

@jwrdegoede: Maybe you know what the proper way of returning EPROBE_DEFER above is?

Generally speaking if you have some sort of get() function for a resource and the resource is not yet ready then you would expect that get() function to return -EPROBE_DEFER and the caller will just propagate it.

This is also where using dev_err_probe() is useful for logging errors from such get() functions during probe since it will turn the msg into a dbg message when the error code is -EPROBE_DEFER.

As for this specific case, where does the serdev the driver tries to open come from ? I would expect this to be created by the tty subsystem, specifically by acpi_serdev_register_devices() from drivers/tty/serdev/core.c based on some ACPI device having a UartSerialBus resource pointing to the port.

Since this serdev is instantiated at the time the tty subsystem is probing it I find it weird that when your serdev driver (I assume you are using a serdev driver?) tries to bind to it that it is not ready yet.

I think it might be best to discuss this on the linux-serial mailinglist. Feel free to Cc me.

qzed commented 4 months ago

@phreer

Yeah, of course. But what if we also return EPROBE_DEFER when resources like GPIO are not ready during ssam_serial_hub_probe()? IIUC, your point is that we can hardly get an exhaustive list of cases where we need defer probing instead of returning real error. Maybe we could just include all these cases gradually, starting from the most obvious ones.

I think that would be the most ideal solution. But pinctrl drivers are spread out quite a bit, so I'm not really sure what all would need to be done/checked here.

qzed commented 4 months ago

@jwrdegoede

As for this specific case, where does the serdev the driver tries to open come from ? I would expect this to be created by the tty subsystem, specifically by acpi_serdev_register_devices() from drivers/tty/serdev/core.c based on some ACPI device having a UartSerialBus resource pointing to the port.

Yes, the SAM device in ACPI links to UartSerialBusV2 resource, so that's where the serdev comes from.

Since this serdev is instantiated at the time the tty subsystem is probing it I find it weird that when your serdev driver (I assume you are using a serdev driver?) tries to bind to it that it is not ready yet.

It is a serdev driver, yes. From the reports I've got, building in 8250_dw and the intel LPSS drivers does provide a functioning workaround. So due to that, to me it looks like there is some stuff going on in parallel. Like some parts of serdev are registered so that our driver can probe, but the underlying provider of the port is not yet ready. I know too little of the underlying things at work here to say what exactly is happening though.

I will try to look into it a bit more and then send a mail to linux-serial. Thanks!

phreer commented 4 months ago

Hi @qzed, me again. As you suggested, I tried the way of deferring probing and it worked as expected.

Only deal with serdev case for now. Let's be patient and identify more race condition points in the future.

phreer commented 4 months ago

@qzed Sure! Do you think it is a good idea to add more messages upon error, e.g., gpiod_count? (As IMHO it's kind of lacking context in dmesg when something unexpected happen.)

qzed commented 4 months ago

@phreer I think that could be quite helpful, yes.

phreer commented 4 months ago

Hope those error messages in the new commit would help!

Edit: I tested these changes with my SP5, all looking good.

phreer commented 4 months ago

Hi @qzed , I've completed my code and would appreciate guidance on how to proceed further.

qzed commented 4 months ago

Hi, sorry for the delay. I've been a bit busy with day-time work the last days again.

I'm not a big fan of moving the controller init to the front, I'd prefer if we can keep the order (with smaller/lighter init stuff in the beginning). Just use the dev_ functions for logging with that. And thinking about that: I guess we also don't really need the ssam_err_probe() because serdev->dev is directly available here. Sorry for not noticing that earlier.

Apart from that I think it looks good.

I think you could also submit the EPROBE_DEFER patch as RFC to the pdx86 and linux-serial mailing lists to get the discussion started on that, if you want. I'm not sure when I will find the time for that, as I will be away for a large part of May and have some stuff to take care off before.

phreer commented 4 months ago

Sure, it makes sense to let light staffs go first so we can fail fast. I'm pleased to start a discuss the EPROBE_DEFER patch on mailing list. Lastly, I appreciate your assistance despite your busy schedule. Thank you!