mer-hybris / libgbinder

GLib-style interface to binder
BSD 3-Clause "New" or "Revised" License
50 stars 42 forks source link

BC_FREE_BUFFER not getting executed after processing oneway RIL indication #62

Open NotKit opened 3 years ago

NotKit commented 3 years ago

At some point in time (hours or days after boot) ofono gets stuck on Ubuntu Touch Android 9 adaptations and stops processing incoming async transactions. It results in more and more buffers being allocated for incoming indication transactions from RIL which do not reach ofono until there are no buffers left, which results in binder transaction failure with code -28 in dmesg:

binder: 2989:3012 transaction failed 29201/-28, size 6572-1304 line 3137

It seems to be caused by BC_FREE_BUFFER command not getting issued under rare circumstances which I have not been able to figure out yet. The way async_todo queue is handled in binder kernel driver, next oneway transaction is dequeued only after BC_FREE_BUFFER command is processed.

I have not seen this issue on SailfishOS, however it is common on Ubuntu Touch devices based on Android 9, even though it uses SailfishOS fork of ofono for those devices and I tested with libgbinder 1.1.5 on UT.

In attached binder_trace.log, the last oneway transaction processed by /dev/hwbinder looper is the following and has no matching binder_transaction_buffer_release: /dev/hwbinder#-13934 [005] ...1 68522.162937: binder_transaction_received: transaction=45804

It corresponds to 2021-02-19 17:21:25 [gbinder] < BC_TRANSACTION 0x00000002 0x00000082 in ofono.log. ofono_binder_error_28.tar.gz

From my understanding of libgbinder code, the buffers for incoming one-way transactions are supposed to be freed after remote request is unreferenced here: https://github.com/mer-hybris/libgbinder/blob/master/src/gbinder_driver.c#L498. Going through ibgbinder-radio and ofono-ril-binder-plugin code, I have not encountered anything that could increase reference count for it. Would you have any idea what else could this be related to?

monich commented 3 years ago

Please check if it's still happening with libgbinder 1.1.6, which touched quite a few things related to buffer management.

fredldotme commented 3 years ago

It took around 22 hours on the Volla phone to trigger the issue with libgbinder 1.1.6. The other test subject with both updated libgbinder and libgbinder-radio is still running though. Both have been started roughly at the same time.

mariogrip commented 3 years ago

As an additional note, we do use glib2 version 2.48.2, maybe bug there?

mariogrip commented 3 years ago

So mine failed also with both libgbinder and libgbinder-radio updated, so 1.1.6 sadly didnt fix the issue :(

@monich Maybe the glib2 version have something to say? what version do sailfish use?

monich commented 3 years ago

I very much doubt that this has anything to do with glib2, we have tried many different versions and never experienced this sort of problem. Most recently it's been 2.66.2. My feeling is that it's more likely to be related to the binder driver.

mariogrip commented 3 years ago

I very much doubt that this has anything to do with glib2, we have tried many different versions and never experienced this sort of problem. Most recently it's been 2.66.2. My feeling is that it's more likely to be related to the binder driver.

Reason why i ask, is that we have tried many different phones and chipsets and it happens on all, even tried the same phones with the same kernel and hal on sailfish but it does not happen there. All from kernel, hal, and the ofono stack is the same.

monich commented 3 years ago

Another way to try to pin down the root cause would be to try running ofono under valgrind (which is something I periodically do). If it's indeed something not being deallocated (and BC_FREE_BUFFER not being sent as a consequence) it must show up as a leak in valgrind. It would be nice to confirm that it's actually a refcount screw-up somewhere in ofono binder plugin.

mariogrip commented 3 years ago

Another way to try to pin down the root cause would be to try running ofono under valgrind (which is something I periodically do). If it's indeed something not being deallocated (and BC_FREE_BUFFER not being sent as a consequence) it must show up as a leak in valgrind. It would be nice to confirm that it's actually a refcount screw-up somewhere in ofono binder plugin.

I have ran this under valgrind for over 6 days now... no issues at all. So it does not seems to appear when its ran with valgrind... This devices fails within 24h when running without valgrind.

monich commented 3 years ago

I have ran this under valgrind for over 6 days now... no issues at all. So it does not seems to appear when its ran with valgrind... This devices fails within 24h when running without valgrind.

Hmm, this suggests some sort of a race condition (valgrind slowing things down, shifting timings sufficiently for the race to never occur). That may be a tough one to debug.

monich commented 3 years ago

Next step that I suggest to do is to reproduce it like @NotKit originally did, but this time with GUTIL_LOG_TID=1 (and libglibutil >= 1.0.51) on the ofono side. That will prefix each line of debug output with the thread id. I added this GUTIL_LOG_TID thing relatively recently: https://git.sailfishos.org/mer-core/libglibutil/commit/c707a15