mozilla / uniffi-rs

a multi-language bindings generator for rust
https://mozilla.github.io/uniffi-rs/
Mozilla Public License 2.0
2.86k stars 232 forks source link

Uniffi ForeignFuture inconsistenty dropped #2209

Open skeet70 opened 3 months ago

skeet70 commented 3 months ago

We've observed Uniffi ForeignFutures inconsistently being dropped. I'm assuming this is a bug because there's a test in the existing future fixtures that tries to cover it.

Here's an example failure with timing tracing all the way through the relevant calls. The timings are in epoch milliseconds. In this test the delay being cancelled is set to 200ms, and the local delay is set to 500ms, but in our experience it makes no difference.

Calling for cancel_delay from Java: 1723067197348
About to await. Time: 1723067197349
Delay in Java trait impl called: 1723067197349
About to abort. Time: 1723067197350
After the abort. Time: 1723067197351
Delay in Java trait impl finished executing: 1723067197549
Dropping handle 6. Time: 1723067197550
ForeignFutureFreeImpl called from test: 1723067197550
After local 500 ms delay: 1723067198051

You can see the abort (in rust code) happens correctly around 1ms after the Delay in Java trait called.

In the failing example pasted above, the drop in Rust (then in Java right after), doesn't happen until a bit after the 200ms delay in Java finishes, even though the abort happened long before.

In successful runs (on the same code commit), the Dropping handle and ForeignFutureFreeImpl are called within 2ms, and because of that the Delay in Java trait impl finished executing never happens (because the free in Java cancels the the future on its side).

Calling for cancel_delay from Java: 1723067029269
About to await. Time: 1723067029270
Delay in Java trait impl called: 1723067029270
About to abort. Time: 1723067029271
After the abort. Time: 1723067029271
Dropping handle 6. Time: 1723067029272
ForeignFutureFreeImpl called from test: 1723067029272
After local 500 ms delay: 1723067029773
After test scope: 1723067029773

I looked at the code I think is being executed in Rust and I don't understand why there's this different behavior in dropping, but it does definitively seem to be happening.

Worth noting that this is very hard to reproduce on our local machines, but pretty consistent in our CI. Have you seen any flakiness in these tests in your Circle CI? Given that the failure to call drop is happening in the Rust code, I don't see any reason it'd only be happening for Java bindings. I have to run for i in $(seq 1 1000); do echo $i; cargo t test_futures_fixtures || break; done locally, and the fastest I've seen it is 34 iterations, slowest so far 1200.

bendk commented 3 months ago

Can you link to the code that you're running with the printouts? I'm especially interested in where the "dropping handle" print statement is.

skeet70 commented 3 months ago

This is the bingen branch, at 8a14d1c.

This is the uniffi-rs fork code, at f15686a.

The "dropping handle" print statement is in the impl Drop for ForeignFuture, in the Rust code.

bendk commented 3 months ago

This test is really complex because you have a Rust future that's awaiting a foreign future. At the end of all of that, the foreign future gets dropped, which should cancel it. Based on the current logging, my best guess is the Rust future isn't completing and therefore never dropping the foreign future. Let's try to test that with some more logging:

skeet70 commented 3 months ago

I was getting CI tests going for that while running it locally, and got a local failure:

Calling for cancel_delay from Java: 1723151811926
rust_future_new:
    handle: Handle(140376982340096)
    time: 1723151811926
rust_future_poll:
    handle: Handle(140376982340096)
    time: 1723151811927
About to await. Time: 1723151811927
Delay in Java trait impl called: 1723151811927
About to abort. Time: 1723151811928
After the abort. Time: 1723151811928
Delay in Java trait impl finished executing: 1723151812128
rust_future_poll:
    handle: Handle(140376982340096)
    time: 1723151812128
Dropping handle 6. Time: 1723151812128
ForeignFutureFreeImpl called from test: 1723151812128
Finished awaiting foreign future: 1723151812128
rust_future_complete:
    handle: Handle(140376982340096)
    time: 1723151812129
rust_future_free:
    handle: Handle(140376982340096)
    time: 1723151812129
After local 500 ms delay: 1723151812629
Exception in thread "main" java.lang.AssertionError: 2 current delays != 1 delays before
    at TestFixtureFutures.main(TestFixtureFutures.java:275)

vs a successful run

Calling for cancel_delay from Java: 1723151800835
rust_future_new:
    handle: Handle(140327120421136)
    time: 1723151800835
rust_future_poll:
    handle: Handle(140327120421136)
    time: 1723151800837
About to await. Time: 1723151800837
Delay in Java trait impl called: 1723151800837
About to abort. Time: 1723151800838
After the abort. Time: 1723151800839
rust_future_poll:
    handle: Handle(140327120421136)
    time: 1723151800839
Dropping handle 6. Time: 1723151800839
ForeignFutureFreeImpl called from test: 1723151800839
Finished awaiting foreign future: 1723151800839
rust_future_complete:
    handle: Handle(140327120421136)
    time: 1723151800839
rust_future_free:
    handle: Handle(140327120421136)
    time: 1723151800839
After local 500 ms delay: 1723151801340

It looks like you're right, the aborted RustFuture isn't being polled to see if it even should be dropped till after the delay finishes in the foreign code.

bendk commented 3 months ago

Looks like we're getting somewhere. Let's add some more printouts:

skeet70 commented 3 months ago

Just starting to analyze this myself, but posting the log right away.

Calling for cancel_delay from Java: 1723218321826
rust_future_new:
    handle: Handle(140482947359088)
    time: 1723218321826
Polling Rust Future 140482947359088 from Java: 1723218321829
rust_future_poll:
    handle: Handle(140482947359088)
    time: 1723218321829
About to await. Time: 1723218321829
Delay in Java trait impl called: 1723218321829
About to abort. Time: 1723218321831
Scheduler wake called, invoking callback: 1723218321831
Rust Future Java callback called: 1723218321832
Rust Future Javacallback completed: 1723218321832
Scheduler finished callback: 1723218321833
After the abort. Time: 1723218321833
Delay in Java trait impl finished executing: 1723218322029
Finished polling Rust Future 140482947359088 from Java: 1723218322030
Polling Rust Future 140482947359088 from Java: 1723218322030
rust_future_poll:
    handle: Handle(140482947359088)
    time: 1723218322030
Dropping handle 6. Time: 1723218322030
ForeignFutureFreeImpl called from test: 1723218322030
Finished awaiting foreign future: 1723218322030
Rust Future Java callback called: 1723218322030
Rust Future Javacallback completed: 1723218322030
Finished polling Rust Future 140482947359088 from Java: 1723218322030
rust_future_complete:
    handle: Handle(140482947359088)
    time: 1723218322030
rust_future_free:
    handle: Handle(140482947359088)
    time: 1723218322031
After local 500 ms delay: 1723218322531
Exception in thread "main" java.lang.AssertionError: 2 current delays != 1 delays before
    at TestFixtureFutures.main(TestFixtureFutures.java:275)

The same level of logging on a success:

Calling for cancel_delay from Java: 1723218804268
rust_future_new:
    handle: Handle(140386243303888)
    time: 1723218804268
Polling Rust Future 140386243303888 from Java: 1723218804269
rust_future_poll:
    handle: Handle(140386243303888)
    time: 1723218804269
About to await. Time: 1723218804269
Delay in Java trait impl called: 1723218804269
About to abort. Time: 1723218804270
Scheduler wake called, invoking callback: 1723218804270
Rust Future Java callback called: 1723218804270
Rust Future Javacallback completed: 1723218804271
Finished polling Rust Future 140386243303888 from Java: 1723218804271
Scheduler finished callback: 1723218804271
After the abort. Time: 1723218804271
Polling Rust Future 140386243303888 from Java: 1723218804271
rust_future_poll:
    handle: Handle(140386243303888)
    time: 1723218804271
Dropping handle 6. Time: 1723218804271
ForeignFutureFreeImpl called from test: 1723218804271
Finished awaiting foreign future: 1723218804271
Rust Future Java callback called: 1723218804271
Rust Future Javacallback completed: 1723218804271
Finished polling Rust Future 140386243303888 from Java: 1723218804271
rust_future_complete:
    handle: Handle(140386243303888)
    time: 1723218804271
rust_future_free:
    handle: Handle(140386243303888)
    time: 1723218804271
After local 500 ms delay: 1723218804772
skeet70 commented 3 months ago

Added a bit more tracing to the handle on the rust side, some more info about continuation handle on the Java side:

Failure:

Calling for cancel_delay from Java: 1723224979890
rust_future_new:
    handle: Handle(140093111784656)
    time: 1723224979891
Polling Rust Future 140093111784656 from Java: 1723224979892
Calling extern poll function for RustFuture handle 140093111784656, continuationHandle 5 from Java: 1723224979892
rust_future_poll:
    handle: Handle(140093111784656)
    time: 1723224979892
About to await. Time: 1723224979893
Delay in Java trait impl called: 1723224979893
foreign_async_call for ForeignFuture handle 2: 1723224979893
About to abort. Time: 1723224979895
Scheduler wake called, invoking callback: 1723224979895
Rust Future Java callback called for continuationHandle 5: 1723224979895
Rust Future Java callback for continuationHandle 5 completed: 1723224979895
Scheduler finished callback: 1723224979895
After the abort. Time: 1723224979895
Delay in Java trait impl finished executing: 1723224980093
Finished polling Rust Future 140093111784656 from Java: 1723224980095
Polling Rust Future 140093111784656 from Java: 1723224980095
Calling extern poll function for RustFuture handle 140093111784656, continuationHandle 6 from Java: 1723224980095
rust_future_poll:
    handle: Handle(140093111784656)
    time: 1723224980095
Dropping handle 2: 1723224980095
ForeignFutureFreeImpl called from test: 1723224980095
Finished awaiting foreign future: 1723224980095
Rust Future Java callback called for continuationHandle 6: 1723224980095
Rust Future Java callback for continuationHandle 6 completed: 1723224980095
Finished polling Rust Future 140093111784656 from Java: 1723224980095
rust_future_complete:
    handle: Handle(140093111784656)
    time: 1723224980096
rust_future_free:
    handle: Handle(140093111784656)
    time: 1723224980096
After local 500 ms delay: 1723224980596
Exception in thread "main" java.lang.AssertionError: 2 current delays != 1 delays before
    at TestFixtureFutures.main(TestFixtureFutures.java:275)

Success:

Calling for cancel_delay from Java: 1723224971354
rust_future_new:
    handle: Handle(140578308838176)
    time: 1723224971354
Polling Rust Future 140578308838176 from Java: 1723224971355
Calling extern poll function for RustFuture handle 140578308838176, continuationHandle 5 from Java: 1723224971356
rust_future_poll:
    handle: Handle(140578308838176)
    time: 1723224971356
About to await. Time: 1723224971356
Delay in Java trait impl called: 1723224971356
foreign_async_call for ForeignFuture handle 2: 1723224971356
About to abort. Time: 1723224971357
Scheduler wake called, invoking callback: 1723224971357
Rust Future Java callback called for continuationHandle 5: 1723224971357
Rust Future Java callback for continuationHandle 5 completed: 1723224971357
Scheduler finished callback: 1723224971357
After the abort. Time: 1723224971357
Finished polling Rust Future 140578308838176 from Java: 1723224971357
Polling Rust Future 140578308838176 from Java: 1723224971357
Calling extern poll function for RustFuture handle 140578308838176, continuationHandle 6 from Java: 1723224971357
rust_future_poll:
    handle: Handle(140578308838176)
    time: 1723224971358
Dropping handle 2: 1723224971358
ForeignFutureFreeImpl called from test: 1723224971358
Finished awaiting foreign future: 1723224971358
Rust Future Java callback called for continuationHandle 6: 1723224971358
Rust Future Java callback for continuationHandle 6 completed: 1723224971358
Finished polling Rust Future 140578308838176 from Java: 1723224971358
rust_future_complete:
    handle: Handle(140578308838176)
    time: 1723224971358
rust_future_free:
    handle: Handle(140578308838176)
    time: 1723224971358
After local 500 ms delay: 1723224971859
bendk commented 3 months ago

This part sticks out to me. The rust code is invoking the future callback, the Java code sees the callback and completes the continuation handle, but for some reason finished polling rust future doesn't happen for 200ms after that.

I don't know enough about Java concurrency to understand what's going on, but it seems like the issue is that the continuation is being completed, but CompletableFuture.get() is not returning.

Scheduler wake called, invoking callback: 1723224979895
Rust Future Java callback called for continuationHandle 5: 1723224979895
Rust Future Java callback for continuationHandle 5 completed: 1723224979895
Scheduler finished callback: 1723224979895
After the abort. Time: 1723224979895
Delay in Java trait impl finished executing: 1723224980093
Finished polling Rust Future 140093111784656 from Java: 1723224980095
skeet70 commented 3 months ago

Yes, that's exactly where I'm at. I've been thinking about it for the last hour or so, I know a little about Java concurrency but clearly not enough to understand why that would be the case either. I added some thread related logging thinking that the thread blocked on get might be getting starved or itself blocked or something, but that doesn't look like it's the case either. The threads things are running on line up with my intuition of where they'd be.

skeet70 commented 3 months ago

I'm most suspicious of the pollingFuture.get call itself. get is supposed to block until it finishes, but I saw that some other things (the call back from Rust to delay for ex) were being executed on the same thread that the pollingFuture.get was supposedly blocking. I rewrote that to more explicitly block with:

do {} while (!pollingFuture.isDone());
pollingFuture.get();

So far running that code I haven't been able to get any failures in CI or locally. Going to give the local one a lot more runs to see. Still not exactly sure what this tells me other than that Java is deceiving me.

Pretty confident at this point that it's not a uniffi bug, this issue can probably be closed.