projectfluent / fluent-rs

Rust implementation of Project Fluent
https://projectfluent.org
Apache License 2.0
1.08k stars 97 forks source link

Handle state changes mid async formats #228

Closed zbraniecki closed 3 years ago

zbraniecki commented 3 years ago

This is a draft attempt to address the problem encountered when working on using fluent-fallback in Gecko.

While all other oranges in taskcluster seem to be just races, there is one that is a crash - https://treeherder.mozilla.org/jobs?repo=try&revision=b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1&selectedTaskRun=M_G5cJS4SieOKV6ehgN8tQ.0

The relevant part of the stack is:

[task 2021-06-18T22:24:49.099Z] 22:24:49     INFO -  8  libxul.so!core::option::expect_none_failed [option.rs:88f19c6dab716c6281af7602e30f413e809c5974 : 1329 + 0x65]
[task 2021-06-18T22:24:49.100Z] 22:24:49     INFO -     rbx = 0x00007fdaec196c60   rbp = 0x00007fffc2975890
[task 2021-06-18T22:24:49.100Z] 22:24:49     INFO -     rsp = 0x00007fffc2975360   r12 = 0x0003001100000000
[task 2021-06-18T22:24:49.101Z] 22:24:49     INFO -     r13 = 0x0000000000000001   r14 = 0x00007fdacfdea340
[task 2021-06-18T22:24:49.101Z] 22:24:49     INFO -     r15 = 0x00007fdaca776000   rip = 0x00007fdae11dc082
[task 2021-06-18T22:24:49.102Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.102Z] 22:24:49     INFO -  9  libxul.so!localization_add_res_id [lib.rs:b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1 : 426 + 0x26]
[task 2021-06-18T22:24:49.103Z] 22:24:49     INFO -     rbx = 0x00007fdaec196c60   rbp = 0x00007fffc2975890
[task 2021-06-18T22:24:49.103Z] 22:24:49     INFO -     rsp = 0x00007fffc29753e0   r12 = 0x0003001100000000
[task 2021-06-18T22:24:49.104Z] 22:24:49     INFO -     r13 = 0x0000000000000001   r14 = 0x00007fdacfdea340
[task 2021-06-18T22:24:49.104Z] 22:24:49     INFO -     r15 = 0x00007fdaca776000   rip = 0x00007fdae1585e0c
[task 2021-06-18T22:24:49.105Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.105Z] 22:24:49     INFO - 10  libxul.so!mozilla::dom::Document::LocalizationLinkAdded(mozilla::dom::Element*) [Document.cpp:b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1 : 4178 + 0x7a]
[task 2021-06-18T22:24:49.105Z] 22:24:49     INFO -     rbx = 0x00007fffc29758a8   rbp = 0x00007fffc29759c0
[task 2021-06-18T22:24:49.106Z] 22:24:49     INFO -     rsp = 0x00007fffc29758a0   r12 = 0x0003001100000000
[task 2021-06-18T22:24:49.106Z] 22:24:49     INFO -     r13 = 0x0000000000000001   r14 = 0x00007fdacfdea340
[task 2021-06-18T22:24:49.107Z] 22:24:49     INFO -     r15 = 0x00007fdaca776000   rip = 0x00007fdaddb616c9
[task 2021-06-18T22:24:49.107Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.108Z] 22:24:49     INFO - 11  libxul.so!mozilla::dom::HTMLLinkElement::BindToTree(mozilla::dom::BindContext&, nsINode&) [HTMLLinkElement.cpp:b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1 : 100 + 0xb]
[task 2021-06-18T22:24:49.108Z] 22:24:49     INFO -     rbx = 0x00007fdacfacb4c0   rbp = 0x00007fffc2975a10
[task 2021-06-18T22:24:49.109Z] 22:24:49     INFO -     rsp = 0x00007fffc29759d0   r12 = 0x00007fdac8c83ae0
[task 2021-06-18T22:24:49.109Z] 22:24:49     INFO -     r13 = 0x0000000000000001   r14 = 0x0000000000000000
[task 2021-06-18T22:24:49.110Z] 22:24:49     INFO -     r15 = 0x00007fffc2975b50   rip = 0x00007fdade9617e9
[task 2021-06-18T22:24:49.110Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.110Z] 22:24:49     INFO - 12  libxul.so!nsINode::InsertChildBefore(nsIContent*, nsIContent*, bool, mozilla::ErrorResult&) [nsINode.cpp:b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1 : 1544 + 0x13]
[task 2021-06-18T22:24:49.111Z] 22:24:49     INFO -     rbx = 0x00007fdac8c84100   rbp = 0x00007fffc2975b90
[task 2021-06-18T22:24:49.111Z] 22:24:49     INFO -     rsp = 0x00007fffc2975a20   r12 = 0x00007fdac8c024b8
[task 2021-06-18T22:24:49.111Z] 22:24:49     INFO -     r13 = 0x0000000000000001   r14 = 0x00007fdac8c83ae0
[task 2021-06-18T22:24:49.111Z] 22:24:49     INFO -     r15 = 0x00007fdac8c02480   rip = 0x00007fdaddc5b701
[task 2021-06-18T22:24:49.112Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.112Z] 22:24:49     INFO - 13  libxul.so!nsINode::ReplaceOrInsertBefore(bool, nsINode*, nsINode*, mozilla::ErrorResult&) [nsINode.cpp:b8e6510cd9b750fbb6902e833df2b1c4a6edc5c1 : 2712 + 0x1f]
[task 2021-06-18T22:24:49.112Z] 22:24:49     INFO -     rbx = 0x00007fdaca776000   rbp = 0x00007fffc2975e30
[task 2021-06-18T22:24:49.113Z] 22:24:49     INFO -     rsp = 0x00007fffc2975ba0   r12 = 0x00007fdac8c02480
[task 2021-06-18T22:24:49.113Z] 22:24:49     INFO -     r13 = 0x00007fffc2975e48   r14 = 0x00007fdac8c83ae0
[task 2021-06-18T22:24:49.113Z] 22:24:49     INFO -     r15 = 0x0000000000000000   rip = 0x00007fdaddc5e16c
[task 2021-06-18T22:24:49.114Z] 22:24:49     INFO -     Found by: call frame info
[task 2021-06-18T22:24:49.114Z] 22:24:49     INFO - 14  libxul.so!mozilla::dom::Node_Binding::appendChild(JSContext*, JS::Handle<JSObject*>, void*, JSJitMethodCallArgs const&) [NodeBinding.cpp: : 992 + 0xf]
[task 2021-06-18T22:24:49.114Z] 22:24:49     INFO -     rbx = 0x00007fdac8c83ae0   rbp = 0x00007fffc2975eb0
[task 2021-06-18T22:24:49.114Z] 22:24:49     INFO -     rsp = 0x00007fffc2975e40   r12 = 0xfffe000000000000
[task 2021-06-18T22:24:49.115Z] 22:24:49     INFO -     r13 = 0x00007fffc2975f10   r14 = 0x00007fdad1523000
[task 2021-06-18T22:24:49.115Z] 22:24:49     INFO -     r15 = 0x00007fdac8c02480   rip = 0x00007fdadded6762
[task 2021-06-18T22:24:49.115Z] 22:24:49     INFO -     Found by: call frame info

The particular source line is https://hg.mozilla.org/try/file/3d5d1b1619ed5d6496e6b0dd2a40a07af5d01699/intl/l10n/rust/localization-ffi/src/lib.rs#l152

The cause of the crash seems to be that there is a call to add_resource_id while an async format_messages is being executed.

I am able to reproduce the crash locally if I apply the patchset and run the incriminated test with --verify.

I am not clear on how multi-threaded Gecko processes are and in what scenario a single document's Localization can be executing format_messages and in the middle call add_resource_id.

It may be there this is some flaw fixable with a switch from Rc to Arc, maybe? Or it may be that this is an inherent flaw in fluent-fallback.

In particular, the way we handle that in JS implementation is that we generate the cached iter/stream, and use it for all calls until the state is updated (state = sync + res_ids). When that happens, we regenerate the iter/stream and the next call to format_messages will use that, but the "old" ones that were in the middle when the state update happened will continue to use the old iter/stream.

That seems to be really hard to reproduce in Rust because we went far into no-alloc land which means that the format_messages relies on FluentBundle being allocated in Localization's cache, and that, in turn, requires that the iter/stream remains available until format_messages ends.

In this PR I did a poor solution to the problem - I keep the old iter/stream, and add a "new" one on top to a ChunkVec. This allows me to use the new one in the consequitive calls, while the old one remains available for calls that used it.

What I'm missing is some way of expressing that when nothing requires the old one anymore (all format_message calls that relied on it completed), I'd like to drop it to avoid accumulating memory.

I tried several approaches with Rc/Weak but that didn't work well with the Cow's returned from the method.

I'm not even sure if this is a sound approach, but I'd like to dump it here and ask for help in evaluating whether this is the right approach to resolve the issue.

zbraniecki commented 3 years ago

The final solution is based on the suggestion by @emilio to separate out an Rc<Bundles>> and let the user maintain its liveliness through the use case.

In perf tests against master, this model does not incure any perf hit and it does fix the bug!