Closed Manishearth closed 2 years ago
I'd suggest taking a similar approach to Pattern and Skeleton, where the JSON representation is human-readable and the Bincode representation is pre-parsed. When JSON gets read in, the strings should be parsed; when Bincode gets read in, you just need to point to the data.
We'll work on https://github.com/unicode-org/icu4x/issues/663 first so that we know what FFI footguns are in the offing with this. It's possible to work on this now (and folks should feel free to pick it up!), but FFI may break.
Here's a model for how to represent plural rules as a zero-copy data structure: VarZeroVec<AndOrRelation>, which is a VarZeroVec<ZeroVec> with some metadata at each level.
This covers all cases, is compatible with UTS 35, and does not require infinite nesting.
Here is AndOrRelation:
enum AndOr { And, Or }
enum Polarity { Positive, Negative }
struct AndOrRelation {
    and_or: AndOr, // first entry is Or
    operand: PluralOperand, // i, u, v, f, ...
    modulo: u32,
    polarity: Polarity,
    range_list: ZeroVec<RangeOrValue>,
}
And RangeOrValue:
enum RangeOrValue {
    Range(u32, u32),
    Value(u32),
}
I claim that the algorithm is just as fast as a more highly nested AST structure. Pseudo-code:
Let result = False.
For each relation in relation list:
    If relation.and_or == "and" and result == False:
        Next iteration.
    If relation.and_or == "or" and result == True:
        Return True.
    result = relation.compute(n).
Return result.
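To make the pseudo-code concrete, here is a rough Rust sketch of the flat evaluation loop. It is only an illustration under a few assumptions of mine: it uses a plain Vec instead of ZeroVec, it drops the operand field and only looks at an integer n, it treats a modulo of 0 or 1 as "no modulo", and the names matches and example_rule are hypothetical. The rule in example_rule is made up for the illustration.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum AndOr { And, Or }

#[derive(Clone, Copy, PartialEq, Debug)]
enum Polarity { Positive, Negative }

#[derive(Clone, Copy, Debug)]
enum RangeOrValue {
    Range(u32, u32), // assumed inclusive on both ends, as in "3..4"
    Value(u32),
}

// Simplified stand-in for the AndOrRelation above: no operand, Vec instead of ZeroVec.
struct AndOrRelation {
    and_or: AndOr,
    modulo: u32,
    polarity: Polarity,
    range_list: Vec<RangeOrValue>,
}

impl AndOrRelation {
    // Evaluates a single relation such as "n % 10 = 2..4" for the given n.
    fn compute(&self, n: u32) -> bool {
        // Assumption: 0 or 1 means "no modulo".
        let value = if self.modulo > 1 { n % self.modulo } else { n };
        let in_list = self.range_list.iter().any(|item| match *item {
            RangeOrValue::Range(lo, hi) => (lo..=hi).contains(&value),
            RangeOrValue::Value(v) => v == value,
        });
        // Positive polarity is "=", negative polarity is "!=".
        in_list == (self.polarity == Polarity::Positive)
    }
}

// The flat loop from the pseudo-code: "and" binds tighter than "or".
fn matches(relations: &[AndOrRelation], n: u32) -> bool {
    let mut result = false;
    for relation in relations {
        match relation.and_or {
            // The current "and" chain already failed: skip to the next "or".
            AndOr::And if !result => continue,
            // A whole "and" chain succeeded: the rule matches.
            AndOr::Or if result => return true,
            _ => {}
        }
        result = relation.compute(n);
    }
    result
}

// Illustrative rule: "n = 1 or n % 10 = 2..4 and n % 100 != 12..14".
fn example_rule() -> Vec<AndOrRelation> {
    vec![
        AndOrRelation {
            and_or: AndOr::Or, // first entry is Or
            modulo: 1,
            polarity: Polarity::Positive,
            range_list: vec![RangeOrValue::Value(1)],
        },
        AndOrRelation {
            and_or: AndOr::Or,
            modulo: 10,
            polarity: Polarity::Positive,
            range_list: vec![RangeOrValue::Range(2, 4)],
        },
        AndOrRelation {
            and_or: AndOr::And,
            modulo: 100,
            polarity: Polarity::Negative,
            range_list: vec![RangeOrValue::Range(12, 14)],
        },
    ]
}
```

With this sketch, matches(&example_rule(), 22) is true and matches(&example_rule(), 12) is false, and the loop never needs more than one boolean of state, which is why a nested AST should not be required.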
Relation can be represented in 5 bytes plus the ZeroVec:
* The modulo could likely be compacted further, given that virtually all modulos are on powers of 10.
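As a rough idea of what compacting the modulo could look like, assuming every modulo we care about is a power of 10, one could store only the exponent. This is a sketch, not a proposal for the final layout; pack_modulo and unpack_modulo are hypothetical names, and a modulo that is not a power of 10 would need some fallback that this sketch does not cover.

```rust
// Returns the exponent k such that modulo == 10^k, or None if the modulo
// is 0 or not a power of 10. Note that 1 == 10^0 packs to Some(0).
fn pack_modulo(modulo: u32) -> Option<u8> {
    if modulo == 0 {
        return None;
    }
    let mut remaining = modulo;
    let mut exponent = 0u8;
    while remaining % 10 == 0 {
        remaining /= 10;
        exponent += 1;
    }
    if remaining == 1 { Some(exponent) } else { None }
}

// Inverse of pack_modulo; None if 10^exponent does not fit in a u32.
fn unpack_modulo(exponent: u8) -> Option<u32> {
    10u32.checked_pow(u32::from(exponent))
}
```

Since 10^9 is the largest power of 10 that fits in a u32, the exponent itself fits in 4 bits.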
RangeOrValue can be represented in 8 bytes:
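One possible 8-byte layout, given here as an assumption rather than as the layout intended above, is two little-endian u32 values (start and end), with a single value stored as a range whose start equals its end. The function names encode and decode are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum RangeOrValue {
    Range(u32, u32),
    Value(u32),
}

// Packs the item into 8 bytes: start (u32 LE) followed by end (u32 LE).
fn encode(item: RangeOrValue) -> [u8; 8] {
    let (start, end) = match item {
        RangeOrValue::Range(start, end) => (start, end),
        // Assumption: a single value is stored as the range start..start.
        RangeOrValue::Value(value) => (value, value),
    };
    let mut out = [0u8; 8];
    out[..4].copy_from_slice(&start.to_le_bytes());
    out[4..].copy_from_slice(&end.to_le_bytes());
    out
}

// Reads the 8 bytes back; equal endpoints decode as a single value.
fn decode(bytes: [u8; 8]) -> RangeOrValue {
    let start = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    let end = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    if start == end {
        RangeOrValue::Value(start)
    } else {
        RangeOrValue::Range(start, end)
    }
}
```

A side effect of this layout is that Range(5, 5) decodes as Value(5); the two match the same numbers, so nothing is lost semantically.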
Rule string: "n % 10 = 3..4,9 and n % 100 != 10..19,70..79,90..99 or n = 0"
This rule string contains 3 operations. A JSON-like expansion into the above schema would be:
[
  {
    and_or: "or", // first entry is "or"
    operand: "n",
    modulo: 10,
    polarity: "positive",
    range_list: [
      [3, 4],
      9
    ]
  },
  {
    and_or: "and",
    operand: "n",
    modulo: 100,
    polarity: "negative",
    range_list: [
      [10, 19],
      [70, 79],
      [90, 99]
    ]
  },
  {
    and_or: "or",
    operand: "n",
    modulo: 1,
    polarity: "positive",
    range_list: [
      0
    ]
  }
]
The bytes:
Total: 75 bytes.
For comparison, the string is 60 bytes. So we are a bit bigger, but not too much bigger, and there are opportunities to optimize the byte length:
With these optimizations, the byte length would become:
Total: 38 bytes! Smaller than even the string representation.
Note: The above size does not include the VarZeroVec's own header, which will likely incur another 16ish bytes.
This sounds like it could benefit from the custom derive, though a couple of issues are that it's harder to achieve bitpacking with a custom derive, and I don't think a custom derive for AsULE can handle enums.
So it might be better to write some custom packed ULE types.
@zbraniecki https://github.com/unicode-org/icu4x/issues/1078 is resolved, hopefully that unblocks this
Fixed in #1240
Currently they need to be further parsed into a PluralRuleList. It would be nice if that were handled by the data provider itself. We would have to potentially provide utility functions for converting between a PluralRulesStrings and a final parsed PluralRuleList, or perhaps make it so that data providers that can provide PluralRulesStrings can automatically provide a PluralRuleList.
It would also be good if PluralRuleList and RulesSelector could use Cows and borrow from the data provider such that
cc @sffc
The lower level API in https://github.com/unicode-org/icu4x/pull/575 will also need to be updated to handle this.