PluralRuleStringsV1 should be parsed at data load time

Manishearth commented 3 years ago

Currently they need to be further parsed into a PluralRuleList. It would be nice if that were handled by the data provider itself. We would have to potentially provide utility functions for converting between a PluralRulesStrings and a final parsed PluralRuleList, or perhaps make it so that data providers that can provide PluralRulesStrings can automatically provide a PluralRuleList.

It would also be good if PluralRuleList and RulesSelector could use Cows and borrow from the data provider such that

cc @sffc

The lower level API in https://github.com/unicode-org/icu4x/pull/575 will also need to be updated to handle this.

sffc commented 3 years ago

I'd suggest taking a similar approach as Pattern and Skeleton, where the JSON representation is human-readable, and the Bincode representation is pre-parsed. When JSON gets read in, the strings should be parsed, and when Bincode gets read in, you just need to point to the data.

Manishearth commented 3 years ago

We'll work on https://github.com/unicode-org/icu4x/issues/663 first so that we know what FFI footguns are in the offing with this. It's possible to work on this now (and folks should feel free to pick it up!), but FFI may break.

sffc commented 3 years ago

Here's a model for how to represent plural rules as a zero-copy data structure: VarZeroVec<AndOrRelation>, which is a VarZeroVec<ZeroVec> with some metadata at each level.

This covers all cases, is compatible with UTS 35, and does not require infinite nesting.

Rust Structures

Here is AndOrRelation:

enum AndOr { And, Or };
enum Polarity { Positive, Negative };
struct AndOrRelation {
  and_or: AndOr,  # first entry is Or
  operand: PluralOperand,  # i, u, v, f, ...
  modulo: u32,
  polarity: Polarity,
  range_list: ZeroVec<RangeOrValue>,
};

And RangeOrValue:

enum RangeOrValue {
  Range(u32, u32),
  Value(u32),
}

Algorithm

I claim that the algorithm is just as fast as a more highly nested AST structure. Pseudo-code:

Let result = False.
For each relation in relation list:
  If relation.and_or == "and" and result == False:
    Next iteration.
  If relation.and_or == "or" and result == True:
    Return True.
  result = relation.compute(n).
Return result.

Byte Representation

Relation can be represented in 5 bytes plus the ZeroVec:

and_or: 1 bit
polarity: 1 bit
operand: say 6 bits (2^6 = 64 choices, far more than we currently need to support)
modulo*: 32 bits

* The modulo could likely be compacted further, given that virtually all modulos are on powers of 10.

RangeOrValue can be represented in 8 bytes:

Start of range: u32
End of range or scalar: u32
Discriminant: Whether the start of range and end of range are equal

Example

Rule string: "n % 10 = 3..4,9 and n % 100 != 10..19,70..79,90..99 or n = 0"

This rule string contains 3 operations. A JSON-like expansion into the above schema would be:

[
  {
    and_or: "or",  // first entry is "or"
    operand: "n",
    modulo: 10,
    polarity: "positive",
    range_list: [
      [3, 4],
      9
    ]
  },
  {
    and_or: "and",
    operand: "n",
    modulo: 100,
    polarity: "negative",
    range_list: [
      [10, 19],
      [70, 79],
      [90, 99]
    ]
  },
  {
    and_or: "or",
    operand: "n",
    modulo: 1,
    polarity: "positive",
    range_list: [
      0
    ]
  }
]

The bytes:

First Relation: 25 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length (I would really like to be able to configure this)
- 8*2 bytes for the content of the ZeroVec
Second Relation: 33 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 8*3 bytes for the content of the ZeroVec
Third Relation: 17 bytes
- 5 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 8 bytes for the content of the ZeroVec

Total: 75 bytes.

For comparison, the string is 60 bytes. So we are a bit bigger, but not too much bigger, and there are opportunities to optimize the byte length:

The modulo can probably be stored in a single byte, since almost all instances are 1 to a power of 10
If the ZeroVec has a length of 1, we can probably avoid storing the length by sticking another bit discriminant somewhere in the Relation header
I highly suspect that the we only need u16 instead of u32 for the RangeList pair

With these optimizations, the byte length would become:

First Relation: 14 bytes
- 2 bytes for the Relation header
- 4 bytes for the ZeroVec length (I would really like to be able to configure this)
- 4*2 bytes for the content of the ZeroVec
Second Relation: 18 bytes
- 2 bytes for the Relation header
- 4 bytes for the ZeroVec length
- 4*3 bytes for the content of the ZeroVec
Third Relation: 6 bytes
- 2 bytes for the Relation header
- 4 bytes for the singleton entry of the ZeroVec

Total: 38 bytes! Smaller than even the string representation.

Note: The above size does not include the VarZeroVec's own header, which will likely incur another 16ish bytes.

Manishearth commented 3 years ago

This sounds like it could benefit from the custom derive, though a couple issues are that it's harder to achieve bitpacking with a custom derive, and also I don't think a custom derive for AsULE can handle enums.

So might be better to write some custom packed ULE types.

Manishearth commented 3 years ago

@zbraniecki https://github.com/unicode-org/icu4x/issues/1078 is resolved, hopefully that unblocks this

zbraniecki commented 2 years ago

Fixed in #1240

unicode-org / icu4x