unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

How should we build static data? #78

sffc closed this issue 3 years ago

sffc commented 4 years ago

When we don't want to perform file or network I/O, it may be necessary to build the data into a binary format that can be built into the Rust static memory space.

We can use a solution like CBOR or bincode.

However, these solutions require parsing the data into Rust objects at runtime. That should be pretty quick, and string buffers could point into the static memory, but a pure Rust static structure would likely be the fastest. That said, I have not yet done code size or performance testing.
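To illustrate the "strings pointing into static memory" idea, here is a minimal stdlib-only sketch (all names and the NUL-delimited format are hypothetical; a real implementation would use a serialization crate like bincode and likely `include_bytes!`):

```rust
// Hypothetical data baked into the binary's static memory at build time.
// In practice this blob could come from include_bytes!("data.bin").
static DATA: &[u8] = b"en\x00fr\x00de\x00";

/// Split the NUL-delimited static blob into &'static str slices that point
/// directly into static memory -- no heap allocation for the string data.
fn load_locales() -> Vec<&'static str> {
    DATA.split(|&b| b == 0)
        .filter(|s| !s.is_empty())
        .map(|s| std::str::from_utf8(s).expect("static data must be valid UTF-8"))
        .collect()
}

fn main() {
    let locales = load_locales();
    assert_eq!(locales, ["en", "fr", "de"]);
    println!("{:?}", locales);
}
```

The `Vec` here only holds fat pointers; the actual string bytes are never copied out of the binary's static memory.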

zbraniecki commented 4 years ago

I've been successfully using https://crates.io/crates/bincode for that purpose in my experimental datetime impl - https://github.com/zbraniecki/unic-datetime

Not sure what the pros/cons of either method are, but the numbers in my measurements compare bincode vs. JSON (both via serde).

Since those two are very much alike, I hope we'll be able to load either JSON or bincode resources provided with the library.

Separately, we could also make it possible to produce an .rs file that stores the structs. Here's an example of Polish datetime CLDR data generated from JSON into a Rust file: https://github.com/zbraniecki/unic-datetime/blob/master/src/data/generated/pl.rs

sffc commented 4 years ago

What are the pros and cons of using something like bincode or CBOR versus a .rs file of structs?

zbraniecki commented 4 years ago

Pros:

Cons:

My thinking was that this would be useful for scenarios where we want some basic data to fall back on when dynamically loaded data fails, where the data changes very rarely, and where the crate may be useful with that small chunk of data available at all times.

Example: the list of RTL languages for directionality.

sffc commented 4 years ago

Is there a way to auto-generate .rs files from a JSON file conforming to Serde? Maybe using a proc macro?

zbraniecki commented 4 years ago

I don't know of one, but it should be relatively easy to write one. I must admit the method I used is very ugly, and I wouldn't recommend it. I basically println! the Rust file.

If I were to redo it, I'd likely use quote to produce Rust or write my own serde serializer to rust structs.
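For illustration, here is a minimal stdlib-only sketch of the "println! the Rust file" codegen approach described above (module and constant names are hypothetical; a real tool would use `quote` or a custom serde serializer):

```rust
use std::collections::BTreeMap;

/// Turn parsed key/value data into Rust source text defining a static array.
/// This mirrors the "print the Rust file" approach; the output would be
/// written to an .rs file and compiled into the binary.
fn generate_rs(module_name: &str, data: &BTreeMap<&str, &str>) -> String {
    let mut out = String::new();
    out.push_str(&format!("pub mod {} {{\n", module_name));
    out.push_str(&format!(
        "    pub static DATA: [(&str, &str); {}] = [\n",
        data.len()
    ));
    for (k, v) in data {
        // {:?} on &str emits a valid, escaped Rust string literal.
        out.push_str(&format!("        ({:?}, {:?}),\n", k, v));
    }
    out.push_str("    ];\n}\n");
    out
}

fn main() {
    let mut data = BTreeMap::new();
    data.insert("one", "i = 1 and v = 0");
    let src = generate_rs("pl", &data);
    assert!(src.contains("pub mod pl {"));
    assert!(src.contains("(\"one\", \"i = 1 and v = 0\"),"));
    println!("{src}");
}
```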

@Manishearth - do you know of anyone who's done it before? Any libs?

Manishearth commented 4 years ago

For specific structs you should probably write your own deserializer. I don't think there's anything for that in general, but it can be written.

Manishearth commented 4 years ago

Actually, I'd just parse it into a hashmap and operate on that. No need to write a deserializer, those are tricky.

sffc commented 4 years ago

I'm surprised that there apparently isn't an existing crate to generate build-time Rust structures in static memory from a JSON file.

I guess we should either:

  1. Write such a crate as a standalone Serde plugin (intern project?).
  2. Punt on this until someone else writes such a crate.
zbraniecki commented 4 years ago

I agree. It would be sweet to have a serde plugin that can output Rust. I think the challenge is that the Rust it outputs may be project-dependent, but I'm comfortable delaying that experiment until we have some serde output for CBOR/bincode.

I've written some manual serde deserializers, so may dive into it one day :)

sffc commented 4 years ago

I'm unassigning myself. For the time being, we can use a JSON string or CBOR/bincode. In the medium term, we should have the auto-generated Rust code, and we should evaluate to what degree that is faster than deserializing JSON/CBOR/bincode at runtime.

Could be an intern project, or could be something good for the Mozilla team, since it is more closely related to the Rust ecosystem and likely wants to be its own project.

sffc commented 4 years ago

Moved to backlog.

sffc commented 4 years ago

@zbraniecki saw this post on Reddit: https://www.reddit.com/r/rust/comments/gud6c1/introducing_treebuf/

zbraniecki commented 4 years ago

As part of #116 I started looking into performance of various resource types based on https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/plurals.json and https://github.com/unicode-cldr/cldr-core/blob/master/supplemental/ordinals.json .

The branch I'm working on supports parsing PluralRules at runtime (I expect to add the ability to load inlined data as well) from inlined strings, JSON, Bincode, or CBOR.

I used my harness from intl-measurements with runtime timing (not criterion) to emulate the cost of loading data just to format 31 numbers in 10 locales. The data is not at all optimized, so these numbers are only useful for comparing JSON/Bincode/CBOR; they are not representative of the performance I expect out of PluralRules when the PR is ready.

I also compared the size of resource files for cardinals.

| Type | Time | Size |
| --- | --- | --- |
| Inline String | 39344 ns | ? |
| Bincode | 137845 ns | 57 KB |
| CBOR | 172867 ns | 59 KB |
| JSON | 201755 ns | 69 KB |

My conclusion is that CBOR is slower than Bincode at deserialization, and Bincode gives us smaller resource files (though only slightly).

sffc commented 4 years ago

Agreed that Bincode looks nicer based on this initial data.

What are the runtime numbers for, exactly? Are you parsing all of the cardinal rules for all locales? 137845 ns is faster than CBOR and JSON, but still really slow in absolute terms.

zbraniecki commented 4 years ago

Here's the code: https://github.com/zbraniecki/intl-measurements/blob/master/unic/pluralrules/src/main.rs#L52-L61

So, it's 10x loading, parsing, and resolving for 31 sample numbers.

sffc commented 4 years ago

OK, thanks for that.

  1. Is it correct to say that the runtime for the actual .select() function should not be affected by which strategy is used for data serialization?
  2. It would be good to get data for PluralRules::create by itself, including loading from a static string, so that we can better break down these numbers into what part of the stack they originate from:
    1. Deserializing the data format (JSON, CBOR, or Bincode) into a rule string
    2. Parsing the rule string
    3. Evaluating on .select()
zbraniecki commented 4 years ago
  1. Is it correct to say that the runtime for the actual .select() function should not be affected by which strategy is used for data serialization?

Yes.

  2. It would be good to get data for PluralRules::create by itself, including loading from a static string, so that we can better break down these numbers into what part of the stack they originate from.

The inline string data point is for when there's no I/O and no Resource parsing.

I measured just selecting for 10x31. It takes ~8000 ns.

Parsing the rule string for 10x31 is ~30000 ns.

So, subtracting that shared piece from my numbers, loading the data into this structure: https://github.com/zbraniecki/icu4x/blob/pluralrules/components/pluralrules/src/data/json.rs#L12-L26

takes:

zbraniecki commented 4 years ago

Also, as I said in my earlier post, I have a number of low-hanging fruit for perf improvements in parsing and selecting. The 30000 ns for 10 locales x 2 rules (cardinal/ordinal) (which means ~1500 ns per rule) and the 8000 ns for 10x31x2 selections (which means ~13 ns per selection) are still subject to further optimization.

The manual parser I wrote there is ~3 times faster than the nom-based parser generator we used in https://github.com/zbraniecki/pluralrules, but it can still be faster :)

My comment here is mostly about the conclusion regarding JSON/Bincode/CBOR loading/deserializing using serde/bincode. It's very consistent with my DateTime results comparing JSON/Bincode/Inline - https://github.com/zbraniecki/intl-measurements/ (30us slower)

sffc commented 4 years ago

Moving the PluralRules-specific stuff to #130

zbraniecki commented 4 years ago

I found two crates that may be able to generate Rust - https://lib.rs/crates/serde-generate and https://github.com/udoprog/genco

sffc commented 4 years ago

We should use StaticDataProvider in example code so that the examples aren't dependent on the filesystem machinery.

sffc commented 3 years ago

Consider whether rkyv can be used here: https://davidkoloski.me/rkyv/rkyv.html

sffc commented 3 years ago

Here's what I'm currently thinking.

Sticking with a serialization/deserialization framework seems like the most robust and future-proof option. Rust code generation is nice, but there are problems with it, such as wanting to share data across multiple ICU4X instances.

I'm thinking we should take a two-layered approach to the static (filesystem-less) serialization:

  1. Map from resource paths (keys and options) to buffers
  2. Map from buffers to structs

Here's what I mean. On the filesystem, the resource paths are encoded into the filename. However, the schema defining what's inside of those files is not known until the data is requested. I think we should do the same thing with static data: the main data structure maps from resource paths to opaque buffers, and then when the data is requested, we map from the opaque buffer to the specific data struct.

Advantages of this model:

  1. Should help with startup time, because we don't need to eagerly deserialize all the data when the app is first loaded.
  2. The data provider can be agnostic to the identity of the data in the buffers, increasing modularity.

Disadvantages:

  1. May increase the per-instance creation time if no data cache is used.
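The two-layered model above can be sketched roughly as follows (all names are hypothetical and the "deserialization" is simplified to UTF-8 decoding; the real DataProvider API is different):

```rust
use std::collections::HashMap;

/// Layer 1: a map from resource paths to opaque buffers. The provider is
/// agnostic to what schema is inside each buffer.
struct BlobProvider {
    blobs: HashMap<String, Vec<u8>>,
}

impl BlobProvider {
    /// Look up the opaque buffer for a resource path (key + options).
    fn get_buffer(&self, path: &str) -> Option<&[u8]> {
        self.blobs.get(path).map(|v| v.as_slice())
    }
}

/// Layer 2: the caller, which knows the schema, decodes the buffer into a
/// concrete struct only when the data is actually requested (lazily).
fn load_symbol(provider: &BlobProvider, path: &str) -> Option<String> {
    let buf = provider.get_buffer(path)?;
    String::from_utf8(buf.to_vec()).ok()
}

fn main() {
    let mut blobs = HashMap::new();
    // Hypothetical resource path; real paths are structured keys + options.
    blobs.insert("decimal/symbols@1/en".to_string(), b"1,234.56".to_vec());
    let provider = BlobProvider { blobs };
    assert_eq!(
        load_symbol(&provider, "decimal/symbols@1/en").as_deref(),
        Some("1,234.56")
    );
}
```

Nothing is deserialized at startup; layer 2 runs per request, which is what gives the startup-time advantage at the cost of per-instance creation time.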
sffc commented 3 years ago

Zibi linked the following benchmark comparison between deserialization frameworks: https://davidkoloski.me/blog/rkyv-is-faster-than/

Postcard (https://docs.rs/postcard/0.5.2/postcard/) looks quite promising, as well as rkyv.

zbraniecki commented 3 years ago

@Manishearth said he'll take on investigating postcard as a replacement for bincode, and rkyv as an option for static data.

sffc commented 3 years ago

I did some experiments with Rkyv, Bincode, and Postcard on LiteMap, mainly focusing on code size. I compressed the example data in language_names_lite_map.rs into each of the three binary serialization formats under consideration.

Bincode example:

const BYTES: [u8; 278] = [ ... ];
let map: LiteMap<&str, &str> = bincode::deserialize(&BYTES).unwrap();

Postcard example:

const POSTCARD: [u8; 117] = [ ... ];
let map: LiteMap<&str, &str> = postcard::from_bytes(&POSTCARD).unwrap();

Rkyv example:

const BYTES: Aligned<[u8; 280]> = Aligned([ ... ]);
let archived = unsafe { archived_root::<LiteMap<String, String>>(&BYTES.0) };
let deserialized = archived.deserialize(&mut AllocDeserializer).unwrap();

Comparison:

| | Bincode | Postcard | Rkyv |
| --- | --- | --- | --- |
| Data Size | 278 B | 117 B | 280 B |
| Deserializer Code Size | 24449 B | 25923 B | 8052 B |

So Postcard wins on data size, and Rkyv on code size. Most of the code size in both Bincode and Postcard is Serde error handling stuff.

Here are some caveats of Rkyv. UPDATE: I did some more experimenting and am updating the caveats that I had posted earlier.

The crucial architectural difference between Serde and Rkyv is in how they deal with deserialization types. In Serde, the data struct has a lifetime parameter, and zero-copy deserialized data gets referenced from the data struct. In Rkyv, however, zero-copy deserialized data does not go into the main data struct; it goes into a special intermediate struct with the same fields as the main struct but wrapped in RelPtr.

Rkyv supports serializing data structs that have a lifetime parameter, but doesn't appear to support deserializing into them; you must use the Archive type with RelPtr instead of the real data struct.

I don't know what this would mean for the data provider pipeline, which up to this point has always assumed that the original data struct type is the one being passed around. The only solution I can think of right now is to make data struct types generic in formatters, such that something like DateTimeFormat can own either the plain data struct or the Rkyv version of the data struct.

zbraniecki commented 3 years ago

Have you managed to use the rkyv release from 8 hours ago - https://github.com/djkoloski/rkyv/releases/tag/v0.5.0 ? If not, would it be hard to reproduce your results with it?

Also, seems like the author is seeking feedback for their roadmap - https://github.com/djkoloski/rkyv/issues/67

sffc commented 3 years ago

Yes, actually I do believe I was using version 0.5 in my testing. I picked whatever was the latest version on crates.io when I started, which was probably 30 minutes after it was pushed. :smiley:

Thanks for the link; I'll try to organize my thoughts into suggestions on Rkyv.

sffc commented 3 years ago

It's a new day and I'm having trouble reproducing the results I posted last night. Here is what I'm getting this morning for deserializer code size:

| Bincode | Postcard | Rkyv |
| --- | --- | --- |
| 24083 B | 11274 B | 6913 B |

Will continue investigating.

sffc commented 3 years ago

Some more data:

| | Bincode | Postcard | Rkyv (Deserialize) | Rkyv (Archive) |
| --- | --- | --- | --- | --- |
| Code Size | 24083 B | 11274 B | 6913 B | 463 B |
| Data Size | 278 B | 117 B | 280 B | 280 B |
| Max Heap Size | 352 B | 352 B | 629 B | 0 B |
| Instruction Count | 121228 Ir | 129691 Ir | 126651 Ir | 2984 Ir |

Looking inside the WebAssembly bytecode:

More background on the tests:

You can see the code here:

https://github.com/sffc/omnicu/tree/static-testing/utils/litemap/benches/bin

sffc commented 3 years ago

@zbraniecki I posted some thoughts in https://github.com/djkoloski/rkyv/issues/67. Feel free to add more as you see fit. Thanks for the suggestion!

sffc commented 3 years ago

@sffc posted the following in Slack:

Okay, so my conclusion from my Rkyv investigation is that I think we could make Rkyv work, and it would give us quite compelling results in code size and performance. The downsides are:

  • Data size is mostly on par with Bincode, but rather bloated when compared to Postcard
  • It requires rethinking a number of the assumptions that went into the current design of the data provider

How do you suggest we proceed? Note that because of (2), I don't see a path toward a "proof of concept" of Rkyv in data provider without essentially writing the final solution.

Conclusion:

sffc commented 3 years ago

FYI, I wrote up the following example of how we could structure data structs if we went the Rkyv route.

use std::borrow::Borrow;
use std::borrow::Cow;

#[derive(serde::Serialize, serde::Deserialize)]
#[derive(rkyv::Serialize, rkyv::Archive)]
pub struct DataStruct {
    message: String,
}

trait DataStructTrait {
    fn message(&self) -> &str;
}

impl DataStructTrait for DataStruct {
    fn message(&self) -> &str {
        &self.message
    }
}

impl DataStructTrait for ArchivedDataStruct {
    fn message(&self) -> &str {
        self.message.as_str()
    }
}

fn demo(data: &(impl DataStructTrait + ?Sized)) {
    println!("{}", data.message());
}
Manishearth commented 3 years ago

trait DataStructTrait {
    fn message(&self) -> &str;
}

impl DataStructTrait for DataStruct {
    fn message(&self) -> &str {
        &self.message
    }
}

impl DataStructTrait for ArchivedDataStruct {
    fn message(&self) -> &str {
        self.message.as_str()
    }
}

We can also write our own proc macro that generates these accessors, if we need. This might make it easier to incrementally add rkyv without breaking all the other code.

sffc commented 3 years ago

Okay, so taking a step back. There are two main data types we want to support efficiently:

  1. Long lists of integers (for TinyStr, UnicodeSet inversion list, CodePointTrie)
  2. Long lists of strings (for most other locale data)

After discussing this at length with @Manishearth, here's a possible direction that satisfies our needs.

Lists of integers

Neither Serde nor Rkyv is perfect with integer lists out of the box. Serde only supports lists of u8, choosing not to concern itself with alignment and endianness. Rkyv uses padding to align integers properly (wasting space), and declares endianness out of scope, meaning that we would need to handle it externally, such as by shipping separate big-endian and little-endian data bundles.

Our idea (mainly @Manishearth's) to serialize a list of u32s is as follows:

  1. Create a new type, named something like UnalignedU32, represented as [u8; 4] under the hood, where the bytes are stored in a constant order (probably little-endian).
  2. Add lots of convenience methods to convert between UnalignedU32 and normal u32.
  3. To serialize a &[u32], cast it to a &[UnalignedU32] and then serialize it as a byte stream. To deserialize, cast the &[u8] to a &[UnalignedU32], which can then be passed to anything that accepts a &[T] where T: Into<u32> + Copy.
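A minimal sketch of the idea (the type name comes from the discussion above; the method set is illustrative, and the deserialization side here copies for simplicity where the real design would reinterpret the byte slice in place):

```rust
/// Four bytes stored in a fixed (little-endian) order with alignment 1,
/// so a slice of these can live at any offset inside a serialized buffer.
#[derive(Copy, Clone, PartialEq, Debug)]
#[repr(transparent)]
struct UnalignedU32([u8; 4]);

impl From<u32> for UnalignedU32 {
    fn from(n: u32) -> Self {
        Self(n.to_le_bytes())
    }
}

impl From<UnalignedU32> for u32 {
    fn from(n: UnalignedU32) -> u32 {
        u32::from_le_bytes(n.0)
    }
}

/// Read a byte stream back as u32s. A zero-copy version would cast
/// &[u8] to &[UnalignedU32] instead of collecting into a Vec.
fn parse_u32s(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let serialized: Vec<u8> = [1u32, 0xDEAD_BEEF]
        .iter()
        .flat_map(|n| n.to_le_bytes())
        .collect();
    assert_eq!(parse_u32s(&serialized), vec![1, 0xDEAD_BEEF]);
    assert_eq!(u32::from(UnalignedU32::from(42)), 42);
}
```

Because the byte order is fixed at serialization time, the same data file works on big-endian and little-endian targets.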

It's unclear how we would support lists of TinyStr4 (or 8, 16) in this model. Some options:

  1. Create a new type UnalignedTinyStr4
  2. Ditch TinyStr4 in these contexts and use [u8; N] instead, where N is the desired string length

Lists of strings

Serde does not really have any good out-of-the-box solution for zero-copy lists of strings. Rkyv uses an ArchivedVec<ArchivedString> (I believe), which allows random-access into the vector followed by a relative pointer lookup to the start of the string.

The basic model we came up with for zero-copy lists of strings is as follows:

  1. Replace &[&str] (or Vec<String>) with ByteSliceVec<T> where T: TryFrom<&[u8]>
  2. Under the hood, represent ByteSliceVec as a &[u8] where the first N bytes are offsets and the remaining bytes contain the string data (the indices and lengths could be either 1-byte or 2-byte, supporting lists of length 256 or 65536)
  3. Accesses into the ByteSliceVec<T> will fetch the appropriate &[u8] and then pass it through TryFrom. If desired, we could instead pre-validate the strings in the Deserialize impl of ByteSliceVec<T> and then use unsafe to convert the &[u8] to &str during data access to speed things up.

In the future, this model could be generalized to support interning/pooling of strings like ICU resource bundles are doing.
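To make the layout concrete, here is an illustrative model using 1-byte end offsets: `[count, end_0, end_1, ..., string bytes...]`. This is not the actual ICU4X type, just a sketch of the indexing scheme:

```rust
/// Random-access lookup into a flat [count, end offsets..., data...] buffer.
/// Returns the index-th string as a borrowed slice into the buffer.
fn get(buf: &[u8], index: usize) -> Option<&str> {
    let count = *buf.first()? as usize;
    if index >= count {
        return None;
    }
    let offsets = buf.get(1..1 + count)?;
    let data = buf.get(1 + count..)?;
    // Each entry runs from the previous end offset to its own end offset.
    let start = if index == 0 { 0 } else { offsets[index - 1] as usize };
    let end = offsets[index] as usize;
    std::str::from_utf8(data.get(start..end)?).ok()
}

fn main() {
    // Two strings, "en" and "fr": count = 2, ends at 2 and 4, then the bytes.
    let buf: &[u8] = &[2, 2, 4, b'e', b'n', b'f', b'r'];
    assert_eq!(get(buf, 0), Some("en"));
    assert_eq!(get(buf, 1), Some("fr"));
    assert_eq!(get(buf, 2), None);
}
```

Lookup is O(1): one offset read plus one slice, with no per-string pointers stored in the buffer.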

Pre-validation or post-validation

When performing zero-copy deserialization of types that are not valid for all possible bit patterns (such as chars and enums), it is necessary to validate the data before it makes it into the application. There are two opportunities to validate the data:

  1. During deserialization (serde::Deserialize). We would need to walk the whole byte array and call TryFrom<[u8]> to verify that the entries are all valid.
  2. During lookup. Every access into the vector would need to perform the validation.

I lean toward (1), because it's okay to incur a little bit of extra cost during deserialization if we can win inside tight format loops.
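A sketch of option (1) for an element type that is invalid for some bit patterns (char): walk the byte array once at deserialization time so that later accesses can skip the check. The function name and error shape are hypothetical:

```rust
/// Validate that every 4-byte little-endian entry is a valid char.
/// Run once during deserialization; returns the index of the first
/// invalid entry, if any.
fn validate_chars(bytes: &[u8]) -> Result<(), usize> {
    for (i, chunk) in bytes.chunks_exact(4).enumerate() {
        let n = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        // char::from_u32 rejects surrogates and values above U+10FFFF.
        if char::from_u32(n).is_none() {
            return Err(i);
        }
    }
    Ok(())
}

fn main() {
    let good: Vec<u8> = ['a', '\u{1F600}']
        .iter()
        .flat_map(|c| (*c as u32).to_le_bytes())
        .collect();
    assert_eq!(validate_chars(&good), Ok(()));

    // A lone surrogate (U+D800) is not a valid char.
    let bad = 0xD800u32.to_le_bytes();
    assert_eq!(validate_chars(&bad), Err(0));
}
```

After this one-time pass, lookups in the tight formatting loop could convert entries without re-checking (e.g. via an unchecked conversion), which is the win the text describes.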

What this means for ICU4X and DataProvider

The key insight from above is that we can introduce new types that handle alignment and endianness issues while able to be represented in memory as a &[u8], compatible with the Serde data model.

If we had a data struct like

#[derive(serde::Serialize, serde::Deserialize)]
pub struct DataStruct<'s> {
    pub numbers: Vec<u32>,
    #[serde(borrow)]
    pub strings: Vec<Cow<'s, str>>,
}

this would become something like

#[derive(serde::Serialize, serde::Deserialize)]
pub struct DataStruct<'s> {
    #[serde(borrow)]
    pub numbers: Cow<'s, [UnalignedU32]>,
    #[serde(borrow)]
    pub strings: BytesSliceVec<'s, str>,
}

Advantages:

Disadvantages:

What all this means is that we would basically be building a handful of helper types for Serde to extend support for zero-copy deserialization into new types. We could put these in a new crate. Crate names could include:

I'm happy to work on these new unaligned types in parallel with the rest of the work in this thread. Now that we have an idea for a direction on how we plan to support this, I feel much more comfortable moving forward with the status quo serializers (Bincode and JSON), and then we can take our time to benchmark the improvements that these new types bring to the table.

sffc commented 3 years ago

Here are some benchmarks that we could plug into to test the new types:

djkoloski commented 3 years ago

Don't mean to butt in, but I'll be adding support for endian-agnostic archives as part of the next rkyv milestone. You'll still have to make sure your own types are endian-agnostic but it should make it much easier to do so.

Manishearth commented 3 years ago

It's unclear how we would support lists of TinyStr4 (or 8, 16) in this model. Some options:

I think we can actually design a generic type that can handle any flat type, if const generics are strong enough. Otherwise, we can basically do the same thing for u32 and TinyStr4


@djkoloski yep, another issue is that we need these types to also be patchable, and that's trickier with rkyv's entirely separate data model.

sffc commented 3 years ago

This is finally done!

https://unicode-org.github.io/icu4x-docs/doc/icu_provider_blob/index.html