unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Baked data is big, and compiles slowly, for finely sliced data markers #5230

Open sffc opened 1 month ago

sffc commented 1 month ago

icu_datetime compile times have regressed a lot since I added neo datetime data, and https://github.com/unicode-org/icu4x/pull/5221 appears to be choking in CI.

The finely sliced data markers (small data structs designed to work with many data marker attributes) give compelling data sizes and stack sizes in Postcard (https://github.com/unicode-org/icu4x/pull/4818, https://github.com/unicode-org/icu4x/pull/4779). However, in Baked, they significantly increase file size, and the numbers for data size are also not as compelling because baked data includes a lot more pointers (for example, at least 24 bytes for a ZeroVec) which are duplicated for each and every instance of the data struct.

Example data struct that is used in a finely sliced data marker:

pub struct PackedSkeletonDataV1<'data> {
    pub index_info: SkeletonDataIndex,
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub patterns: VarZeroVec<'data, PatternULE>,
}

Some ideas:

  1. Instead of storing many static instances of PackedSkeletonDataV1<'static>, we could instead store many static instances of (SkeletonDataIndex, &[u8]), and build an instance of PackedSkeletonDataV1<'static> at runtime. This is "free", and it should significantly reduce file size, but it causes us to use a Yoke code path.
  2. Make the struct derive VarULE and store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>, and build an instance of PackedSkeletonDataV1<'static> at runtime. This should result in the smallest file size and data size, in line with Postcard sizes, but adds a bit of runtime cost since we need to do a VZV lookup. However, it's only one lookup, and only when the locale was found, so I don't think we should avoid it just for the sake of avoiding it.
  3. Construct static instances via pub fn PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8]), reducing file size and therefore probably compile times without changing any runtime characteristics. See https://github.com/unicode-org/icu4x/issues/2452.

@robertbastian @Manishearth @younies

Manishearth commented 1 month ago

2 sounds compelling if we can make it work cleanly in our baking infra

Manishearth commented 1 month ago

potentially have a flag for data keys that marks them as "VZV packable"

sffc commented 1 month ago

Discussed this briefly with @robertbastian. Some points:

To illustrate that last point:

// Data struct type
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: VarZeroVec<'data, str>,
    pub b: VarZeroVec<'data, str>,
}

// Borrowed type
#[derive(Copy, Clone)]
pub(crate) struct ThingV1Borrowed<'data> {
    pub a: &'data VarZeroSlice<str>,
    pub b: &'data VarZeroSlice<str>,
}

// Example top-level owned type
pub struct ThingFormatter {
    payload: DataPayload<ThingV1Marker>
}

// Example top-level borrowed type
pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>
}

// To get from one to the other
impl ThingFormatter {
    pub fn as_borrowed(&self) -> ThingFormatterBorrowed {
        ThingFormatterBorrowed {
            payload: self.payload.get().as_borrowed(),
        }
    }
}

In the above example, ThingV1Borrowed can always be constructed from borrowed data, so compiled_data constructors directly to the borrowed type can still work fine.

robertbastian commented 1 month ago
  1. (SkeletonDataIndex, &[u8]) is (SkeletonDataIndex, &'data VarZeroSlice<PatternULE>). The generic way to do this would be to have a fully borrowed version of each data struct, instead of always having ZeroVec and Cow at the leaves, which have a cost.
  2. I don't see how this would be done. We have a DataPayload<ExportMarker>, and we can call bake() on the data struct. Then what?
  3. Isn't this the same as (1)?
sffc commented 1 month ago

You're right about &[u8] and &VarZeroSlice being the same. I'll switch to using &VarZeroSlice.

Maybe 1, 2, 3 are better illustrated with examples:

// Option 0, Current (not exactly, but equivalent):
payloads: [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

// Option 1:
payloads: [
    (SkeletonDataIndex::from_raw(...), &'static VarZeroSlice::from_bytes_unchecked(...)),
    // ...
]
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 2:
payloads: VarZeroSlice::from_bytes_unchecked(
    // entries are the VarULE repr of PackedSkeletonDataV1
)
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 3:
impl PackedSkeletonDataV1 {
    pub const unsafe fn from_parts(raw_index_info, raw_patterns) -> Self {
        Self {
            index_info: SkeletonDataIndex::from_raw(raw_index_info),
            patterns: VarZeroVec::from_bytes_unchecked(raw_patterns),
        }
    }
}
payloads: [
    PackedSkeletonDataV1::from_parts(..., ...)
]
// identical runtime characteristics to the current implementation
sffc commented 1 month ago

~~Actually, in most constructors we are already using ZeroFrom, so I'm not convinced that option 1 is actually a regression from the status quo. I think it's basically equivalent; we just store &'static VarZeroSlice instead of VarZeroVec<'static>, which is 1-2 words smaller.~~

robertbastian commented 1 month ago

The baked provider used to use ZeroFrom, but it hasn't since we added DataPayloadInner::StaticRef, which was a change you can detect in benchmarks.

// correction: currently we have a slice of structs, not an array of struct refs 
payloads: &'static [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

Option 3 is equivalent to option 1, because payloads is stored in a const. from_parts will be evaluated at compile time, and the data struct stored in the const. If we instead stored the borrowed version of the data struct in the const, we're back at option 1.

sffc commented 1 month ago

Yes, I briefly forgot about DataPayloadInner::StaticRef. So options 0 and 1 are different, as claimed in the OP.

Option 3 is equivalent to the current solution, option 0 (not option 1), except for file size being slightly smaller.

sffc commented 1 month ago

A more radical solution (bigger change but maybe better outcome) would be to add it to DynamicDataMarker:

pub trait DynamicDataMarker {
    type Borrowed<'a>;
    // not sure if this is the right syntax but you get the idea:
    type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;
}
robertbastian commented 1 month ago

This^ is the only implementation path I see for option 1. What alternative were you thinking of?

sffc commented 1 month ago

The other way to implement option 1 would be to have a databake derive attribute that defines how to get from the static representation to the struct, which we could do as part of #2452. baked_exporter uses it when available, and otherwise serializes the struct directly.

Manishearth commented 1 month ago

One thing databake could potentially do is have a way for the CrateEnv to collect auxiliary codegen, which would work by:

Something like:


struct PackedDataSkeletonV1PatternsAux {
    patterns: Vec<_>,
}

fn bake(&self, env: &CrateEnv) -> TokenStream2 {
    // This uses an anymap or something
    let map = env.get_or_insert::<PackedDataSkeletonV1PatternsAux>(
        PackedDataSkeletonV1PatternsAux::default(),
        // The flush function. This is just an example.
        |aux| quote!(const ALL_THE_PATTERNS = #patterns),
    );
    let start = map.patterns.len();
    map.patterns.extend(self.patterns);
    let end = map.patterns.len();
    quote!(PackedSkeletonDataV1 {
        // ...
        patterns: ALL_THE_PATTERNS[#start..#end],
    })
}

This still needs some way to make the types work; the example above doesn't attempt to address that. But it could help for tricks like "store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>".

sffc commented 1 month ago

This is not 2.0 blocking unless we implement a breaking solution for #5187

sffc commented 1 month ago

(additional discussion not recorded)

Current thinking:

LGTM: @sffc @robertbastian

Manishearth commented 1 month ago

LGTM as well

sffc commented 1 month ago

To add some urgency to this issue, @kartva says:

Compiling icu_experimental on main causes rustc to reliably crash on my laptop due to out-of-memory errors. Are there servers or codespaces that people working on icu4x can use?

(icu_experimental and icu_datetime are the two crates with the most finely sliced baked data)

kartva commented 1 month ago

I was previously compiling icu4x with 8 gigabytes of RAM. Right now, I observe RAM usage of up to 11 gigabytes when using baked data.

sffc commented 1 month ago

Building the latest icu_experimental data:

    Command being timed: "cargo build"
    User time (seconds): 30.57
    System time (seconds): 8.20
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.41
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16492072
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 383
    Minor (reclaiming a frame) page faults: 4667399
    Voluntary context switches: 1181
    Involuntary context switches: 529
    Swaps: 0
    File system inputs: 83232
    File system outputs: 752960
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
sffc commented 1 month ago

Here are the figures for impl_units_display_name_v1_marker!(Baked); by itself:

    Command being timed: "cargo build"
    User time (seconds): 18.72
    System time (seconds): 5.43
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.14
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16230900
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 225
    Minor (reclaiming a frame) page faults: 4573824
    Voluntary context switches: 693
    Involuntary context switches: 726
    Swaps: 0
    File system inputs: 26344
    File system outputs: 322536
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

That's a big share, so I'll focus on reproducing just this in isolation.

sffc commented 1 month ago

Here's with changing the paths to be imported instead of absolute:

    Command being timed: "cargo build"
    User time (seconds): 17.70
    System time (seconds): 5.07
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.60
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16218176
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 4571559
    Voluntary context switches: 434
    Involuntary context switches: 158
    Swaps: 0
    File system inputs: 0
    File system outputs: 403136
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And here's with using a const constructor:

    Command being timed: "cargo build"
    User time (seconds): 7.46
    System time (seconds): 1.50
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.88
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3983060
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1208393
    Voluntary context switches: 352
    Involuntary context switches: 59
    Swaps: 0
    File system inputs: 0
    File system outputs: 216696
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And a helper function:

    Command being timed: "cargo build"
    User time (seconds): 7.71
    System time (seconds): 1.59
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.21
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3947592
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1198126
    Voluntary context switches: 357
    Involuntary context switches: 269
    Swaps: 0
    File system inputs: 0
    File system outputs: 194544
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

include_bytes! for the trie data:

    Command being timed: "cargo build"
    User time (seconds): 7.26
    System time (seconds): 1.54
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3961872
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1202201
    Voluntary context switches: 390
    Involuntary context switches: 65
    Swaps: 0
    File system inputs: 0
    File system outputs: 193984
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
sffc commented 1 month ago

In tabular form (each row builds on the one above it):

| Scenario | File Size | User Time | Clock Time | Maximum Resident Set Size |
|---|---|---|---|---|
| Full Experimental Data | N/A | 30.57 | 38.41 | 16492072 |
| Unit Names Only | 13070690 | 18.72 | 24.14 | 16230900 |
| With Imports | 11301364 | 17.70 | 22.60 | 16218176 |
| With Const Constructor | 5038254 | 7.46 | 08.88 | 3983060 |
| With Helper Wrapping Constructor | 3923658 | 7.63 | 09.00 | 3953796 |
| With Helper Directly Building | 3923875 | 7.71 | 09.21 | 3947592 |
| With include_bytes! trie | 3042840 | 7.26 | 08.73 | 3961872 |

So there seems to be a pretty strong correlation between file size, compile time, and memory usage, except in the last few rows: the wrapper functions reduce file size but don't seem to impact the other metrics. include_bytes! doesn't seem to have a big impact either, at least not for the big trie byte string.

Manishearth commented 1 month ago

Do we know which compiler pass is actually slow? (-Z time-passes, I believe.)

Would be useful to compare that output to that of a normal utils crate and see where the big differences are.