unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Baked data is big, and compiles slowly, for finely sliced data markers #5230

Open sffc opened 4 months ago

sffc commented 4 months ago

icu_datetime compile times have regressed a lot since I added neo datetime data, and https://github.com/unicode-org/icu4x/pull/5221 appears to be choking in CI.

The finely sliced data markers (small data structs designed to work with many data marker attributes) give compelling data sizes and stack sizes in Postcard (https://github.com/unicode-org/icu4x/pull/4818, https://github.com/unicode-org/icu4x/pull/4779). However, in Baked, they significantly increase file size, and the numbers for data size are also not as compelling because baked data includes a lot more pointers (for example, at least 24 bytes for a ZeroVec) which are duplicated for each and every instance of the data struct.
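
To make the pointer overhead concrete, here is a minimal sketch. It uses std's `Cow<'static, [u8]>` as a stand-in for an owned-or-borrowed container; this is an assumption for illustration only, and the real ZeroVec layout is different. `FatEntry`, `SlimEntry`, and `overhead_bytes` are hypothetical names. The point is only that the container's own footprint is paid once per baked instance, regardless of how small the payload is.

```rust
use std::borrow::Cow;
use std::mem::size_of;

/// Hypothetical finely sliced entry: a tiny header plus one
/// owned-or-borrowed container (stand-in for ZeroVec / VarZeroVec).
pub struct FatEntry {
    pub index_info: u8,
    pub patterns: Cow<'static, [u8]>,
}

/// The same entry holding a plain borrowed slice instead.
pub struct SlimEntry {
    pub index_info: u8,
    pub patterns: &'static [u8],
}

/// Struct footprint (excluding the payload bytes themselves) for `n`
/// baked instances, returned as `(fat, slim)`.
pub fn overhead_bytes(n: usize) -> (usize, usize) {
    (n * size_of::<FatEntry>(), n * size_of::<SlimEntry>())
}
```

With thousands of instances per marker, the per-instance difference adds up even though each individual payload may be only a few bytes.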

Example data struct that is used in a finely sliced data marker:

pub struct PackedSkeletonDataV1<'data> {
    pub index_info: SkeletonDataIndex,
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub patterns: VarZeroVec<'data, PatternULE>,
}

Some ideas:

  1. Instead of storing many static instances of PackedSkeletonDataV1<'static>, we could instead store many static instances of (SkeletonDataIndex, &[u8]), and build an instance of PackedSkeletonDataV1<'static> at runtime. This is "free", and it should significantly reduce file size, but it causes us to use a Yoke code path.
  2. Make the struct derive VarULE and store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>, and build an instance of PackedSkeletonDataV1<'static> at runtime. This should result in the smallest file size and data size, in line with postcard sizes, but has a bit more runtime cost since we need to do a VZV lookup. However, it's only one lookup, and only when the locale was found, so I don't think this cost is worth going out of our way to avoid.
  3. Construct static instances via pub fn PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8]), reducing file size and therefore probably compile times without changing any runtime characteristics. See https://github.com/unicode-org/icu4x/issues/2452.
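
As a rough illustration of idea 2, the sketch below packs all entries into one byte buffer plus an index of end offsets, and builds a borrowed view at lookup time. This is a hand-rolled analogue, not the actual VarZeroVec encoding; `PackedStore`, `EntryRef`, and the one-byte `index_info` header are assumptions made up for the example.

```rust
/// Hypothetical packed store: every entry is concatenated into `bytes`,
/// and `ends[i]` is the offset at which entry `i` stops.
pub struct PackedStore {
    pub ends: &'static [u32],
    pub bytes: &'static [u8],
}

/// Hypothetical borrowed view of one entry, built at runtime.
pub struct EntryRef<'a> {
    pub index_info: u8,
    pub patterns: &'a [u8],
}

impl PackedStore {
    /// One bounds-checked lookup; returns `None` for a bad index
    /// or a malformed buffer.
    pub fn get(&self, i: usize) -> Option<EntryRef<'static>> {
        let end = *self.ends.get(i)? as usize;
        let start = if i == 0 { 0 } else { self.ends[i - 1] as usize };
        // The first byte of each entry is the header, the rest is the payload.
        let (&index_info, patterns) = self.bytes.get(start..end)?.split_first()?;
        Some(EntryRef { index_info, patterns })
    }
}

/// The whole data set is two statics, however many entries there are.
pub static STORE: PackedStore = PackedStore {
    ends: &[3, 4, 8],
    bytes: &[7, b'a', b'b', 9, 1, b'x', b'y', b'z'],
};
```

The source file then contains two literals rather than one struct expression per entry, and the runtime cost is the single offset lookup described in the list above.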

@robertbastian @Manishearth @younies

Manishearth commented 4 months ago

2 sounds compelling if we can make it work cleanly in our baking infra

Manishearth commented 4 months ago

potentially have a flag for data keys that marks them as "VZV packable"

sffc commented 4 months ago

Discussed this briefly with @robertbastian. Some points:

To illustrate that last point:

// Data struct type
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: VarZeroVec<'data, str>,
    pub b: VarZeroVec<'data, str>,
}

// Borrowed type
#[derive(Copy, Clone)]
pub(crate) struct ThingV1Borrowed<'data> {
    pub a: &'data VarZeroSlice<str>,
    pub b: &'data VarZeroSlice<str>,
}

// Example top-level owned type
pub struct ThingFormatter {
    payload: DataPayload<ThingV1Marker>
}

// Example top-level borrowed type
pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>
}

// To get from one to the other
impl ThingFormatter {
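    pub fn as_borrowed(&self) -> ThingFormatterBorrowed<'_> {
        // Wrap the borrowed payload in the borrowed formatter type.
        ThingFormatterBorrowed {
            payload: self.payload.get().as_borrowed(),
        }
    }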
}

In the above example, ThingV1Borrowed can always be constructed from borrowed data, so compiled_data constructors that build the borrowed type directly can still work fine.
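
A self-contained variant of the same pattern, as a sketch only: `Cow<'data, str>` and `&'data str` stand in for `VarZeroVec` and `VarZeroSlice`, and `BAKED_EN`, `new_baked`, and `greeting` are hypothetical names that do not exist in ICU4X.

```rust
use std::borrow::Cow;

/// Data struct type (stand-in: `Cow` instead of `VarZeroVec`).
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: Cow<'data, str>,
    pub b: Cow<'data, str>,
}

/// Fully borrowed counterpart: only references at the leaves.
#[derive(Copy, Clone)]
pub struct ThingV1Borrowed<'data> {
    pub a: &'data str,
    pub b: &'data str,
}

impl<'data> ThingV1<'data> {
    /// Works for owned and borrowed data alike.
    pub fn as_borrowed(&self) -> ThingV1Borrowed<'_> {
        ThingV1Borrowed { a: &self.a, b: &self.b }
    }
}

/// Baked data can store the borrowed form directly.
pub const BAKED_EN: ThingV1Borrowed<'static> = ThingV1Borrowed { a: "hello", b: "world" };

pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>,
}

impl ThingFormatterBorrowed<'static> {
    /// Hypothetical compiled-data constructor: no owned payload is involved.
    pub const fn new_baked() -> Self {
        Self { payload: BAKED_EN }
    }
}

impl<'data> ThingFormatterBorrowed<'data> {
    pub fn greeting(&self) -> String {
        format!("{} {}", self.payload.a, self.payload.b)
    }
}
```

The borrowed formatter never touches the owned data struct, so the baked constant does not need to carry the owned-or-borrowed container at all.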

robertbastian commented 4 months ago

  1. (SkeletonDataIndex, &[u8]) is (SkeletonDataIndex, &VarZeroSlice<'data, PatternULE>). The generic way to do this would be to have a fully borrowed version of each data struct, instead of always having ZeroVec and Cow at the leaves, which have a cost.
  2. I don't see how this would be done. We have a DataPayload<ExportMarker>, and we can call bake() on the data struct. Then what?
  3. Isn't this the same as (1)?

sffc commented 4 months ago

You're right about &[u8] and &VarZeroSlice being the same. I'll switch to using &VarZeroSlice.

Maybe 1, 2, 3 are better illustrated with examples:

// Option 0, Current (not exactly, but equivalent):
payloads: [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

// Option 1:
payloads: [
    (SkeletonDataIndex::from_raw(...), &'static VarZeroSlice::from_bytes_unchecked(...)),
    // ...
]
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 2:
payloads: VarZeroSlice::from_bytes_unchecked(
    // entries are the VarULE repr of PackedSkeletonDataV1
)
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 3:
impl PackedSkeletonDataV1 {
    pub const unsafe fn from_parts(raw_index_info, raw_patterns) -> Self {
        Self {
            index_info: SkeletonDataIndex::from_raw(raw_index_info),
            patterns: VarZeroVec::from_bytes_unchecked(raw_patterns),
        }
    }
}
payloads: [
    PackedSkeletonDataV1::from_parts(..., ...)
]
// identical runtime characteristics to the current implementation

sffc commented 4 months ago

~Actually, in most constructors we are already using ZeroFrom, so I'm not convinced that option 1 is actually a regression from the status quo. I think it's basically equivalent, we just store &'static VarZeroSlice instead of VarZeroVec<'static> which is 1-2 words smaller.~

robertbastian commented 4 months ago

The baked provider used to use ZeroFrom, but it no longer does since we added DataPayloadInner::StaticRef, a change that was detectable in benchmarks.

// correction: currently we have a slice of structs, not an array of struct refs 
payloads: &'static [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

Option 3 is equivalent to option 1, because payloads is stored in a const. from_parts will be evaluated at compile time, and the data struct stored in the const. If we instead stored the borrowed version of the data struct in the const, we're back at option 1.
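
A minimal sketch of the const-evaluation point, with a hypothetical `Packed` type standing in for the data struct: the constructor calls below are evaluated at compile time, so what ends up in the const is the finished struct, exactly as if a struct literal had been written.

```rust
/// Hypothetical stand-in for the data struct.
pub struct Packed {
    pub index_info: u8,
    pub patterns: &'static [u8],
}

impl Packed {
    /// Const constructor: each baked entry is one short call expression
    /// instead of a struct literal with nested constructor calls.
    pub const fn from_parts(raw_index_info: u8, raw_patterns: &'static [u8]) -> Self {
        Self {
            index_info: raw_index_info,
            patterns: raw_patterns,
        }
    }
}

/// Evaluated at compile time; the slice holds fully built structs.
pub const PAYLOADS: &[Packed] = &[
    Packed::from_parts(1, b"ab"),
    Packed::from_parts(2, b"cde"),
];

/// Reading a field in another const shows the value exists at compile time.
pub const FIRST_INDEX_INFO: u8 = PAYLOADS[0].index_info;
```

Only the source text changes; the stored value and the runtime behavior stay the same.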

sffc commented 4 months ago

Yes, I briefly forgot about DataPayloadInner::StaticRef. So options 0 and 1 are different, as claimed in the OP.

Option 3 is equivalent to the current solution, option 0 (not option 1), except for file size being slightly smaller.

sffc commented 4 months ago

A more radical solution (bigger change but maybe better outcome) would be to add the borrowed type to DynamicDataMarker as an associated type:

pub trait DynamicDataMarker {
    type Borrowed<'a>;
    // not sure if this is the right syntax but you get the idea:
    type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;
}

robertbastian commented 4 months ago

This^ is the only implementation path I see for option 1, what alternative were you thinking of?
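
The associated-type idea can be sketched with generic associated types. This is a standalone analogue under assumptions, not the ICU4X trait: `MarkerSketch`, `ThingMarker`, and `load` are hypothetical, std's `From` stands in for `ZeroFrom`, and a second lifetime-generic associated type stands in for the `Yokeable` bound.

```rust
use std::borrow::Cow;

/// Hypothetical marker trait with a fully borrowed form and a data struct
/// that can be built from it without copying.
pub trait MarkerSketch {
    type Borrowed<'a>: Copy;
    type DataStruct<'a>: From<Self::Borrowed<'a>>;
}

#[derive(Copy, Clone)]
pub struct ThingBorrowed<'a> {
    pub text: &'a str,
}

pub struct Thing<'a> {
    pub text: Cow<'a, str>,
}

impl<'a> From<ThingBorrowed<'a>> for Thing<'a> {
    fn from(b: ThingBorrowed<'a>) -> Self {
        // Borrow only; nothing is allocated.
        Thing { text: Cow::Borrowed(b.text) }
    }
}

pub struct ThingMarker;

impl MarkerSketch for ThingMarker {
    type Borrowed<'a> = ThingBorrowed<'a>;
    type DataStruct<'a> = Thing<'a>;
}

/// What a baked provider could do: store `Borrowed<'static>` and convert
/// only when the full data struct is requested.
pub fn load<M: MarkerSketch>(baked: M::Borrowed<'static>) -> M::DataStruct<'static> {
    From::from(baked)
}
```

Code that only needs the borrowed form can skip `load` entirely and use the stored value as-is.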

sffc commented 4 months ago

The other way to implement option 1 would be to have a databake derive attribute that defines how to get from the static representation to the struct, which we could do as part of #2452. baked_exporter uses it when available, and otherwise serializes the struct directly.

Manishearth commented 4 months ago

One thing databake could potentially do is have a way for the CrateEnv to collect auxiliary codegen, which would work by:

Something like:


struct PackedDataSkeletonV1PatternsAux {
    patterns: Vec<_>,
}

fn bake(&self, env: &CrateEnv) -> TokenStream2 {
    // This uses an anymap or something
    let map = env.get_or_insert::<PackedDataSkeletonV1PatternsAux>(
        PackedDataSkeletonV1PatternsAux::default(),
        // The flush function. This is just an example.
        |aux| { quote!(const ALL_THE_PATTERNS = #patterns) },
    );
    let start = map.patterns.len();
    map.patterns.extend(self.patterns);
    let end = map.patterns.len();
    quote!(PackedSkeletonDataV1 {
        // ...
        patterns: ALL_THE_PATTERNS[#start..#end],
    })
}

This still needs some way to make the types work, which the example above doesn't attempt to address, but it could help for tricks like "store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>".
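
A dependency-free sketch of the collect-then-flush shape, emitting source text as plain strings rather than token streams. `PatternPool`, `bake_entry`, and the `(index_info, start, end)` entry format are hypothetical and are not part of databake. Each entry records offsets into the shared constant instead of slicing it, so the sketch makes no claim about what the generated code can evaluate in a const context.

```rust
use std::ops::Range;

/// Hypothetical auxiliary pool, filled while baking many data structs.
#[derive(Default)]
pub struct PatternPool {
    bytes: Vec<u8>,
}

impl PatternPool {
    /// Appends one struct's pattern bytes and reports where they landed.
    pub fn push(&mut self, patterns: &[u8]) -> Range<usize> {
        let start = self.bytes.len();
        self.bytes.extend_from_slice(patterns);
        start..self.bytes.len()
    }

    /// The "flush" step: source text for the one shared constant.
    pub fn flush(&self) -> String {
        format!("const ALL_THE_PATTERNS: &[u8] = &{:?};", self.bytes)
    }
}

/// Source text for one entry, which stores offsets into the shared constant.
pub fn bake_entry(pool: &mut PatternPool, index_info: u8, patterns: &[u8]) -> String {
    let range = pool.push(patterns);
    format!("({index_info}, {}, {})", range.start, range.end)
}
```

Each baked entry shrinks to three integers, and the bulk of the bytes moves into a single literal that is emitted once at flush time.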

sffc commented 4 months ago

This is not 2.0 blocking unless we implement a breaking solution for #5187

sffc commented 4 months ago

(additional discussion not recorded)

Current thinking:

LGTM: @sffc @robertbastian

Manishearth commented 4 months ago

LGTM as well

sffc commented 3 months ago

To add some urgency to this issue, @kartva says:

Compiling icu_experimental on main causes rustc to reliably crash on my laptop due to out-of-memory errors. Are there servers or codespaces that people working on icu4x can use?

(icu_experimental and icu_datetime are the two crates with the most finely sliced baked data)

kartva commented 3 months ago

I was previously compiling icu4x with 8 gigabytes of RAM. Right now, I observe RAM usage of up to 11 gigabytes when using baked data.

sffc commented 3 months ago

Building the latest icu_experimental data:

    Command being timed: "cargo build"
    User time (seconds): 30.57
    System time (seconds): 8.20
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.41
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16492072
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 383
    Minor (reclaiming a frame) page faults: 4667399
    Voluntary context switches: 1181
    Involuntary context switches: 529
    Swaps: 0
    File system inputs: 83232
    File system outputs: 752960
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

sffc commented 3 months ago

Here are the figures for impl_units_display_name_v1_marker!(Baked); by itself:

    Command being timed: "cargo build"
    User time (seconds): 18.72
    System time (seconds): 5.43
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.14
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16230900
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 225
    Minor (reclaiming a frame) page faults: 4573824
    Voluntary context switches: 693
    Involuntary context switches: 726
    Swaps: 0
    File system inputs: 26344
    File system outputs: 322536
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

That's a big share so I'll focus on reproducing just this in isolation.

sffc commented 3 months ago

Here's with changing the paths to be imported instead of absolute:

    Command being timed: "cargo build"
    User time (seconds): 17.70
    System time (seconds): 5.07
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.60
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16218176
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 4571559
    Voluntary context switches: 434
    Involuntary context switches: 158
    Swaps: 0
    File system inputs: 0
    File system outputs: 403136
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And here's with using a const constructor:

    Command being timed: "cargo build"
    User time (seconds): 7.46
    System time (seconds): 1.50
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.88
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3983060
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1208393
    Voluntary context switches: 352
    Involuntary context switches: 59
    Swaps: 0
    File system inputs: 0
    File system outputs: 216696
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And a helper function:

    Command being timed: "cargo build"
    User time (seconds): 7.71
    System time (seconds): 1.59
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.21
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3947592
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1198126
    Voluntary context switches: 357
    Involuntary context switches: 269
    Swaps: 0
    File system inputs: 0
    File system outputs: 194544
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

include_bytes! for the trie data:

    Command being timed: "cargo build"
    User time (seconds): 7.26
    System time (seconds): 1.54
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3961872
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1202201
    Voluntary context switches: 390
    Involuntary context switches: 65
    Swaps: 0
    File system inputs: 0
    File system outputs: 193984
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

sffc commented 3 months ago

In tabular form (each row builds on the one above it):

| Scenario | File Size | User Time (s) | Clock Time (s) | Maximum Resident Set Size (kbytes) |
|---|---|---|---|---|
| Full Experimental Data | N/A | 30.57 | 38.41 | 16492072 |
| Unit Names Only | 13070690 | 18.72 | 24.14 | 16230900 |
| With Imports | 11301364 | 17.70 | 22.60 | 16218176 |
| With Const Constructor | 5038254 | 7.46 | 8.88 | 3983060 |
| With Helper Wrapping Constructor | 3923658 | 7.63 | 9.00 | 3953796 |
| With Helper Directly Building | 3923875 | 7.71 | 9.21 | 3947592 |
| With include_bytes! trie | 3042840 | 7.26 | 8.73 | 3961872 |

So there seems to be a pretty strong correlation between file size, compile time, and memory usage, except the last few rows with wrapper functions which reduce file size but don't seem to impact the other metrics. include_bytes! doesn't seem to have a big impact, at least not for the big trie byte string.
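
For a sense of scale, the step from "With Imports" to "With Const Constructor" works out to roughly 55% less file size, 58% less user time, and 75% less peak memory. The arithmetic, using only the figures from the table (the helper names are made up for this sketch):

```rust
/// Percent reduction going from `before` to `after`.
pub fn reduction_pct(before: f64, after: f64) -> f64 {
    (1.0 - after / before) * 100.0
}

/// (metric, "With Imports" value, "With Const Constructor" value),
/// copied from the table.
pub const ROWS: [(&str, f64, f64); 3] = [
    ("file size", 11_301_364.0, 5_038_254.0),
    ("user time (s)", 17.70, 7.46),
    ("max RSS (kbytes)", 16_218_176.0, 3_983_060.0),
];

/// One formatted line per metric, rounded to whole percent.
pub fn report() -> Vec<String> {
    ROWS.iter()
        .map(|(name, before, after)| {
            format!("{name}: -{:.0}%", reduction_pct(*before, *after))
        })
        .collect()
}
```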

Manishearth commented 3 months ago

Do we know which compiler pass is actually slow? (-Z time-passes, I believe.)

Would be useful to compare that output to that of a normal utils crate and see where the big differences are.

sffc commented 2 months ago

Revisiting this since we're getting exit code 143 errors again in CI.

Here's where we're currently at on main:

$ cargo clean; /usr/bin/time -v cargo +nightly build -p icu_experimental --all-features
...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 00s
    Command being timed: "cargo +nightly build -p icu_experimental --all-features"
    User time (seconds): 63.19
    System time (seconds): 18.96
    Percent of CPU this job got: 135%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.51
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 15888616
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 395
    Minor (reclaiming a frame) page faults: 6703629
    Voluntary context switches: 27185
    Involuntary context switches: 2375
    Swaps: 0
    File system inputs: 33184
    File system outputs: 2091712
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

I ran it with -Z time-passes:

   Compiling icu_experimental v0.1.0 (/usr/local/google/home/sffc/projects/icu4x/components/experimental)
time:   0.000; rss:   47MB ->   48MB (   +1MB)  parse_crate
time:   0.000; rss:   48MB ->   48MB (   +0MB)  incr_comp_prepare_session_directory
time:   0.000; rss:   48MB ->   49MB (   +1MB)  setup_global_ctxt
time:   0.000; rss:   51MB ->   51MB (   +0MB)  crate_injection
time:   1.013; rss:   51MB -> 1182MB (+1130MB)  expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)  macro_expand_crate
time:   0.000; rss: 1182MB -> 1182MB (   +0MB)  maybe_building_test_harness
time:   0.028; rss: 1182MB -> 1182MB (   +0MB)  AST_validation
time:   0.001; rss: 1182MB -> 1182MB (   +0MB)  finalize_imports
time:   0.004; rss: 1182MB -> 1182MB (   +0MB)  finalize_macro_resolutions
time:   0.479; rss: 1182MB -> 1353MB ( +170MB)  late_resolve_crate
time:   0.023; rss: 1353MB -> 1353MB (   +0MB)  resolve_check_unused
time:   0.044; rss: 1353MB -> 1353MB (   +0MB)  resolve_postprocess
time:   0.551; rss: 1182MB -> 1353MB ( +171MB)  resolve_crate
time:   0.025; rss: 1273MB -> 1273MB (   +0MB)  write_dep_info
time:   0.020; rss: 1273MB -> 1273MB (   +0MB)  complete_gated_feature_checking
time:   0.072; rss: 1612MB -> 1456MB ( -156MB)  drop_ast
time:   1.630; rss: 1273MB -> 1307MB (  +34MB)  looking_for_derive_registrar
time:   1.898; rss: 1273MB -> 1308MB (  +35MB)  misc_checking_1
time:   0.274; rss: 1309MB -> 1331MB (  +22MB)  coherence_checking
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)  type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)  MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)  MIR_effect_checking
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)  module_lints
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)  lint_checking
time:   0.253; rss: 2283MB -> 2283MB (   +0MB)  privacy_checking_modules
time:   0.061; rss: 2283MB -> 2277MB (   -6MB)  check_lint_expectations
time:   0.526; rss: 2270MB -> 2277MB (   +8MB)  misc_checking_3
time:   0.191; rss: 2301MB -> 2312MB (  +10MB)  monomorphization_collector_graph_walk
time:   0.033; rss: 2312MB -> 2312MB (   +1MB)  partition_and_assert_distinct_symbols
time:   0.587; rss: 2277MB -> 2294MB (  +16MB)  generate_crate_metadata
time:   1.064; rss: 2296MB -> 2320MB (  +24MB)  codegen_to_LLVM_IR
time:   3.448; rss: 2310MB -> 2320MB (  +11MB)  LLVM_passes
time:   3.540; rss: 2294MB -> 2319MB (  +26MB)  codegen_crate
time:   0.000; rss: 2319MB -> 2319MB (   +0MB)  check_dirty_clean
time:   0.358; rss: 2319MB -> 2323MB (   +4MB)  encode_query_results
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)  incr_comp_serialize_result_cache
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)  incr_comp_persist_result_cache
time:   0.392; rss: 2319MB -> 2318MB (   -1MB)  serialize_dep_graph
time:   0.101; rss: 2318MB -> 1314MB (-1004MB)  free_global_ctxt
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)  copy_all_cgu_workproducts_to_incr_comp_cache_dir
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)  finish_ongoing_codegen
time:   0.151; rss: 1288MB -> 1334MB (  +46MB)  link_rlib
time:   0.167; rss: 1288MB -> 1334MB (  +46MB)  link_binary
time:   0.170; rss: 1288MB -> 1279MB (   -9MB)  link_crate
time:   0.179; rss: 1314MB -> 1279MB (  -35MB)  link
time:  54.585; rss:   32MB ->  224MB ( +192MB)  total

Relative to what other crates do, the following steps are the biggest:

Of these, MIR_borrow_checking is by far the slowest relative to other crates; expand_crate and macro_expand_crate add the most to the memory usage.

Reminder from my previous exploration: adding a const constructor instead of doing raw struct construction did reduce compile times and peak memory usage (this was before I was measuring the rustc phase breakdown). This is consistent with @Manishearth's interpretation of the above figures.

sffc commented 2 months ago

I made a reproducible, standalone case here:

https://github.com/sffc/icu4x_compile_sample/tree/standalone

You can also see some of my previous work on making changes in the July 30 commits on the main branch:

https://github.com/sffc/icu4x_compile_sample/commits/main/

sffc commented 2 months ago

I added const constructors where they were most impactful in #5541.

I don't consider this a long-term solution because:

  1. Custom bake impls are unsafe and should generally be avoided
  2. The compile times are cut approximately in half, but they are still too long

But, for the short term, it should make CI less flaky.