Baked data is big, and compiles slowly, for finely sliced data markers

sffc commented 4 months ago

icu_datetime compile times have regressed a lot since I added neo datetime data, and https://github.com/unicode-org/icu4x/pull/5221 appears to be choking in CI.

The finely sliced data markers (small data structs designed to work with many data marker attributes) give compelling data sizes and stack sizes in Postcard (https://github.com/unicode-org/icu4x/pull/4818, https://github.com/unicode-org/icu4x/pull/4779). However, in Baked, they significantly increase file size, and the numbers for data size are also not as compelling because baked data includes a lot more pointers (for example, at least 24 bytes for a ZeroVec) which are duplicated for each and every instance of the data struct.

Example data struct that is used in a finely sliced data marker:

pub struct PackedSkeletonDataV1<'data> {
    pub index_info: SkeletonDataIndex,
    #[cfg_attr(feature = "serde", serde(borrow))]
    pub patterns: VarZeroVec<'data, PatternULE>,
}

Some ideas:

Instead of storing many static instances of PackedSkeletonDataV1<'static>, we could instead store many static instances of (SkeletonDataIndex, &[u8]), and build an instance of PackedSkeletonDataV1<'static> at runtime. This is "free", and it should significantly reduce file size, but it causes us to use a Yoke code path.
Make the struct derive VarULE and store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>, and build an instance of PackedSkeletonDataV1<'static> at runtime. This should result in the smallest file size and data size, in line with postcard sizes, but is a bit more of a runtime cost since we need to do a VZV lookup. However, it's only one lookup and only when the locale was found, so I don't think we should try to avoid this cost for the sake of avoiding this cost.
Construct static instances via pub fn PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8]), reducing file size and therefore probably compile times without changing any runtime characteristics. See https://github.com/unicode-org/icu4x/issues/2452.

@robertbastian @Manishearth @younies

Manishearth commented 4 months ago

2 sounds compelling if we can make it work cleanly in our baking infra

Manishearth commented 4 months ago

potentially have a flag for data keys that marks them as "VZV packable"

sffc commented 4 months ago

Discussed this briefly with @robertbastian. Some points:

Basically the proposal is to store data as immutable slices instead of references to mutable structs
ZeroFrom is costly for nested structs. How costly is it for shallow ones?
Fully borrowed constructors like those in icu_properties are still possible but are a little bit more intricate because we end up with 3 states of the data struct: yoked, borrowed-from-yoke, and fully-borrowed, where the fully-borrowed type is separate from the data struct type and is the one used internally in baked.

To illustrate that last point:

// Data struct type
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: VarZeroVec<'data, str>,
    pub b: VarZeroVec<'data, str>,
}

// Borrowed type
#[derive(Copy, Clone)]
pub(crate) struct ThingV1Borrowed<'data> {
    pub a: &'data VarZeroSlice<str>,
    pub b: &'data VarZeroSlice<str>,
}

// Example top-level owned type
pub struct ThingFormatter {
    payload: DataPayload<ThingV1Marker>
}

// Example top-level borrowed type
pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>
}

// To get from one to the other
impl ThingFormatter {
    pub fn as_borrowed(&self) -> ThingFormatterBorrowed {
        self.payload.get().as_borrowed()
    }
}

In the above example, ThingV1Borrowed can always be constructed from borrowed data, so compiled_data constructors directly to the borrowed type can still work fine.

robertbastian commented 4 months ago

(SkeletonDataIndex, &[u8]) is (SkeletonDataIndex, &VarZeroSlice<'data, PatternULE>). The generic way to do this would be to have a fully borrowed version of each data struct, instead of always having ZeroVec and Cow at the leaves, which have a cost.
I don't see how this would be done. We have a DataPayload<ExportMarker>, and we can call bake() on the data struct. Then what?
Isn't this the same as (1)?

sffc commented 4 months ago

You're right about &[u8] and &VarZeroSlice being the same. I'll switch to using &VarZeroSlice.

Maybe 1, 2, 3 are better illustrated with examples:

// Option 0, Current (not exactly, but equivalent):
payloads: [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

// Option 1:
payloads: [
    (SkeletonDataIndex::from_raw(...), &'static VarZeroSlice::from_bytes_unchecked(...)),
    // ...
]
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 2:
payloads: VarZeroSlice::from_bytes_unchecked(
    // entries are the VarULE repr of PackedSkeletonDataV1
)
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 3:
impl PackedSkeletonDataV1 {
    pub const unsafe fn from_parts(raw_index_info, raw_patterns) -> Self {
        Self {
            index_info: SkeletonDataIndex::from_raw(raw_index_info),
            patterns: VarZeroVec::from_bytes_unchecked(raw_patterns),
        }
    }
}
payloads: [
    PackedSkeletonDataV1::from_parts(..., ...)
]
// identical runtime characteristics to the current implementation

sffc commented 4 months ago

~Actually, in most constructors we are already using ZeroFrom, so I'm not convinced that option 1 is actually a regression from the status quo. I think it's basically equivalent, we just store &'static VarZeroSlice instead of VarZeroVec<'static> which is 1-2 words smaller.~

robertbastian commented 4 months ago

The baked provider used to use ZeroFrom, but it's not anymore since we added DataPayloadInner::StaticRef, which was a change you can detect in benchmarks.

// correction: currently we have a slice of structs, not an array of struct refs 
payloads: &'static [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...)
    },
    // ...
]

Option 3 is equivalent to option 1, because payloads is stored in a const. from_parts will be evaluated at compile time, and the data struct stored in the const. If we instead stored the borrowed version of the data struct in the const, we're back at option 1.

sffc commented 4 months ago

Yes, I briefly forgot about DataPayloadInner::StaticRef. So options 0 and 1 are different, as claimed in the OP.

Option 3 is equivalent to the current solution, option 0 (not option 1), except for file size being slightly smaller.

sffc commented 4 months ago

A more radical solution (bigger change but maybe better outcome) would be to add it to DynamicDataMarker

pub trait DynamicDataMarker {
    type Borrowed<'a>;
    // not sure if this is the right syntax but you get the idea:
    type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;
}

robertbastian commented 4 months ago

This^ is the only implementation path I see for option 1, what alternative were you thinking of?

sffc commented 4 months ago

The other way to implement option 1 would be to have a databake derive attribute that defines how to get from the static representation to the struct, which we could do as part of #2452. baked_exporter uses it when available, and otherwise serializes the struct directly.

Manishearth commented 4 months ago

One thing databake could potentially do is have a way for the CrateEnv to collect auxiliary codegen, which would work by:

CrateEnv gets a map you can append to, that has some form of key and in the value it has a list of things and a "flush" function that produces additional const data
individual databake impls when they need to append auxiliary not-per-value data start appending to the appropriate entry on this map, and then usign a constnat that refeerences the data that shall be generated.

Something like:


struct PackedDataSkeletonV1PatternsAux {
   patterns: Vec<_>,
}

fn bake(&self, env: &CrateEnv) -> TokenStream2 {
   // This uses an anymap or something
   let map = env.get_or_insert::<PackedDataSkeletonV1PatternsAux>(PackedDataSkeletonV1PatternsAux::default(),
         // The flush function. This is just an example.
        |aux| {quote!(  const ALL_THE_PATTERNS = #patterns  )});
   let start = map.patterns.len();
   map.patterns.extend(self.patterns);
  let end = map.patterns.len();
   quote!(PackedSkeletonDataV1 {
     // ...
     patterns: ALL_THE_PATTERNS[#start..#end],
   })
}

This still needs some way to make the types work, the example above doesn't attempt to address that, but this could help for tricks like "store all of the data in a big VarZeroVec<PackedSkeletonDataV1ULE>,"

sffc commented 4 months ago

This is not 2.0 blocking unless we implement a breaking solution for #5187

sffc commented 4 months ago

@sffc - Broke in Rust nightly
@robertbastian - Not convinced current suggestions will improve compile times
@sffc - # 2 (build PackedSkeletonDataV1 at runtime) would produce the smallest code and probably the fastest compile times
@robertbastian - Moves the waiting from compile time to runtime. Want to avoid runtime intensive work
@sffc - Long compile times are a result of components/datetime/src/provider/mod.rs, which grabs all the baked data at once.
@robertbastian Could be inefficiencies in how we use macros
@sffc - Theory that a lot of the stress is on the lexer. If we make the AST thinner, that could reduce compile time. Option # 3 (construction via PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8])) reduces that scale
@robertbastian - Not sure if moving to const evaluation will be faster. Okay with trying, but I want to ask the Rust compiler team how we could best compile this data.
@sffc - Another alternative (illustrated in comments below options) is splitting up the AST to make it simpler to parse (ThingV1 example mixed with pub trait DynamicDataMarker {type Borrowed<'a>; type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;})
@sffc We can do this in a non-breaking way. Option 3 moves work from source code lexing to const evaluation. Options 1 and 2 can use DataPayload::from_owned instead of DataPayload::from_static_ref, which increases runtime cost, but that can be something we weigh and measure.

(additional discussion not recorded)

Current thinking:

Produce a standalone crate reproducing this, such as by copying in a lot of ICU4X skeleton data
Ask someone who knows more, such as a Rust compiler expert, what we can do to make the crate compile faster
Optional: Try implementing the options to measure the impact
Should not block 2.0 because:
1. This would be a lot of time and effort
2. There may be paths to change this in a non-breaking ways
3. Maybe the compiler will get better on its own

LGTM: @sffc @robertbastian

Manishearth commented 4 months ago

LGTM as well

sffc commented 3 months ago

To add some urgency to this issue, @kartva says:

Compiling icu_experimental on main causes rustc to reliably crash on my laptop due to out-of-memory errors. Are there servers or codespaces that people working on icu4x can use?

(icu_experimental and icu_datetime are the two crates with the most finely sliced baked data)

kartva commented 3 months ago

I was previously compiling icu4x with 8 gigabytes of RAM. Right now, I observe RAM usage of up to 11 gigabytes when using baked data.

sffc commented 3 months ago

Building the latest icu_experimental data:

    Command being timed: "cargo build"
    User time (seconds): 30.57
    System time (seconds): 8.20
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.41
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16492072
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 383
    Minor (reclaiming a frame) page faults: 4667399
    Voluntary context switches: 1181
    Involuntary context switches: 529
    Swaps: 0
    File system inputs: 83232
    File system outputs: 752960
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

sffc commented 3 months ago

Here are the figures for impl_units_display_name_v1_marker!(Baked); by itself:

    Command being timed: "cargo build"
    User time (seconds): 18.72
    System time (seconds): 5.43
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.14
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16230900
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 225
    Minor (reclaiming a frame) page faults: 4573824
    Voluntary context switches: 693
    Involuntary context switches: 726
    Swaps: 0
    File system inputs: 26344
    File system outputs: 322536
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

That's a big share so I'll focus on reproducing just this in isolation.

sffc commented 3 months ago

Here's with changing the paths to be imported instead of absolute:

    Command being timed: "cargo build"
    User time (seconds): 17.70
    System time (seconds): 5.07
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.60
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 16218176
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 4571559
    Voluntary context switches: 434
    Involuntary context switches: 158
    Swaps: 0
    File system inputs: 0
    File system outputs: 403136
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And here's with using a const constructor:

    Command being timed: "cargo build"
    User time (seconds): 7.46
    System time (seconds): 1.50
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.88
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3983060
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1208393
    Voluntary context switches: 352
    Involuntary context switches: 59
    Swaps: 0
    File system inputs: 0
    File system outputs: 216696
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And a helper function:

    Command being timed: "cargo build"
    User time (seconds): 7.71
    System time (seconds): 1.59
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.21
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3947592
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1198126
    Voluntary context switches: 357
    Involuntary context switches: 269
    Swaps: 0
    File system inputs: 0
    File system outputs: 194544
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

include_bytes! for the trie data:

    Command being timed: "cargo build"
    User time (seconds): 7.26
    System time (seconds): 1.54
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.73
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 3961872
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 1202201
    Voluntary context switches: 390
    Involuntary context switches: 65
    Swaps: 0
    File system inputs: 0
    File system outputs: 193984
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

sffc commented 3 months ago

In tabular form (each row builds on the one above it):

Scenario	File Size	User Time	Clock Time	Maximum Resident Set Size
Full Experimental Data	N/A	30.57	38.41	16492072
Unit Names Only	13070690	18.72	24.14	16230900
With Imports	11301364	17.70	22.60	16218176
With Const Constructor	5038254	7.46	08.88	3983060
With Helper Wrapping Constructor	3923658	7.63	09.00	3953796
With Helper Directly Building	3923875	7.71	09.21	3947592
With `include_bytes!` trie	3042840	7.26	08.73	3961872

So there seems to be a pretty strong correlation between file size, compile time, and memory usage, except the last few rows with wrapper functions which reduce file size but don't seem to impact the other metrics. include_bytes! doesn't seem to have a big impact, at least not for the big trie byte string.

Manishearth commented 3 months ago

Do we know which compiler pass is actually slow (-Z time-passes, I believe)

Would be useful to compare that output to that of a normal utils crate and see where the big differences are.

sffc commented 2 months ago

Revisiting this since we're getting 143 errors again in CI.

Here's where we're currently at on main:

$ cargo clean; /usr/bin/time -v cargo +nightly build -p icu_experimental --all-features
...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 00s
    Command being timed: "cargo +nightly build -p icu_experimental --all-features"
    User time (seconds): 63.19
    System time (seconds): 18.96
    Percent of CPU this job got: 135%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.51
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 15888616
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 395
    Minor (reclaiming a frame) page faults: 6703629
    Voluntary context switches: 27185
    Involuntary context switches: 2375
    Swaps: 0
    File system inputs: 33184
    File system outputs: 2091712
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

I ran it with -Z time-passes:

   Compiling icu_experimental v0.1.0 (/usr/local/google/home/sffc/projects/icu4x/components/experimental)
time:   0.000; rss:   47MB ->   48MB (   +1MB)  parse_crate
time:   0.000; rss:   48MB ->   48MB (   +0MB)  incr_comp_prepare_session_directory
time:   0.000; rss:   48MB ->   49MB (   +1MB)  setup_global_ctxt
time:   0.000; rss:   51MB ->   51MB (   +0MB)  crate_injection
time:   1.013; rss:   51MB -> 1182MB (+1130MB)  expand_crate
time:   1.013; rss:   51MB -> 1182MB (+1131MB)  macro_expand_crate
time:   0.000; rss: 1182MB -> 1182MB (   +0MB)  maybe_building_test_harness
time:   0.028; rss: 1182MB -> 1182MB (   +0MB)  AST_validation
time:   0.001; rss: 1182MB -> 1182MB (   +0MB)  finalize_imports
time:   0.004; rss: 1182MB -> 1182MB (   +0MB)  finalize_macro_resolutions
time:   0.479; rss: 1182MB -> 1353MB ( +170MB)  late_resolve_crate
time:   0.023; rss: 1353MB -> 1353MB (   +0MB)  resolve_check_unused
time:   0.044; rss: 1353MB -> 1353MB (   +0MB)  resolve_postprocess
time:   0.551; rss: 1182MB -> 1353MB ( +171MB)  resolve_crate
time:   0.025; rss: 1273MB -> 1273MB (   +0MB)  write_dep_info
time:   0.020; rss: 1273MB -> 1273MB (   +0MB)  complete_gated_feature_checking
time:   0.072; rss: 1612MB -> 1456MB ( -156MB)  drop_ast
time:   1.630; rss: 1273MB -> 1307MB (  +34MB)  looking_for_derive_registrar
time:   1.898; rss: 1273MB -> 1308MB (  +35MB)  misc_checking_1
time:   0.274; rss: 1309MB -> 1331MB (  +22MB)  coherence_checking
time:   6.609; rss: 1308MB -> 1437MB ( +128MB)  type_check_crate
time:  36.802; rss: 1437MB -> 2248MB ( +811MB)  MIR_borrow_checking
time:   2.214; rss: 2248MB -> 2270MB (  +22MB)  MIR_effect_checking
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)  module_lints
time:   0.130; rss: 2270MB -> 2283MB (  +14MB)  lint_checking
time:   0.253; rss: 2283MB -> 2283MB (   +0MB)  privacy_checking_modules
time:   0.061; rss: 2283MB -> 2277MB (   -6MB)  check_lint_expectations
time:   0.526; rss: 2270MB -> 2277MB (   +8MB)  misc_checking_3
time:   0.191; rss: 2301MB -> 2312MB (  +10MB)  monomorphization_collector_graph_walk
time:   0.033; rss: 2312MB -> 2312MB (   +1MB)  partition_and_assert_distinct_symbols
time:   0.587; rss: 2277MB -> 2294MB (  +16MB)  generate_crate_metadata
time:   1.064; rss: 2296MB -> 2320MB (  +24MB)  codegen_to_LLVM_IR
time:   3.448; rss: 2310MB -> 2320MB (  +11MB)  LLVM_passes
time:   3.540; rss: 2294MB -> 2319MB (  +26MB)  codegen_crate
time:   0.000; rss: 2319MB -> 2319MB (   +0MB)  check_dirty_clean
time:   0.358; rss: 2319MB -> 2323MB (   +4MB)  encode_query_results
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)  incr_comp_serialize_result_cache
time:   0.391; rss: 2319MB -> 2318MB (   -1MB)  incr_comp_persist_result_cache
time:   0.392; rss: 2319MB -> 2318MB (   -1MB)  serialize_dep_graph
time:   0.101; rss: 2318MB -> 1314MB (-1004MB)  free_global_ctxt
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)  copy_all_cgu_workproducts_to_incr_comp_cache_dir
time:   0.007; rss: 1314MB -> 1314MB (   +0MB)  finish_ongoing_codegen
time:   0.151; rss: 1288MB -> 1334MB (  +46MB)  link_rlib
time:   0.167; rss: 1288MB -> 1334MB (  +46MB)  link_binary
time:   0.170; rss: 1288MB -> 1279MB (   -9MB)  link_crate
time:   0.179; rss: 1314MB -> 1279MB (  -35MB)  link
time:  54.585; rss:   32MB ->  224MB ( +192MB)  total

Relative to what other crates do, the following steps are the biggest:

expand_crate
macro_expand_crate
late_resolve_crate
resolve_crate
MIR_borrow_checking

Of these, MIR_borrow_checking is by far the slowest relative to other crates; expand_crate and macro_expand_crate add the most to the memory usage.

Reminder from my previous exploration: adding a const constructor instead of doing raw struct construction did reduce compile times and peak memory usage (this was before I was measuring the rustc phase breakdown). This is consistent with @Manishearth's interpretation of the above figures.

sffc commented 2 months ago

I made a reproducible, standalone case here:

https://github.com/sffc/icu4x_compile_sample/tree/standalone

You can also see some of my previous work on making changes in the July 30 commits on the main branch:

https://github.com/sffc/icu4x_compile_sample/commits/main/

sffc commented 2 months ago

I added const constructors where they were most impactful in #5541.

I don't consider this a long-term solution because:

Custom bake impls are unsafe and should generally be avoided
The compile times are cut approximately in half, but they are still too big

But, for the short term, it should make CI less flaky.

unicode-org / icu4x

Baked data is big, and compiles slowly, for finely sliced data markers #5230