sffc opened this issue 4 months ago
Option 2 sounds compelling if we can make it work cleanly in our baking infra. We could potentially have a flag for data keys that marks them as "VZV packable".
Discussed this briefly with @robertbastian. Some points:
- Fully-borrowed designs like `icu_properties` are still possible, but they are a little bit more intricate because we end up with 3 states of the data struct: yoked, borrowed-from-yoke, and fully-borrowed, where the fully-borrowed type is separate from the data struct type and is the one used internally in baked.

To illustrate that last point:
```rust
// Data struct type
#[derive(Clone)]
pub struct ThingV1<'data> {
    pub a: VarZeroVec<'data, str>,
    pub b: VarZeroVec<'data, str>,
}

// Borrowed type
#[derive(Copy, Clone)]
pub(crate) struct ThingV1Borrowed<'data> {
    pub a: &'data VarZeroSlice<str>,
    pub b: &'data VarZeroSlice<str>,
}

// Example top-level owned type
pub struct ThingFormatter {
    payload: DataPayload<ThingV1Marker>,
}

// Example top-level borrowed type
pub struct ThingFormatterBorrowed<'data> {
    payload: ThingV1Borrowed<'data>,
}

// To get from one to the other
impl ThingFormatter {
    pub fn as_borrowed(&self) -> ThingFormatterBorrowed {
        ThingFormatterBorrowed {
            payload: self.payload.get().as_borrowed(),
        }
    }
}
```
In the above example, `ThingV1Borrowed` can always be constructed from borrowed data, so `compiled_data` constructors that go directly to the borrowed type can still work fine.
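For completeness, here is a sketch of the `as_borrowed` glue on the data struct that the snippet above assumes. The method itself is hypothetical, but `VarZeroVec::as_slice` is the real zerovec API that makes it cheap:

```rust
impl<'data> ThingV1<'data> {
    // Borrow the underlying slices regardless of whether the vectors are
    // owned or borrowed; this is a couple of pointer copies, no allocation.
    pub fn as_borrowed(&self) -> ThingV1Borrowed<'_> {
        ThingV1Borrowed {
            a: self.a.as_slice(),
            b: self.b.as_slice(),
        }
    }
}
```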
A more precise type for `(SkeletonDataIndex, &[u8])` is `(SkeletonDataIndex, &'data VarZeroSlice<PatternULE>)`. The generic way to do this would be to have a fully borrowed version of each data struct, instead of always having `ZeroVec` and `Cow` at the leaves, which have a cost.

In the exporter we have a `DataPayload<ExportMarker>`, and we can call `bake()` on the data struct. Then what?

You're right about `&[u8]` and `&VarZeroSlice` being the same. I'll switch to using `&VarZeroSlice`.
Maybe 1, 2, 3 are better illustrated with examples:
```rust
// Option 0, current (not exactly, but equivalent):
payloads: [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...),
    },
    // ...
]

// Option 1:
payloads: [
    // from_bytes_unchecked already yields a &'static VarZeroSlice here
    (SkeletonDataIndex::from_raw(...), VarZeroSlice::from_bytes_unchecked(...)),
    // ...
]
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 2:
payloads: VarZeroSlice::from_bytes_unchecked(
    // entries are the VarULE repr of PackedSkeletonDataV1
)
// at runtime, use ZeroFrom to get the PackedSkeletonDataV1,
// or put it directly into the borrowed struct

// Option 3:
impl PackedSkeletonDataV1 {
    // Safety: callers must pass bytes in the expected formats
    // (argument types elided in this sketch)
    pub const unsafe fn from_parts(raw_index_info, raw_patterns) -> Self {
        Self {
            index_info: SkeletonDataIndex::from_raw(raw_index_info),
            patterns: VarZeroVec::from_bytes_unchecked(raw_patterns),
        }
    }
}

payloads: [
    PackedSkeletonDataV1::from_parts(..., ...),
    // ...
]
// identical runtime characteristics to the current implementation
```
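The runtime side of option 2 could look something like this; a minimal sketch, assuming a hypothetical `PackedSkeletonDataV1ULE` VarULE type and a `ZeroFrom` impl between it and the data struct (neither exists today):

```rust
use zerofrom::ZeroFrom;
use zerovec::VarZeroSlice;

// `payloads` would be the baked VarZeroSlice from option 2.
fn load(
    payloads: &'static VarZeroSlice<PackedSkeletonDataV1ULE>,
    index: usize,
) -> Option<PackedSkeletonDataV1<'static>> {
    // one VZV lookup, only on the hit path (i.e., when the locale was found)
    let ule = payloads.get(index)?;
    // one ZeroFrom conversion to rebuild the data struct
    Some(PackedSkeletonDataV1::zero_from(ule))
}
```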
~~Actually, in most constructors we are already using ZeroFrom, so I'm not convinced that option 1 is actually a regression from the status quo. I think it's basically equivalent; we just store `&'static VarZeroSlice` instead of `VarZeroVec<'static>`, which is 1-2 words smaller.~~
The baked provider used to use `ZeroFrom`, but it doesn't anymore since we added `DataPayloadInner::StaticRef`, a change you can detect in benchmarks.
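To spell out that difference, a sketch using today's `icu_provider` API, assuming the `ThingV1`/`ThingV1Marker` example from earlier, a baked `static BAKED: ThingV1<'static>`, and a derived `ZeroFrom` impl on `ThingV1`:

```rust
use icu_provider::prelude::*;
use zerofrom::ZeroFrom;

fn payloads() -> (DataPayload<ThingV1Marker>, DataPayload<ThingV1Marker>) {
    // What baked data does today: store the reference, no conversion at all.
    let fast = DataPayload::from_static_ref(&BAKED);
    // The old ZeroFrom path: build a new (still zero-copy, borrowing) struct
    // at runtime. This is the cost that showed up in benchmarks.
    let slower = DataPayload::from_owned(ZeroFrom::zero_from(&BAKED));
    (fast, slower)
}
```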
```rust
// correction: currently we have a slice of structs, not an array of struct refs
payloads: &'static [
    PackedSkeletonDataV1 {
        index_info: SkeletonDataIndex::from_raw(...),
        patterns: VarZeroVec::from_bytes_unchecked(...),
    },
    // ...
]
```
Option 3 is equivalent to option 1, because `payloads` is stored in a `const`. `from_parts` will be evaluated at compile time, and the data struct stored in the `const`. If we instead stored the borrowed version of the data struct in the `const`, we're back at option 1.
Yes, I briefly forgot about `DataPayloadInner::StaticRef`. So options 0 and 1 are different, as claimed in the OP. Option 3 is equivalent to the current solution, option 0 (not option 1), except for file size being slightly smaller.
A more radical solution (bigger change but maybe better outcome) would be to add it to `DynamicDataMarker`:

```rust
pub trait DynamicDataMarker {
    type Borrowed<'a>;
    // not sure if this is the right syntax but you get the idea:
    type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>;
}
```
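To make the intent concrete, a sketch of how constructors could consume such an associated type. Everything here (the `BAKED` constant, the `ZeroFrom` impl between the two forms) is an assumption, not existing API:

```rust
// Hypothetical: baked data lives directly in the fully borrowed form.
const BAKED: ThingV1Borrowed<'static> = ThingV1Borrowed {
    a: VarZeroSlice::new_empty(),
    b: VarZeroSlice::new_empty(),
};

impl ThingFormatterBorrowed<'static> {
    // compiled_data constructor: no conversion work at all
    pub const fn new() -> Self {
        Self { payload: BAKED }
    }
}

impl ThingFormatter {
    // owned constructor: pays one ZeroFrom conversion up front
    pub fn new() -> Self {
        Self {
            payload: DataPayload::from_owned(ZeroFrom::zero_from(&BAKED)),
        }
    }
}
```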
The trait change above is the only implementation path I see for option 1. What alternative were you thinking of?
The other way to implement option 1 would be to have a databake derive attribute that defines how to get from the static representation to the struct, which we could do as part of #2452. `baked_exporter` would use it when available, and otherwise serialize the struct directly.
One thing databake could potentially do is have a way for the `CrateEnv` to collect auxiliary codegen. Something like:
```rust
#[derive(Default)]
struct PackedDataSkeletonV1PatternsAux {
    patterns: Vec<Pattern>, // element type illustrative
}

fn bake(&self, env: &CrateEnv) -> TokenStream2 {
    // This uses an anymap or something
    let map = env.get_or_insert::<PackedDataSkeletonV1PatternsAux>(
        PackedDataSkeletonV1PatternsAux::default(),
        // The flush function. This is just an example; #patterns would be
        // rendered from the collected aux data.
        |aux| quote!( const ALL_THE_PATTERNS = #patterns ),
    );
    let start = map.patterns.len();
    map.patterns.extend(self.patterns);
    let end = map.patterns.len();
    quote!(PackedSkeletonDataV1 {
        // ...
        patterns: ALL_THE_PATTERNS[#start..#end],
    })
}
```
This still needs some way to make the types work; the example above doesn't attempt to address that. But it could help for tricks like "store all of the data in a big `VarZeroVec<PackedSkeletonDataV1ULE>`".
This is not 2.0-blocking unless we implement a breaking solution for #5187.
- Option 2 (storing a `VarZeroVec<PackedSkeletonDataV1ULE>` and building `PackedSkeletonDataV1` at runtime) would produce the smallest code and probably the fastest compile times
- Option 3 (`PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8])`) reduces that scale
- Option 1 is the `ThingV1` example mixed with `pub trait DynamicDataMarker { type Borrowed<'a>; type Yokeable: for<'a> Yokeable<'a> + ZeroFrom<Self::Borrowed<'a>>; }`
- Building the struct at runtime means `DataPayload::from_owned` instead of `DataPayload::from_static_ref`, which increases runtime cost, but that can be something we weigh and measure.

(additional discussion not recorded)
Current thinking:
LGTM: @sffc @robertbastian
LGTM as well
To add some urgency to this issue, @kartva says:

> Compiling `icu_experimental` on main causes rustc to reliably crash on my laptop due to out-of-memory errors. Are there servers or codespaces that people working on icu4x can use?

(`icu_experimental` and `icu_datetime` are the two crates with the most finely sliced baked data)
I was previously compiling icu4x with 8 gigabytes of RAM. Right now, I observe RAM usage of up to 11 gigabytes when using baked data.
Building the latest icu_experimental data:
Command being timed: "cargo build"
User time (seconds): 30.57
System time (seconds): 8.20
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:38.41
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16492072
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 383
Minor (reclaiming a frame) page faults: 4667399
Voluntary context switches: 1181
Involuntary context switches: 529
Swaps: 0
File system inputs: 83232
File system outputs: 752960
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
Here are the figures for `impl_units_display_name_v1_marker!(Baked);` by itself:
```
Command being timed: "cargo build"
User time (seconds): 18.72
System time (seconds): 5.43
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.14
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16230900
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 225
Minor (reclaiming a frame) page faults: 4573824
Voluntary context switches: 693
Involuntary context switches: 726
Swaps: 0
File system inputs: 26344
File system outputs: 322536
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
That's a big share so I'll focus on reproducing just this in isolation.
Here's the same build with the paths imported instead of absolute:
```
Command being timed: "cargo build"
User time (seconds): 17.70
System time (seconds): 5.07
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.60
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 16218176
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 4571559
Voluntary context switches: 434
Involuntary context switches: 158
Swaps: 0
File system inputs: 0
File system outputs: 403136
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
And here's with using a const constructor:
```
Command being timed: "cargo build"
User time (seconds): 7.46
System time (seconds): 1.50
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.88
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3983060
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1208393
Voluntary context switches: 352
Involuntary context switches: 59
Swaps: 0
File system inputs: 0
File system outputs: 216696
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
And with a helper function:
```
Command being timed: "cargo build"
User time (seconds): 7.71
System time (seconds): 1.59
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.21
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3947592
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1198126
Voluntary context switches: 357
Involuntary context switches: 269
Swaps: 0
File system inputs: 0
File system outputs: 194544
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
And with `include_bytes!` for the trie data:
```
Command being timed: "cargo build"
User time (seconds): 7.26
System time (seconds): 1.54
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.73
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3961872
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1202201
Voluntary context switches: 390
Involuntary context switches: 65
Swaps: 0
File system inputs: 0
File system outputs: 193984
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
In tabular form (each row builds on the one above it):
| Scenario | File Size (bytes) | User Time (s) | Wall Clock Time (s) | Max Resident Set Size (kB) |
|---|---|---|---|---|
| Full Experimental Data | N/A | 30.57 | 38.41 | 16492072 |
| Unit Names Only | 13070690 | 18.72 | 24.14 | 16230900 |
| With Imports | 11301364 | 17.70 | 22.60 | 16218176 |
| With Const Constructor | 5038254 | 7.46 | 8.88 | 3983060 |
| With Helper Wrapping Constructor | 3923658 | 7.63 | 9.00 | 3953796 |
| With Helper Directly Building | 3923875 | 7.71 | 9.21 | 3947592 |
| With `include_bytes!` Trie | 3042840 | 7.26 | 8.73 | 3961872 |
So there seems to be a pretty strong correlation between file size, compile time, and memory usage, except for the last few rows with wrapper functions, which reduce file size but don't seem to impact the other metrics. `include_bytes!` doesn't seem to have a big impact, at least not for the big trie byte string.
Do we know which compiler pass is actually slow (`-Z time-passes`, I believe)? It would be useful to compare that output to that of a normal utils crate and see where the big differences are.
Revisiting this since we're getting exit-code-143 errors again in CI.
Here's where we're currently at on main:
```
$ cargo clean; /usr/bin/time -v cargo +nightly build -p icu_experimental --all-features
...
Finished `dev` profile [unoptimized + debuginfo] target(s) in 1m 00s
Command being timed: "cargo +nightly build -p icu_experimental --all-features"
User time (seconds): 63.19
System time (seconds): 18.96
Percent of CPU this job got: 135%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.51
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 15888616
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 395
Minor (reclaiming a frame) page faults: 6703629
Voluntary context switches: 27185
Involuntary context switches: 2375
Swaps: 0
File system inputs: 33184
File system outputs: 2091712
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
I ran it with `-Z time-passes`:
```
Compiling icu_experimental v0.1.0 (/usr/local/google/home/sffc/projects/icu4x/components/experimental)
time: 0.000; rss: 47MB -> 48MB ( +1MB) parse_crate
time: 0.000; rss: 48MB -> 48MB ( +0MB) incr_comp_prepare_session_directory
time: 0.000; rss: 48MB -> 49MB ( +1MB) setup_global_ctxt
time: 0.000; rss: 51MB -> 51MB ( +0MB) crate_injection
time: 1.013; rss: 51MB -> 1182MB (+1130MB) expand_crate
time: 1.013; rss: 51MB -> 1182MB (+1131MB) macro_expand_crate
time: 0.000; rss: 1182MB -> 1182MB ( +0MB) maybe_building_test_harness
time: 0.028; rss: 1182MB -> 1182MB ( +0MB) AST_validation
time: 0.001; rss: 1182MB -> 1182MB ( +0MB) finalize_imports
time: 0.004; rss: 1182MB -> 1182MB ( +0MB) finalize_macro_resolutions
time: 0.479; rss: 1182MB -> 1353MB ( +170MB) late_resolve_crate
time: 0.023; rss: 1353MB -> 1353MB ( +0MB) resolve_check_unused
time: 0.044; rss: 1353MB -> 1353MB ( +0MB) resolve_postprocess
time: 0.551; rss: 1182MB -> 1353MB ( +171MB) resolve_crate
time: 0.025; rss: 1273MB -> 1273MB ( +0MB) write_dep_info
time: 0.020; rss: 1273MB -> 1273MB ( +0MB) complete_gated_feature_checking
time: 0.072; rss: 1612MB -> 1456MB ( -156MB) drop_ast
time: 1.630; rss: 1273MB -> 1307MB ( +34MB) looking_for_derive_registrar
time: 1.898; rss: 1273MB -> 1308MB ( +35MB) misc_checking_1
time: 0.274; rss: 1309MB -> 1331MB ( +22MB) coherence_checking
time: 6.609; rss: 1308MB -> 1437MB ( +128MB) type_check_crate
time: 36.802; rss: 1437MB -> 2248MB ( +811MB) MIR_borrow_checking
time: 2.214; rss: 2248MB -> 2270MB ( +22MB) MIR_effect_checking
time: 0.130; rss: 2270MB -> 2283MB ( +14MB) module_lints
time: 0.130; rss: 2270MB -> 2283MB ( +14MB) lint_checking
time: 0.253; rss: 2283MB -> 2283MB ( +0MB) privacy_checking_modules
time: 0.061; rss: 2283MB -> 2277MB ( -6MB) check_lint_expectations
time: 0.526; rss: 2270MB -> 2277MB ( +8MB) misc_checking_3
time: 0.191; rss: 2301MB -> 2312MB ( +10MB) monomorphization_collector_graph_walk
time: 0.033; rss: 2312MB -> 2312MB ( +1MB) partition_and_assert_distinct_symbols
time: 0.587; rss: 2277MB -> 2294MB ( +16MB) generate_crate_metadata
time: 1.064; rss: 2296MB -> 2320MB ( +24MB) codegen_to_LLVM_IR
time: 3.448; rss: 2310MB -> 2320MB ( +11MB) LLVM_passes
time: 3.540; rss: 2294MB -> 2319MB ( +26MB) codegen_crate
time: 0.000; rss: 2319MB -> 2319MB ( +0MB) check_dirty_clean
time: 0.358; rss: 2319MB -> 2323MB ( +4MB) encode_query_results
time: 0.391; rss: 2319MB -> 2318MB ( -1MB) incr_comp_serialize_result_cache
time: 0.391; rss: 2319MB -> 2318MB ( -1MB) incr_comp_persist_result_cache
time: 0.392; rss: 2319MB -> 2318MB ( -1MB) serialize_dep_graph
time: 0.101; rss: 2318MB -> 1314MB (-1004MB) free_global_ctxt
time: 0.007; rss: 1314MB -> 1314MB ( +0MB) copy_all_cgu_workproducts_to_incr_comp_cache_dir
time: 0.007; rss: 1314MB -> 1314MB ( +0MB) finish_ongoing_codegen
time: 0.151; rss: 1288MB -> 1334MB ( +46MB) link_rlib
time: 0.167; rss: 1288MB -> 1334MB ( +46MB) link_binary
time: 0.170; rss: 1288MB -> 1279MB ( -9MB) link_crate
time: 0.179; rss: 1314MB -> 1279MB ( -35MB) link
time: 54.585; rss: 32MB -> 224MB ( +192MB) total
```
Relative to what other crates do, the following steps are the biggest (figures from the output above):

- expand_crate / macro_expand_crate: 1.0s, +1130MB RSS
- type_check_crate: 6.6s, +128MB RSS
- MIR_borrow_checking: 36.8s, +811MB RSS
- LLVM_passes: 3.4s

Of these, MIR_borrow_checking is by far the slowest relative to other crates; expand_crate and macro_expand_crate add the most to the memory usage.
Reminder from my previous exploration: adding a const constructor instead of doing raw struct construction did reduce compile times and peak memory usage (this was before I was measuring the rustc phase breakdown). This is consistent with @Manishearth's interpretation of the above figures.
I made a reproducible, standalone case here:
https://github.com/sffc/icu4x_compile_sample/tree/standalone
You can also see some of my previous work on making changes in the July 30 commits on the main branch.
I added const constructors where they were most impactful in #5541.
I don't consider this a long-term solution, but for the short term it should make CI less flaky.
`icu_datetime` compile times have regressed a lot since I added neo datetime data, and https://github.com/unicode-org/icu4x/pull/5221 appears to be choking in CI.

The finely sliced data markers (small data structs designed to work with many data marker attributes) give compelling data sizes and stack sizes in Postcard (https://github.com/unicode-org/icu4x/pull/4818, https://github.com/unicode-org/icu4x/pull/4779). However, in Baked, they significantly increase file size, and the numbers for data size are also not as compelling because baked data includes a lot more pointers (for example, at least 24 bytes for a ZeroVec) which are duplicated for each and every instance of the data struct.
Example data struct that is used in a finely sliced data marker:
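Roughly, based on the fields used throughout this thread:

```rust
// Approximate shape, reconstructed from the fields used in this thread;
// the real definition lives in icu_datetime's provider module.
pub struct PackedSkeletonDataV1<'data> {
    pub index_info: SkeletonDataIndex,
    pub patterns: VarZeroVec<'data, PatternULE>,
}
```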
Some ideas:
1. Instead of storing many static instances of `PackedSkeletonDataV1<'static>`, we could instead store many static instances of `(SkeletonDataIndex, &[u8])`, and build an instance of `PackedSkeletonDataV1<'static>` at runtime. This is "free", and it should significantly reduce file size, but it causes us to use a Yoke code path.
2. We could store everything in one `VarZeroVec<PackedSkeletonDataV1ULE>`, and build an instance of `PackedSkeletonDataV1<'static>` at runtime. This should result in the smallest file size and data size, in line with postcard sizes, but is a bit more of a runtime cost since we need to do a VZV lookup. However, it's only one lookup and only when the locale was found, so I don't think we should try to avoid this cost for the sake of avoiding this cost.
3. We could add `pub fn PackedSkeletonDataV1::new_unchecked(SkeletonDataIndex, &[u8])`, reducing file size and therefore probably compile times without changing any runtime characteristics. See https://github.com/unicode-org/icu4x/issues/2452.

@robertbastian @Manishearth @younies