Closed syg closed 4 months ago
cc @msaboff if JSC has any numbers to share re: Apple's resource-constrained devices.
I have some good things to report on this topic. I've spent the last 2 weeks trying out some experiments in V8 to see what is most effective in reducing the binary size.
I've been measuring as follows. I use ./tools/dev/gm.py android_arm.release
to build V8 for ARM32, and am looking at the object files and the built d8
executable. Here's the build args generated by this configuration:
is_component_build = false
is_debug = false
target_cpu = "arm"
v8_target_cpu = "arm"
target_os = "android"
v8_enable_backtrace = true
v8_enable_disassembler = true
v8_enable_object_print = true
v8_enable_verify_heap = true
dcheck_always_on = false
Edit: also make sure to add v8_enable_runtime_call_stats = false
To try to reproduce the size measurements of the object files above, I used the following:
for file in builtins-temporal js-temporal-objects temporal-parser; do \
size -A -d out/android_arm.release/obj/v8_base_without_compiler/$file.o | \
grep -E '\.(text|rodata|data)' | cut -c20-25 | jq -s add; \
done
Here's what I get:
File | Size |
---|---|
builtins-temporal.o | |
js-temporal-objects.o | 157880 |
temporal-parser.o | 23920 |
Total |
Unfortunately these numbers are different from what Shu posted above. (Edit: They are only slightly different now.) We talked about this offline and concluded that it's probably not possible to reproduce it exactly — the numbers in the OP were measured with an internal Google tool. So I'm treating this baseline number with some skepticism. Instead, I'll focus on comparing various built d8
binaries to the d8
built from 'vanilla' V8, using Bloaty.
More to come in following comments!
A short overview of where the size goes to.
JS builtins are the JS function objects, approximately equivalent to intrinsics using the %-notation in the spec text, e.g. %Temporal.PlainDate.prototype.add%. There is one per public API and this is pretty much fixed. Temporal has ~300 of them.
C++ builtins are C++ functions linked in such a way that when you call a JS builtin from JS, the C++ builtin is executed. Currently, Temporal has about ~300 of them, one per public API, but unlike JS builtins that number is not fixed.
There is also C++ code called by the C++ builtins, but that isn't a builtin itself. Of course the machine code generated by the compiler also takes up space in the binary.
The only way you could reduce the number of JS builtins is by removing public API from the proposal. (Is it really the only way? You could try sharing the same JS builtin between more than one API entry point, but then you'd have e.g. Object.is(Temporal.PlainDate.prototype.valueOf, Temporal.PlainTime.prototype.valueOf)
equal to true which would leak implementation details to user code.)
You can reduce the number of C++ builtins by reusing the same C++ builtin for multiple JS builtins. I'll call this "multiplexing". I've found that C++ builtins take up the most space in the binary, 60% ~40% of the total space of Temporal, so there are good gains to be had here. More on this later.
You can also reduce the size of the C++ code, which hopefully leads to the compiled machine code being smaller. You really have to delete a lot of code for this to have any noticeable effect, I found.
Multiplexing. There seem to be many opportunities to combine similar C++ builtins into one. For example, the getters for the various types' .calendarId
properties could be implemented something like this, in JS-like pseudocode:
function CalendarIdGetter() {
let slotValue;
switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
case PLAIN_DATE:
if (!IsTemporalDate(this)) throw new TypeError();
slotValue = this[PLAIN_DATE_CALENDAR_SLOT];
break;
case PLAIN_DATE_TIME:
if (!IsTemporalDateTime(this)) throw new TypeError();
slotValue = this[PLAIN_DATE_TIME_CALENDAR_SLOT];
break;
// ...
}
return ToTemporalCalendarIdentifier(slotValue);
}
Then each getter is a separate JS function object, but each calls the same C++ builtin as its body. We can stash the needed info to control the C++ builtin's behaviour on private properties of that JS function object, for example here we'd stash the numeric constant PLAIN_DATE
as Object.getOwnPropertyDescriptor(Temporal.PlainDate.prototype, 'calendarId').get[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]
.
We can do similar things for the .valueOf()
methods, for the .epochSeconds
/.epochMilliseconds
. etc getters, and for all the getters such as .year
, .monthCode
, .daysInMonth
, .inLeapYear
etc that just invoke a calendar method when they are called.
The runtime overhead of this is basically 1 switch
statement. That's a naive view, as branch prediction will likely interfere, so it'd probably be good to benchmark that.
Note that there is precedent in the V8 codebase for this: There isn't a separate builtin for each TypedArray constructor, instead they all share the same one. (Although it's a TurboFan builtin, not a C++ builtin.)
I have a branch at https://github.com/ptomato/v8/commit/3cb3a6f5eda9a56009f7bc315ad9752e9c0c43f9 that implements some of this multiplexing. We could do more, or less of it, I just picked a few of the ones that seemed like obvious gains. d8
compiled from this branch is 68k 36k smaller than my baseline, about 19% 15% reduction.
You can also take the multiplexing idea to its extreme conclusion and multiplex everything into one giant C++ builtin. That essentially looks like this, in JS-like pseudocode:
function TemporalMethodMultiplexer() {
switch (arguments.callee[PRIVATE_SYMBOL_TEMPORAL_CONSTRUCTOR]) {
case PLAIN_DATE:
if (!IsTemporalDate(this)) throw new TypeError();
switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
case 'add':
return TemporalPlainDatePrototypeAdd(this,
arguments[0], arguments[1]);
case 'subtract':
return TemporalPlainDatePrototypeSubtract(this,
arguments[0], arguments[1]);
// ...
}
case PLAIN_DATE_TIME:
if (!IsTemporalDateTime(this)) throw new TypeError();
switch (arguments.callee[PRIVATE_SYMBOL_METHOD_NAME]) {
// ...
}
// ...
}
}
This makes the runtime overhead basically 2 switch
statements, which is again a naive view. (In real code, the inner switch statements aren't really switch statements, they're string comparisons which sometimes collapse to pointer comparisons, because many but not all of the possible method names are internalized strings.)
This approach would probably need some fine-tuning based on usage data, because the order in which methods are considered, does matter for performance...
I also have a branch implementing this approach, at https://github.com/v8/v8/commit/fab0ae968f7ef712ae2d7e6e13e5fd72248a260e. (For reasons, it multiplexes everything into 3 builtins: one for all constructors, one for all static methods, and one for all prototype methods.) d8
compiled from this branch is 199k 92k smaller than my baseline, about 56% 38% reduction.
(This is what suggests to me that the most effective gains come from reducing the number of C++ builtins. This branch removes builtins but almost no actual implementation.)
Interesting, multiplexing is a much bigger reduction than I expected. Might be worth revisiting the decision.
Multiplexing was discussed and rejected in the past as a general technique because it muddies performance timers and tracing. One can argue that Temporal performance isn't so important, but the idea is if you're tracing a whole application, you'd want to know that X% of the app time is spent in, say, Date subtraction vs DateTime subtraction. That would be harder to tease out if there's just a multiplexed C++ frame for both.
Is there any interest in API simplification itself? It is extremely batteries-included.
I've been focusing primarily on whether we can achieve gains without touching the current state of the proposal and delaying it even more. But I did do a build where I simply ripped out a bunch of the APIs and all the code associated with them, probably about 30–40% of the total proposal, applied on top of the fully multiplexed build. That build was 56k 60k smaller than the fully multiplexed build with all APIs intact. (So that number applies to just the code, since all the C++ builtins were already gone.) (Edit: Note that with and without runtime callstats it removed the same number of bytes of machine code, about 56k, but in the no-RCS case the result happened to just fit in one less 4k page.)
So I think it's not a question of ripping out a few of the included batteries.
For comparison, I also did a build where I implemented #2789 and #2798, which are PRs with potential observable optimizations that we were not planning to pursue as normative changes unless they yielded significant size reductions. That build was 4k smaller but I suspect most of that was padding, because the few hundred bytes less of machine code happened to fit in one less 4k page. (Edit: That build was not any smaller at all, because decreases in the .text
and .data
segments were offset by increases in .relro_padding
.)
Of course only after I wrote up everything here did I figure out why my baseline number was so much higher than the one in the OP — my build had runtime callstats turned on, while Shu mentioned:
V8 has already disabled runtime callstats (a profiling functionality that bloats code size) on release channels due to Temporal.
(So I assumed that the release builds I'd been making also had runtime callstats disabled, but that was not the case.)
I'll go back through the thread and edit in the updated numbers, leaving the old ones visible with strikeout. But here is a summary.
The size reduction from multiplexing is not as dramatic as I originally thought, but it is still large and in my opinion worth doing.
Additionally, taking the full multiplexing approach could allow V8 to switch runtime callstats back on by default, if Temporal was the only reason it was disabled; once all the Temporal builtins have been removed by multiplexing, runtime callstats actually make builtins-temporal.o
2k smaller for some reason (although of course, the d8
binary is still considerably larger in total with runtime callstats.)
The partial multiplex approach saves -36k, the full multiplex approach saves -92k, and ripping out 30–40% of the proposal still saves about -56k although this time, due to page alignment, we lose several k of padding additionally.
See #2834 for a different approach to reducing the number of builtins.
I've been asked to comment on this. Summary: size is not free, it does make sense to trim bloat when possible.
Multiplexing addresses one aspect of the cost of having many functions, by avoiding duplication of machine code compiled from C++ (or whatever language the engine is written in), and it's nice to see those savings. (FWIW, I'd use the size
command to compare the .text section size; including padding in the results isn't representative because it'll change "randomly" on every unrelated commit, so you might just get lucky or unlucky at the particular measurement you take.)
Multiplexing doesn't address another aspect of the cost of having many functions though, and that's simply having the JavaScript object hierarchy. Each object property increases the size of that object. Each closure (which all have observable identity: (function() {}) !== (function() {})
) takes some pointers to store in memory, even when its actual executable body is multiplexed/shared. For a single built-in function, that's of course not very much (I haven't measured it, something on the order of 100 bytes or so would sound about right), but if you have 320 Temporal.*
functions installed on the global object, dozens of <iframe>
s on your average website with their own global object each, and dozens of open tabs in the typical browser session, it does add up to quite significant memory consumption. Many users' systems are eternally close to an out-of-memory situation; we know from (sometimes accidental...) experiments that allocating just 20MB more causes a significant uptick in the out-of-memory renderer crash rate. (That's the key reason why adding things to the platform should be a very careful consideration: it just doesn't make sense to burden every <iframe>
with every feature that anyone might ever need.)
As a related data point, V8 uses a "snapshotting" technique (deserializing a heap snapshot that's embedded in the shipped binary) because creating all of Math.*
, Array.*
, etc from scratch for every <iframe>
would be way too expensive. While snapshotting turns this from "a big problem" to "not much of a problem", that doesn't mean that doubling (or quadrupling even?) the required size of the snapshot wouldn't matter.
Since you're already experimenting with custom V8 builds to measure things, you could relatively simply measure what dropping all or part of Temporal
does to memory consumption after starting a d8
session. That's a single global object; so for an estimate of the impact on the average browser session you can multiply the number by 100 to 1000.
I haven't measured the runtime memory consumption. Those experiments could be done, but I focused on the static code size here because that's what @syg requested at the top of the thread. If we need to reverse course on that, I'd appreciate having some guidelines as to what kind of runtime size increase Google would consider acceptable and what would not.
Is the snapshot size not included in the stats that Bloaty gives when comparing d8 binaries?
I didn't mean to suggest reversing course. Binary size is an important metric. My previous comment came from a more general "why proposal size reductions are worth considering" perspective: in addition to binary size, memory consumption matters too. Sorry that that was a bit of a deviation from the original topic of this thread; feel free to discuss it elsewhere if you prefer.
I don't know how to determine a number X such that "X bytes is acceptable, X+1 bytes is not" (for either binary size or memory consumption). The less, the better. If you can cover 98% of the use cases for 70% of the cost, go for it. There may be a few developers who will complain that some niche feature they liked is gone, but on the flip side there will be a few billion users who will be happy that the software on their devices is a little less bloated.
@jakobkummerow Thanks for the explanation. This is really helpful. I share your goal of reducing complexity and surface area where possible.
I'm a little curious what your thoughts are on a different kind of multiplexing, more like TypedArray methods: what if we reused the same function identity, but then multiplexed internally among function bodies, as proposed for toJSON https://github.com/tc39/proposal-temporal/issues/2858 ?
Not that this weakens your argument, but I'm just curious about the impact on making the snapshot bigger, and therefore more snapshot restoration work: Has the V8 team recently considered implementing lazy allocation of the objects backing some JS builtins, as other browsers do? I wonder if that could help mitigate the runtime memory usage/startup time cost. Is such a technique too complex/costly?
what if we reused the same function identity, but then multiplexed internally among function bodies
I haven't really formed an opinion on which of the size reduction ideas under discussion are most worthwhile/impactful/acceptable.
As a quick estimate: each function with object identity needs a JSFunction
(32 bytes), if it has its own JS-level source then it also needs a SharedFunctionInfo
(48 bytes), if it has its own C++ implementation then the builtin tables probably add another two pointers or so. I'm not exactly sure which two scenarios are being compared at #2858, but that could mean a lower bound of 800 bytes saved. (The size impact of changes is notoriously hard to predict accurately; when in doubt, try it and measure!)
Has the V8 team recently considered implementing lazy allocation of the objects backing some JS builtins
I haven't heard of any such explorations recently, fwiw.
I'm assuming the slate of removals that was approved in the TC39 meeting on 2024-06-12 is sufficient to close this issue. Please let us know ASAP if that's not the case.
As discussed elsewhere, Chrome and V8 would ideally like to see the size of the proposal reduced to make shipping on low-resource devices like lower end Android phones more palatable.
This issue is only for focusing on the static code size, the size that's in the shipped binary of Chrome.
The bulk of Temporal in V8 is implemented across 3 files:
builtins-temporal.cc
,js-temporal-objects.cc
, andtemporal-parser.cc
. IIUC this isn't even complete yet. @FrankYFTang, is that the case?The numbers we have as of V8 12.4 on 32bit release Android APK:
.text
+.data
+.rodata
+.data.rel.ro
builtins-temporal.cc
js-temporal-objects.cc
temporal-parser.cc
V8 contributes a total of 9539.62 KiB, which makes Temporal at least 2.6% of all of V8.
This is after size optimizations. V8 has already disabled runtime callstats (a profiling functionality that bloats code size) on release channels due to Temporal.
The guidelines from the Android folks is that a feature that adds 100-250 KiB is considered Medium, with 250+ KiB considered Large. "Features" in that context is not strictly language features. A new JIT tier is also a feature, and V8 uses a lot of size budget for our compilers since we believe better performance has outsized impact.
So, ideally, any reduction in code size here would be welcome. The closer we get to 100 KiB the better.