Improve finding edges to data segments

fitzgen commented 6 years ago

Ok, clearly this wasm's size has a lot going on in the data section:

~/twiggy
$ twiggy top -n 10 ~/gutenberg-parser-rs/bindings/wasm/gutenberg_post_parser.debug.wasm 
 Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼───────────────────────────────────────────────────────────────────────────────────
          6285 ┊     6.49% ┊ data[4]
          3653 ┊     3.77% ┊ gutenberg_post_parser::parser::block::h482c2450268e1452
          3485 ┊     3.60% ┊ <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter::h16c80f212f4e48bd
          2726 ┊     2.82% ┊ "function names" subsection
          2593 ┊     2.68% ┊ data[2807]
          1900 ┊     1.96% ┊ data[5]
          1796 ┊     1.86% ┊ data[2804]
          1771 ┊     1.83% ┊ data[0]
          1749 ┊     1.81% ┊ data[2749]
          1486 ┊     1.54% ┊ data[38]

Which functions are using this data?

~/twiggy
$ twiggy paths ~/gutenberg-parser-rs/bindings/wasm/gutenberg_post_parser.debug.wasm 'data[4]' 'data[2807]' 'data[5]' 'data[2804]' 'data[0]' 'data[2749]' 'data[38]'
 Shallow Bytes │ Shallow % │ Retaining Paths
───────────────┼───────────┼────────────────
          1771 ┊     1.83% ┊ data[0]
          6285 ┊     6.49% ┊ data[4]
          1900 ┊     1.96% ┊ data[5]
          1486 ┊     1.54% ┊ data[38]
          1749 ┊     1.81% ┊ data[2749]
          1796 ┊     1.86% ┊ data[2804]
          2593 ┊     2.68% ┊ data[2807]

No edges to the data segments found. I don't believe that is accurate.

fitzgen commented 6 years ago

input test case

fitzgen commented 6 years ago

This should help us get useful information about data segments: https://reviews.llvm.org/D46417

srenatus commented 6 years ago

🔍 From sprinkling println! on it, it looks like for the input test case provided, let I32Const(base) never matches:

if let I32Const(base) = code[i - 1] {
  if let Some(data_id) = items.get_data(base as u32 + off) {
    items.add_edge(body_id, data_id);
  }
}

However, I'm just beginning to wrap my mind around what's happening there... currently, I fail to understand the meaning (and necessity) of base -- it seems like offset would be all that is needed to load something from the linear data (i.e., all the data segments concatenated), and hence enough to determine if a particular data segment is needed.

Also, am I right assuming that optimizing data segments isn't all that simple -- we cannot drop something without changing all existing offsets. I suppose it could be overwritten with zeros, and thus be end up being better to compress? (And additionally, it's hard to determine when the default data segment content from the wasm file is used, or when it is overwritten (using the store instructions), and loaded from that, isn't it?)

srenatus commented 6 years ago

Update -- I've found a reference for my 💭 points above: https://github.com/rust-lang-nursery/rust-wasm/issues/21 ✅

fitzgen commented 6 years ago

However, I'm just beginning to wrap my mind around what's happening there... currently, I fail to understand the meaning (and necessity) of base -- it seems like offset would be all that is needed to load something from the linear data (i.e., all the data segments concatenated), and hence enough to determine if a particular data segment is needed.

Accessing array[i] -- where array is some global data array and i is our static index into it -- will translate roughly into

i32.const $address_of_array
i32.load $i

The i is the static offset in that code snippet.

The thing is, this will only work for static indices, where the offset is embedded in the instruction. Any kind of dynamic indexing into the array won't be seen by this (very simple, conservative) analysis. I suspect that this is what is happening in practice.

Also, am I right assuming that optimizing data segments isn't all that simple -- we cannot drop something without changing all existing offsets. I suppose it could be overwritten with zeros, and thus be end up being better to compress? (And additionally, it's hard to determine when the default data segment content from the wasm file is used, or when it is overwritten (using the store instructions), and loaded from that, isn't it?)

The only way to safely remove them is by having and parsing relocations from the linker. Twiggy is just performing (or at least attempting to perform) a conservative analysis for informative purposes. It can't be complete unless both (1) relocations are present, and (2) we teach twiggy how to understand them. That said, it would be cool to add that feature, and it would also be cool to extend our conservative analysis to understand more cases.

In general, I think a full abstract interpreter for wasm would be super useful here, although that definitely belongs in a project of its own :)

srenatus commented 6 years ago

@fitzgen thanks for shedding some light on this 😃

rustwasm / twiggy

Improve finding edges to data segments #45