Open scottlamb opened 2 years ago
I just realized that on stable Rust, it's currently impossible to calculate offsets within structures from a `const` context, which means no compiled-in struct (de)serialization tables. That's disappointing. Plan:

* Try `memoffset`'s `unstable_const` feature as a proof-of-concept and see what sizes are possible.
* Use `once_cell` on stable and see how much bloat this adds back. I suppose I could even decide between compile-time or runtime with a feature flag or something.

It looks like folks are working on stabilizing this stuff; I hope that trade-off doesn't last long.
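For context, here's a minimal sketch (mine, not from the issue) of the capability the plan depends on: computing a field offset at compile time so it can live in a `const` table. At the time, this required `memoffset`'s `offset_of!` behind its `unstable_const` feature on nightly; `std::mem::offset_of!` later provided it on stable (Rust 1.77+), which is what this sketch uses.

```rust
// Hypothetical type standing in for a schema struct.
struct Transport {
    protocol: u32,
    tunnel: Vec<u8>,
}

// Evaluated entirely at compile time, so it could be baked into a
// static (de)serialization table.
const PROTOCOL_OFFSET: usize = std::mem::offset_of!(Transport, protocol);

fn main() {
    // The offset of any field is within the struct's size.
    assert!(PROTOCOL_OFFSET < std::mem::size_of::<Transport>());
}
```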
There's another problem I didn't anticipate: some schemas have cycles. Here's a trivial one (a type that refers directly back to itself):
```rust
#[derive(PartialEq, Debug, Serialize, Deserialize)]
#[static_xml(prefix = "tt", namespace = "tt: http://www.onvif.org/ver10/schema")]
pub struct Transport {
    // Defines the network protocol for streaming, either UDP=RTP/UDP,
    // RTSP=RTP/RTSP/TCP or HTTP=RTP/RTSP/HTTP/TCP
    #[static_xml(prefix = "tt", rename = "Protocol")]
    pub protocol: TransportProtocol,

    // Optional element to describe further tunnel options. This element is
    // normally not needed
    #[static_xml(prefix = "tt", rename = "Tunnel")]
    pub tunnel: Vec<Transport>,
}
```
The type vtable for `Transport` (which should know how to deserialize it) needs to refer to the struct table for `Transport` (with per-field information), which needs to refer back to the type vtable to know how to deserialize the `tunnel` field.

Rust understandably doesn't let me generate `const`s which have Rust references in a cycle. E.g., the following won't compile:
```rust
struct Foo(&'static Foo);
const A: Foo = Foo(&B);
const B: Foo = Foo(&A);
```
Similarly, producing the vtable from a `once_cell::sync::Lazy` and evaluating fields' lazy cells recursively can't possibly work; the docs say this:

> It is an error to reentrantly initialize the cell from `f`. The exact outcome is unspecified. Current implementation deadlocks, but this may be changed to a panic in the future.
I guess I can build these in a lazy cell, have them reference the lazy cell itself (not using its `Deref`), and then dereference when actually accessing the field. Or I could build a (mutex-guarded) hash map keyed by `std::any::TypeId`, but that's probably not any better, and it raises the question of what fills it.
I think this might be a dead end. :-( The problem is that I don't see a way to generate the vtables at compile-time (for the reasons described above), and the code to generate them more than cancels out the gains of the table-driven approach.
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ cargo bloat --release --example camera --filter schema
   Compiling static-xml v0.1.0 (/home/slamb/git/static-xml/static-xml)
   Compiling xsd-types v0.1.0 (/home/slamb/git/crates/xsd-parser-rs/xsd-types)
   Compiling static-xml-derive v0.1.0 (/home/slamb/git/static-xml/static-xml-derive)
   Compiling schema v0.1.0 (/home/slamb/git/crates/onvif-rs/schema)
   Compiling onvif v0.1.0 (/home/slamb/git/crates/onvif-rs/onvif)
    Finished release [optimized] target(s) in 36.51s
    Analyzing target/release/examples/camera

File  .text     Size Crate  Name
0.0%   0.0%   1.7KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema schema::soap_envelope::Fault::is_unauthorized
0.0%   0.0%   1.1KiB schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::StreamSetup>::write_children
0.0%   0.0%   1.1KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     894B schema <schema::onvif::Ptzconfiguration as core::fmt::Debug>::fmt
0.0%   0.0%     885B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     851B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     849B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     814B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     802B schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::Ipaddress>::write_children
0.0%   0.0%     764B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     751B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     734B schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::MetadataConfiguration>::write_children
0.0%   0.0%     710B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     710B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     706B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     681B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     681B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
1.0%   3.6% 132.3KiB        And 998 smaller methods. Use -n N to show more.
1.1%   4.1% 150.3KiB        filtered data size, the file size is 13.2MiB
```
I can improve this a little bit by using the struct for serialization as well as deserialization, but when it gets down to it, those initialize closures are just huge.
Immediately after that I realized that I can actually generate the vtable at construction time; I just need to basically launder the constant through a function. (It made sense to me at first that Rust couldn't have cycles in constant references. But then I realized functions allow this in basically every language. So now I'm wondering if the limitation is actually necessary, but I have a workaround anyway.) 520af8a implements this approach, and now total file size is ticking downward:
```text
File  .text     Size Crate
5.2%  18.0% 649.3KiB std
4.4%  15.2% 549.3KiB reqwest
2.3%   8.0% 288.5KiB clap
1.8%   6.0% 218.2KiB h2
1.7%   5.7% 207.3KiB regex
1.5%   5.3% 191.1KiB regex_syntax
1.1%   3.8% 137.5KiB tokio
1.0%   3.4% 124.2KiB hyper
1.0%   3.4% 121.6KiB tracing_subscriber
0.8%   2.9% 103.4KiB onvif
0.7%   2.5%  88.8KiB schema
0.7%   2.3%  84.6KiB regex_automata
0.7%   2.3%  83.4KiB static_xml
0.6%   2.1%  74.3KiB aho_corasick
0.5%   1.7%  62.7KiB xml
0.4%   1.4%  51.4KiB http
0.4%   1.3%  46.1KiB url
0.4%   1.2%  44.5KiB num_bigint
0.3%   1.0%  36.8KiB idna
0.3%   1.0%  36.3KiB chrono
2.5%   8.8% 316.2KiB And 63 more crates. Use -n N to show more.
29.0% 100.0%   3.5MiB .text section size, the file size is 12.1MiB
```
So basically this is smaller on nightly, and I can use a feature flag to make it work on stable at the cost of bloat. This is sounding promising.
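A minimal sketch of the laundering trick (mine, not the actual 520af8a code): the vtable stores a `fn() -> &'static Vtable` instead of a `&'static Vtable`, so const evaluation never has to resolve the reference cycle; the lookup happens at call time.

```rust
struct Vtable {
    name: &'static str,
    // A function pointer defers the cross-reference to call time,
    // breaking the cycle that direct &'static references would create.
    child: fn() -> &'static Vtable,
}

static TRANSPORT: Vtable = Vtable { name: "Transport", child: tunnel_vtable };
static TUNNEL: Vtable = Vtable { name: "Tunnel", child: transport_vtable };

// These shims "launder" the constants through functions.
fn tunnel_vtable() -> &'static Vtable { &TUNNEL }
fn transport_vtable() -> &'static Vtable { &TRANSPORT }

fn main() {
    assert_eq!((TRANSPORT.child)().name, "Tunnel");
    // The cycle is traversable at runtime.
    assert_eq!(((TRANSPORT.child)().child)().name, "Transport");
}
```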
Major caveat: this code doesn't actually work yet. I'm going to address that next...but I don't expect it's broken in a way that totally invalidates the numbers above. And those numbers can be improved; I can shrink the tables themselves, and I can use the tables to eliminate the serialization bloat also.
Grr. I'm having a tough time with this. Making it (mostly) work did increase binary size; there were some missing bits that apparently cause significant bloat. What's more, I'm having a tough time understanding where the bloat is. It's not in the `.text` section, so `cargo bloat` doesn't really help. My current table-driven code has shrunk code size by 100 KiB but file size by only 30 KiB. It doesn't seem worth the `unsafe` and complexity of the approach so far.
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ bloaty camera-debloat-0c65eb4 -- camera-main-ca51ec1
    FILE SIZE        VM SIZE
 --------------  --------------
  +6.6% +31.2Ki  +6.7% +31.2Ki    .data.rel.ro
  +3.5% +26.5Ki  [ = ]       0    .symtab
  +3.6% +23.5Ki  +3.6% +23.5Ki    .rela.dyn
  +0.9% +7.55Ki  +0.9% +7.55Ki    .rodata
  +1.5% +1.54Ki  +1.5% +1.54Ki    .eh_frame_hdr
   +11%    +768  [ = ]       0    [Unmapped]
  -0.1%    -536  -0.1%    -536    .eh_frame
  -6.5% -1.08Ki  -6.5% -1.08Ki    .got
  -0.4% -9.12Ki  [ = ]       0    .strtab
  -6.1% -9.81Ki  -6.1% -9.81Ki    .gcc_except_table
  -2.6%  -101Ki  -2.6%  -101Ki    .text
  -0.2% -30.6Ki  -0.7% -48.8Ki    TOTAL
```
The obvious question is: are the tables taking up a lot of space? It's a little awkward to measure how big they are, so the easiest thing I can do is make them even bigger and measure how much the file grows. Here I've added 256 bytes to the vtable for each type and 32 bytes to the vtable for each field:
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ bloaty camera-debloat-0c65eb4-vtable-ballast-256-field-ballast-32 -- camera-debloat-0c65eb4
    FILE SIZE        VM SIZE
 --------------  --------------
   +11% +55.0Ki   +11% +55.0Ki    .data.rel.ro
  +0.1% +2.92Ki  [ = ]       0    .strtab
  +0.0% +1.11Ki  +0.0% +1.11Ki    .text
  +0.0%    +184  +0.0%    +184    .eh_frame
  +0.0%    +144  [ = ]       0    .symtab
  +0.0%    +120  +0.0%    +120    .rela.dyn
  +0.0%     +32  +0.0%     +32    .eh_frame_hdr
  +0.1%     +16  +0.1%     +16    .got
  +0.0%     +16  +0.0%     +16    .rodata
  -0.7%      -3  [ = ]       0    .shstrtab
  -6.3%    -480  [ = ]       0    [Unmapped]
  +0.5% +59.1Ki  +0.9% +56.5Ki    TOTAL
```
So the tables basically are all in `.data.rel.ro` (which makes sense) and are less than half of the problem. I have ideas to shrink them, but it doesn't seem like that should be my priority. I think it's more about the total number of symbols increasing the size of `.strtab` and `.rela.dyn`. (Possibly also `.rodata` and `.eh_frame_hdr`, but those are less significant.)
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-main-ca51ec1 | wc -l
31160
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-debloat-0c65eb4 | wc -l
32286
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-main-ca51ec1 | fgrep schema | wc -l
3397
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-debloat-0c65eb4 | fgrep schema | wc -l
2219
```
I wish Rust had associated statics; then I wouldn't need all the vtable shims added as described in the previous comments. I could also generate one big TypeId-to-vtable lookup at compile time for all the known types in a code generator that sees the whole schema at once, but not in a proc macro.
I also wish I could get rid of all these `push_value` and `finalize_field` methods, but due to Rust's clever layout of `Option`, I'm having trouble coming up with an alternative. Maybe I can restructure things so that the `Option` methods are broken out and get dropped as unused if they're not actually referenced. That'd help a lot for some schemas and maybe even make things worse for others.
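A quick demonstration (mine) of the "clever layout" in question: `Option` around a pointer-like type uses the null niche rather than a separate discriminant, so generic field-handling code can't assume one uniform tag-plus-payload layout across field types.

```rust
use std::mem::size_of;

fn main() {
    // Niche optimization: None is encoded as the null pointer, so
    // Option<&T> and Option<Box<T>> are pointer-sized.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
    assert_eq!(size_of::<Option<Box<u8>>>(), size_of::<Box<u8>>());
    // A plain u64 has no niche, so Option<u64> needs space for a tag.
    assert!(size_of::<Option<u64>>() > size_of::<u64>());
}
```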
Maybe it's worth combining `push_value` and `finalize_field` into one method that receives a command to execute via an enum parameter, although that's super ugly.
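The "super ugly" idea might look roughly like this sketch (names are illustrative, not from static-xml): each field type monomorphizes one `execute` symbol instead of two, trading a branch for a symbol-table entry.

```rust
// One command enum replaces two separate per-field methods.
enum FieldCommand {
    Push(String),
    Finalize,
}

// Stand-in for an optional single-valued field's deserialization state.
struct OptionalField(Option<String>);

impl OptionalField {
    fn execute(&mut self, cmd: FieldCommand) -> Result<(), &'static str> {
        match cmd {
            FieldCommand::Push(v) => {
                if self.0.is_some() {
                    return Err("duplicate element");
                }
                self.0 = Some(v);
                Ok(())
            }
            // An optional field is fine when absent at finalize time.
            FieldCommand::Finalize => Ok(()),
        }
    }
}

fn main() {
    let mut f = OptionalField(None);
    f.execute(FieldCommand::Push("RTSP".to_string())).unwrap();
    assert!(f.execute(FieldCommand::Push("UDP".to_string())).is_err());
    f.execute(FieldCommand::Finalize).unwrap();
    assert_eq!(f.0.as_deref(), Some("RTSP"));
}
```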
I suppose I could drop those symbols entirely if I didn't support `std::option::Option` fields but instead my own:
```rust
#[repr(C)]
enum MyOption<T> {
    None = 0,
    Some(T) = 1,
}
```
but I'd like to make these types feel normal.
There's also some bloat from `alloc::raw_vec::RawVec<T,A>::reserve_for_push`. I'm surprised these aren't getting inlined into `push_value`, and that they get monomorphized rather than taking the `Layout` as a parameter. I think I can eliminate these with some very ugly unsafe code, although I still need a per-type wrapper because the docs say "the order of [`Vec`'s] fields is completely unspecified", unless I assume that `Vec<T>` and `Vec<Y>` have the same layout for any `T` and `Y`. Basically I need to call `into_raw_parts`, do the reallocation myself with the `std::alloc` API, then call `from_raw_parts` to convert it back to a `Vec`.
I'm tempted to actually try changing the `std` library's `RawVec` to take the `Layout` as a parameter and see what it does to Rust code size and speed in general (not just for my crate).
I played with this a bit, and it's probably a dead end. `RawVec::reserve_for_push` calls `RawVec::grow_amortized`, which calls `set_ptr` at the end. `set_ptr` divides the allocated length by the size of `T`. It looks like allocators return the actual allocated size, which may be larger than the requested size, and this allows the `Vec` to take advantage of the excess. Anyway, division is slow, but division by a constant like 8 can be fast, so the monomorphization is accomplishing something. I tried inlining the division into the caller, but the results didn't look good. I could try removing this optimization (using only the requested portion), but that probably wouldn't be great either. It seems possible to depend only on the size, not the full type, but I'm not sure how many of these types are actually the same size anyway.
Try out the approach to reduce the per-type overhead of these methods described in the README:

> Currently `static-xml-derive` writes explicit generated code. E.g., with this type: […] the `Deserialize` macro produces code roughly as follows: […]
>
> I believe this is close to the minimal size with this approach. Next I'd like to experiment with a different approach in which the `Visitor` impl is replaced with a table that holds the offset within `FooVisitor` of each field, and a pointer to an `element` function. The generated code would use `unsafe`, but soundness only has to be proved once in the generator, and this seems worthwhile if it can achieve significant code size reduction.
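The table-driven `Visitor` replacement might look roughly like this sketch (mine; names are illustrative, and it uses `std::mem::offset_of!`, stable since Rust 1.77, where the issue's timeframe would have needed `memoffset`). One generic dispatcher walks a per-type table of field offsets and type-erased element functions, instead of per-type generated match arms:

```rust
use std::mem::offset_of;

struct FooVisitor {
    protocol: Option<String>,
    tunnel: Vec<String>,
}

// Per-field table entry: element name, offset within the visitor struct,
// and a type-erased handler for that field's storage.
struct FieldEntry {
    name: &'static str,
    offset: usize,
    element: unsafe fn(*mut u8, String),
}

// Safety: callers must pass a pointer to the field type each fn expects.
unsafe fn set_option(field: *mut u8, text: String) {
    unsafe { *(field as *mut Option<String>) = Some(text) };
}

unsafe fn push_vec(field: *mut u8, text: String) {
    unsafe { (*(field as *mut Vec<String>)).push(text) };
}

// Built entirely at compile time; lives in read-only data.
static FOO_FIELDS: [FieldEntry; 2] = [
    FieldEntry { name: "Protocol", offset: offset_of!(FooVisitor, protocol), element: set_option },
    FieldEntry { name: "Tunnel", offset: offset_of!(FooVisitor, tunnel), element: push_vec },
];

// One shared dispatcher replaces the per-type Visitor impl.
fn handle_element(visitor: &mut FooVisitor, name: &str, text: String) {
    let base = visitor as *mut FooVisitor as *mut u8;
    for entry in &FOO_FIELDS {
        if entry.name == name {
            unsafe { (entry.element)(base.add(entry.offset), text) };
            return;
        }
    }
}

fn main() {
    let mut v = FooVisitor { protocol: None, tunnel: Vec::new() };
    handle_element(&mut v, "Protocol", "RTSP".to_string());
    handle_element(&mut v, "Tunnel", "inner".to_string());
    assert_eq!(v.protocol.as_deref(), Some("RTSP"));
    assert_eq!(v.tunnel, ["inner"]);
}
```

The soundness argument lives in one place: the generator must emit offsets and element functions that agree with the visitor's field types, which is exactly the "proved once in the generator" property described above.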