Open scottlamb opened 2 years ago
I just realized that on stable Rust, it's currently impossible to calculate offsets within structures from a `const` context, which means no compiled-in struct (de)serialization tables. That's disappointing. Plan:

* Try `memoffset`'s `unstable_const` feature as a proof-of-concept and see what sizes are possible.
* Use `once_cell` on stable and see how much bloat this adds back. I suppose I could even decide between compile-time or runtime with a feature flag or something.

It looks like folks are working on stabilizing this stuff; I hope that trade-off doesn't last long.
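For context, here's a minimal sketch (mine, not from the issue) of the capability the plan depends on: computing a field offset at compile time so it can live in a `const` table. At the time, this required `memoffset`'s `offset_of!` behind its `unstable_const` feature on nightly; `std::mem::offset_of!` later provided it on stable (Rust 1.77+), which is what this sketch uses.

```rust
// Hypothetical type standing in for a schema struct.
struct Transport {
    protocol: u32,
    tunnel: Vec<u8>,
}

// Evaluated entirely at compile time, so it could be baked into a
// static (de)serialization table.
const PROTOCOL_OFFSET: usize = std::mem::offset_of!(Transport, protocol);

fn main() {
    // The offset of any field is within the struct's size.
    assert!(PROTOCOL_OFFSET < std::mem::size_of::<Transport>());
}
```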
There's another problem I didn't anticipate: some schemas have cycles. Here's a trivial one (a type that refers directly back to itself):
```rust
#[derive(PartialEq, Debug, Serialize, Deserialize)]
#[static_xml(prefix = "tt", namespace = "tt: http://www.onvif.org/ver10/schema")]
pub struct Transport {
    // Defines the network protocol for streaming, either UDP=RTP/UDP,
    // RTSP=RTP/RTSP/TCP or HTTP=RTP/RTSP/HTTP/TCP
    #[static_xml(prefix = "tt", rename = "Protocol")]
    pub protocol: TransportProtocol,

    // Optional element to describe further tunnel options. This element is
    // normally not needed
    #[static_xml(prefix = "tt", rename = "Tunnel")]
    pub tunnel: Vec<Transport>,
}
```
The type vtable for `Transport` (which should know how to deserialize it) needs to refer to the struct table for `Transport` (with per-field information), which needs to refer back to the type vtable to know how to deserialize the `tunnel` field.

Rust understandably doesn't let me generate `const`s which have Rust references in a cycle. E.g., the following won't compile:
```rust
struct Foo(&'static Foo);
const A: Foo = Foo(&B);
const B: Foo = Foo(&A);
```
Similarly, producing the vtable from a `once_cell::sync::Lazy` and evaluating fields' lazy cells recursively can't possibly work; the docs say this:

> It is an error to reentrantly initialize the cell from `f`. The exact outcome is unspecified. Current implementation deadlocks, but this may be changed to a panic in the future.
I guess I can build these in a lazy cell, have them reference the lazy cell itself (not using its `Deref`), and then dereference when actually accessing the field. Or I could build a (mutex-guarded) hash map keyed by `std::any::TypeId`, but that's probably not any better, and it raises the question of what fills it.
I think this might be a dead end. :-( The problem is that I don't see a way to generate the vtables at compile-time (for the reasons described above), and the code to generate them more than cancels out the gains of the table-driven approach.
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ cargo bloat --release --example camera --filter schema
   Compiling static-xml v0.1.0 (/home/slamb/git/static-xml/static-xml)
   Compiling xsd-types v0.1.0 (/home/slamb/git/crates/xsd-parser-rs/xsd-types)
   Compiling static-xml-derive v0.1.0 (/home/slamb/git/static-xml/static-xml-derive)
   Compiling schema v0.1.0 (/home/slamb/git/crates/onvif-rs/schema)
   Compiling onvif v0.1.0 (/home/slamb/git/crates/onvif-rs/onvif)
    Finished release [optimized] target(s) in 36.51s
    Analyzing target/release/examples/camera

File  .text     Size Crate  Name
0.0%   0.0%   1.7KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%   1.2KiB schema schema::soap_envelope::Fault::is_unauthorized
0.0%   0.0%   1.1KiB schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::StreamSetup>::write_children
0.0%   0.0%   1.1KiB schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     894B schema <schema::onvif::Ptzconfiguration as core::fmt::Debug>::fmt
0.0%   0.0%     885B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     851B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     849B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     814B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     802B schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::Ipaddress>::write_children
0.0%   0.0%     764B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     751B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     734B schema schema::onvif::_::<impl static_xml::ser::Serialize for schema::onvif::MetadataConfiguration>::write_children
0.0%   0.0%     710B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     710B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     706B schema <T as static_xml::ser::SerializeField>::write
0.0%   0.0%     681B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
0.0%   0.0%     681B schema once_cell::imp::OnceCell<T>::initialize::{{closure}}
1.0%   3.6% 132.3KiB        And 998 smaller methods. Use -n N to show more.
1.1%   4.1% 150.3KiB        filtered data size, the file size is 13.2MiB
```
I can improve this a little bit by using the struct for serialization as well as deserialization, but when it gets down to it, those initialize closures are just huge.
Immediately after that I realized that I can actually generate the vtable at construction time; I just need to basically launder the constant through a function. (It made sense to me at first that Rust couldn't have cycles in constant references. But then I realized functions allow this in basically every language. So now I'm wondering if the limitation is actually necessary, but I have a workaround anyway.) 520af8a implements this approach, and now total file size is ticking downward:
```text
File  .text     Size Crate
5.2%  18.0% 649.3KiB std
4.4%  15.2% 549.3KiB reqwest
2.3%   8.0% 288.5KiB clap
1.8%   6.0% 218.2KiB h2
1.7%   5.7% 207.3KiB regex
1.5%   5.3% 191.1KiB regex_syntax
1.1%   3.8% 137.5KiB tokio
1.0%   3.4% 124.2KiB hyper
1.0%   3.4% 121.6KiB tracing_subscriber
0.8%   2.9% 103.4KiB onvif
0.7%   2.5%  88.8KiB schema
0.7%   2.3%  84.6KiB regex_automata
0.7%   2.3%  83.4KiB static_xml
0.6%   2.1%  74.3KiB aho_corasick
0.5%   1.7%  62.7KiB xml
0.4%   1.4%  51.4KiB http
0.4%   1.3%  46.1KiB url
0.4%   1.2%  44.5KiB num_bigint
0.3%   1.0%  36.8KiB idna
0.3%   1.0%  36.3KiB chrono
2.5%   8.8% 316.2KiB And 63 more crates. Use -n N to show more.
29.0% 100.0%   3.5MiB .text section size, the file size is 12.1MiB
```
So basically this is smaller on nightly, and I can use a feature flag to make it work on stable at the cost of bloat. This is sounding promising.
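A minimal sketch of the laundering trick (mine, not the actual 520af8a code): the vtable stores a `fn() -> &'static Vtable` instead of a `&'static Vtable`, so const evaluation never has to resolve the reference cycle; the lookup happens at call time.

```rust
struct Vtable {
    name: &'static str,
    // A function pointer defers the cross-reference to call time,
    // breaking the cycle that direct &'static references would create.
    child: fn() -> &'static Vtable,
}

static TRANSPORT: Vtable = Vtable { name: "Transport", child: tunnel_vtable };
static TUNNEL: Vtable = Vtable { name: "Tunnel", child: transport_vtable };

// These shims "launder" the constants through functions.
fn tunnel_vtable() -> &'static Vtable { &TUNNEL }
fn transport_vtable() -> &'static Vtable { &TRANSPORT }

fn main() {
    assert_eq!((TRANSPORT.child)().name, "Tunnel");
    // The cycle is traversable at runtime.
    assert_eq!(((TRANSPORT.child)().child)().name, "Transport");
}
```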
Major caveat: this code doesn't actually work yet. I'm going to address that next...but I don't expect it's broken in a way that totally invalidates the numbers above. And those numbers can be improved; I can shrink the tables themselves, and I can use the tables to eliminate the serialization bloat also.
Grr. I'm having a tough time with this. Making it (mostly) work did increase binary size; there were some missing bits that apparently cause significant bloat. What's more, I'm having a tough time understanding where the bloat is. It's not in the `.text` section, so `cargo bloat` doesn't really help. My current table-driven code has shrunk code size by 100 KiB but file size by only 30 KiB. It doesn't seem worth the `unsafe` and complexity of the approach so far.
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ bloaty camera-debloat-0c65eb4 -- camera-main-ca51ec1
    FILE SIZE        VM SIZE
 --------------  --------------
  +6.6% +31.2Ki  +6.7% +31.2Ki    .data.rel.ro
  +3.5% +26.5Ki  [ = ]       0    .symtab
  +3.6% +23.5Ki  +3.6% +23.5Ki    .rela.dyn
  +0.9% +7.55Ki  +0.9% +7.55Ki    .rodata
  +1.5% +1.54Ki  +1.5% +1.54Ki    .eh_frame_hdr
   +11%    +768  [ = ]       0    [Unmapped]
  -0.1%    -536  -0.1%    -536    .eh_frame
  -6.5% -1.08Ki  -6.5% -1.08Ki    .got
  -0.4% -9.12Ki  [ = ]       0    .strtab
  -6.1% -9.81Ki  -6.1% -9.81Ki    .gcc_except_table
  -2.6%  -101Ki  -2.6%  -101Ki    .text
  -0.2% -30.6Ki  -0.7% -48.8Ki    TOTAL
```
The obvious question is: are the tables taking up a lot of space? It's a little awkward to measure how big they are, so the easiest thing I can do is make them even bigger and measure how much the file grows. Here I've added 256 bytes to the vtable for each type and 32 bytes to the vtable for each field:
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ bloaty camera-debloat-0c65eb4-vtable-ballast-256-field-ballast-32 -- camera-debloat-0c65eb4
    FILE SIZE        VM SIZE
 --------------  --------------
   +11% +55.0Ki   +11% +55.0Ki    .data.rel.ro
  +0.1% +2.92Ki  [ = ]       0    .strtab
  +0.0% +1.11Ki  +0.0% +1.11Ki    .text
  +0.0%    +184  +0.0%    +184    .eh_frame
  +0.0%    +144  [ = ]       0    .symtab
  +0.0%    +120  +0.0%    +120    .rela.dyn
  +0.0%     +32  +0.0%     +32    .eh_frame_hdr
  +0.1%     +16  +0.1%     +16    .got
  +0.0%     +16  +0.0%     +16    .rodata
  -0.7%      -3  [ = ]       0    .shstrtab
  -6.3%    -480  [ = ]       0    [Unmapped]
  +0.5% +59.1Ki  +0.9% +56.5Ki    TOTAL
```
So the tables basically are all in `.data.rel.ro` (which makes sense) and are less than half of the problem. I have ideas to shrink them, but it doesn't seem like that should be my priority. I think it's more about the total number of symbols increasing the size of `.strtab` and `.rela.dyn`. (Possibly also `.rodata` and `.eh_frame_hdr`, but those are less significant.)
```text
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-main-ca51ec1 | wc -l
31160
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-debloat-0c65eb4 | wc -l
32286
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-main-ca51ec1 | fgrep schema | wc -l
3397
[slamb@slamb-workstation ~/git/crates/onvif-rs]$ nm camera-debloat-0c65eb4 | fgrep schema | wc -l
2219
```
I wish Rust had associated statics; then I wouldn't need all the vtable shims added as described in the previous comments. I could also generate one big TypeId-to-vtable lookup at compile time for all the known types in a code generator that sees the whole schema at once, but not in a proc macro.
I also wish I could get rid of all these `push_value` and `finalize_field` methods, but due to Rust's clever layout of `Option`, I'm having trouble coming up with an alternative. Maybe I can restructure things so that the `Option` methods are broken out and get dropped as unused if they're not actually referenced. That'd help a lot for some schemas and maybe even make things worse for others.
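A quick demonstration (mine) of the "clever layout" in question: `Option` around a pointer-like type uses the null niche rather than a separate discriminant, so generic field-handling code can't assume one uniform tag-plus-payload layout across field types.

```rust
use std::mem::size_of;

fn main() {
    // Niche optimization: None is encoded as the null pointer, so
    // Option<&T> and Option<Box<T>> are pointer-sized.
    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
    assert_eq!(size_of::<Option<Box<u8>>>(), size_of::<Box<u8>>());
    // A plain u64 has no niche, so Option<u64> needs space for a tag.
    assert!(size_of::<Option<u64>>() > size_of::<u64>());
}
```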
Maybe it's worth combining `push_value` and `finalize_field` into one method that receives a command to execute via an enum parameter, although that's super ugly.
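The "super ugly" idea might look roughly like this sketch (names are illustrative, not from static-xml): each field type monomorphizes one `execute` symbol instead of two, trading a branch for a symbol-table entry.

```rust
// One command enum replaces two separate per-field methods.
enum FieldCommand {
    Push(String),
    Finalize,
}

// Stand-in for an optional single-valued field's deserialization state.
struct OptionalField(Option<String>);

impl OptionalField {
    fn execute(&mut self, cmd: FieldCommand) -> Result<(), &'static str> {
        match cmd {
            FieldCommand::Push(v) => {
                if self.0.is_some() {
                    return Err("duplicate element");
                }
                self.0 = Some(v);
                Ok(())
            }
            // An optional field is fine when absent at finalize time.
            FieldCommand::Finalize => Ok(()),
        }
    }
}

fn main() {
    let mut f = OptionalField(None);
    f.execute(FieldCommand::Push("RTSP".to_string())).unwrap();
    assert!(f.execute(FieldCommand::Push("UDP".to_string())).is_err());
    f.execute(FieldCommand::Finalize).unwrap();
    assert_eq!(f.0.as_deref(), Some("RTSP"));
}
```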
I suppose I could drop those symbols entirely if I didn't support `std::option::Option` fields but instead my own:
```rust
#[repr(C)]
enum MyOption<T> {
    None = 0,
    Some(T) = 1,
}
```
but I'd like to make these types feel normal.
There's also some bloat from `alloc::raw_vec::RawVec<T,A>::reserve_for_push`. I'm surprised these aren't getting inlined into `push_value`, and that they get monomorphized rather than taking the `Layout` as a parameter. I think I can eliminate these with some very ugly unsafe code, although I still need a per-type wrapper because the docs say "the order of [`Vec`'s] fields is completely unspecified", unless I assume that `Vec<T>` and `Vec<Y>` have the same layout for any `T` and `Y`. Basically I need to call `into_raw_parts`, do the reallocation myself with the `std::alloc` API, then call `from_raw_parts` to convert it back to a `Vec`.
I'm tempted to actually try changing the `std` library's `RawVec` to take the `Layout` as a parameter and see what it does to Rust code size and speed in general (not just for my crate).
I played with this a bit, and it's probably a dead end. `RawVec::reserve_for_push` calls `RawVec::grow_amortized`, which calls `set_ptr` at the end. `set_ptr` divides the allocated length by the size of `T`. It looks like allocators return the actual allocated size, which may be larger than the requested size, and this allows the `Vec` to take advantage of the excess. Anyway, division is slow, but division by a constant like 8 can be fast, so the monomorphization is accomplishing something. I tried inlining the division into the caller, but the results didn't look good. I could try removing this optimization (using only the requested portion), but that probably wouldn't be great either. It seems possible to depend only on the size, not the full type, but I'm not sure how many of these types are actually the same size anyway.
Try out the approach to reduce the per-type overhead of these methods described in the README:

> Currently `static-xml-derive` writes explicit generated code. E.g., with this type: […] the `Deserialize` macro produces code roughly as follows: […]
>
> I believe this is close to the minimal size with this approach. Next I'd like to experiment with a different approach in which the `Visitor` impl is replaced with a table that holds the offset within `FooVisitor` of each field, and a pointer to an `element` function. The generated code would use `unsafe`, but soundness only has to be proved once in the generator, and this seems worthwhile if it can achieve significant code size reduction.
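The table-driven `Visitor` replacement might look roughly like this sketch (mine; names are illustrative, and it uses `std::mem::offset_of!`, stable since Rust 1.77, where the issue's timeframe would have needed `memoffset`). One generic dispatcher walks a per-type table of field offsets and type-erased element functions, instead of per-type generated match arms:

```rust
use std::mem::offset_of;

struct FooVisitor {
    protocol: Option<String>,
    tunnel: Vec<String>,
}

// Per-field table entry: element name, offset within the visitor struct,
// and a type-erased handler for that field's storage.
struct FieldEntry {
    name: &'static str,
    offset: usize,
    element: unsafe fn(*mut u8, String),
}

// Safety: callers must pass a pointer to the field type each fn expects.
unsafe fn set_option(field: *mut u8, text: String) {
    unsafe { *(field as *mut Option<String>) = Some(text) };
}

unsafe fn push_vec(field: *mut u8, text: String) {
    unsafe { (*(field as *mut Vec<String>)).push(text) };
}

// Built entirely at compile time; lives in read-only data.
static FOO_FIELDS: [FieldEntry; 2] = [
    FieldEntry { name: "Protocol", offset: offset_of!(FooVisitor, protocol), element: set_option },
    FieldEntry { name: "Tunnel", offset: offset_of!(FooVisitor, tunnel), element: push_vec },
];

// One shared dispatcher replaces the per-type Visitor impl.
fn handle_element(visitor: &mut FooVisitor, name: &str, text: String) {
    let base = visitor as *mut FooVisitor as *mut u8;
    for entry in &FOO_FIELDS {
        if entry.name == name {
            unsafe { (entry.element)(base.add(entry.offset), text) };
            return;
        }
    }
}

fn main() {
    let mut v = FooVisitor { protocol: None, tunnel: Vec::new() };
    handle_element(&mut v, "Protocol", "RTSP".to_string());
    handle_element(&mut v, "Tunnel", "inner".to_string());
    assert_eq!(v.protocol.as_deref(), Some("RTSP"));
    assert_eq!(v.tunnel, ["inner"]);
}
```

The soundness argument lives in one place: the generator must emit offsets and element functions that agree with the visitor's field types, which is exactly the "proved once in the generator" property described above.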