Open froydnj opened 1 year ago
Yes, when serialized it's already 32-bit ints and actually in Java it's a 32-bit signed integer, so only 31 bits. Related: https://github.com/ruby/yarp/issues/872#issuecomment-1572518200 and https://github.com/ruby/yarp/issues/1407#issuecomment-1708042576 Also https://github.com/ruby/yarp/pull/1532 for stats on the savings to not have location fields, although that is with varint "compression" so not directly related to the footprint for C structs.
I think it'd be good to store locaitons in C as uint32_t start_offset; uint32_t length
, but that is a massive change with huge conflict potential, so I think we'd need @kddnewton to agree to not merge anything from when it's being worked on until it's merged (or abandoned).
It's quite a big change because yarp.c in many places use the fact the start
is a char*
and not just an offset.
I assume we're already limited in very non-obvious ways, so I'm not particularly worried about switching to 32-bits. I think this is a good idea.
I disagree with @eregon here I don't think we can halt merging entirely in service of getting this over the finish line, so I'd like to explore other ways to do this incrementally.
I think one way to start would be to modify a single location field on a single node. You would do this by creating a new kind of field
called OffsetsField
(as opposed to LocationField
). Presumably this could be done in the various yp_*_node_create
functions where it would subtract from the beginning of the parser. That way we could test out the implications of a single node's fields while still being able to keep the codebase healthy.
Officially got confirmation that we can make this happen! Yay! https://bugs.ruby-lang.org/issues/20488#note-1
YARP provides precise location information for any number of things in its AST, which is great! But each of those locations on a 64-bit platform costs 16 bytes, which means every YARP node (
yp_node_t
) on those platforms is at least 24 bytes, which is rather large.For some concrete numbers, nearly half (~47%) of the memory size as measured by
YARP::Debug.memsize
on a subset of Stripe's codebase is consumed by locations.YARP::Debug.memsize
doesn't measure this as-is; the count(s) of locations were obtained from this branch (and note whatever the branch counts needs to be multiplied by 16 to obtain the size):https://github.com/ruby/yarp/compare/main...froydnj:froydnj-location-counting
so I might have done something wrong.
Using 32-bit offsets (well, probably 31-bit offsets so there's still some way to mark locations as "invalid" or "unavailable") would immediately shrink the memory required by ~25%; I haven't put together a proof-of-concept branch with that change, but my belief is that there would be a performance gain as well, though probably not of the same magnitude. One can also imagine more complicated schemes where nodes hold a 32-bit key into some side table, which would shrink things still more, but one step at a time.
This change would limit the kinds of Ruby source files that YARP could consume. But other Ruby-consuming tools/implementations (Sorbet and--I think--TruffleRuby cc @eregon ) already maintain 32-bit offsets internally. I believe Clang and Go also only store 32-bit offsets for their locations. If Clang and Go can get away with it, I think YARP can too.