ruby / prism

Prism Ruby parser
https://ruby.github.io/prism/
MIT License
846 stars 142 forks source link

consider using 32-bit offsets for start and end in `yp_location_t` #1566

Open froydnj opened 1 year ago

froydnj commented 1 year ago

YARP provides precise location information for any number of things in its AST, which is great! But each of those locations on a 64-bit platform costs 16 bytes, which means every YARP node (yp_node_t) on those platforms is at least 24 bytes, which is rather large.

For some concrete numbers, nearly half (~47%) of the memory size as measured by YARP::Debug.memsize on a subset of Stripe's codebase is consumed by locations. YARP::Debug.memsize doesn't measure this as-is; the count(s) of locations were obtained from this branch (and note whatever the branch counts needs to be multiplied by 16 to obtain the size):

https://github.com/ruby/yarp/compare/main...froydnj:froydnj-location-counting

so I might have done something wrong.

Using 32-bit offsets (well, probably 31-bit offsets so there's still some way to mark locations as "invalid" or "unavailable") would immediately shrink the memory required by ~25%; I haven't put together a proof-of-concept branch with that change, but my belief is that there would be a performance gain as well, though probably not of the same magnitude. One can also imagine more complicated schemes where nodes hold a 32-bit key into some side table, which would shrink things still more, but one step at a time.

This change would limit the kinds of Ruby source files that YARP could consume. But other Ruby-consuming tools/implementations (Sorbet and--I think--TruffleRuby cc @eregon ) already maintain 32-bit offsets internally. I believe Clang and Go also only store 32-bit offsets for their locations. If Clang and Go can get away with it, I think YARP can too.

eregon commented 1 year ago

Yes, when serialized it's already 32-bit ints and actually in Java it's a 32-bit signed integer, so only 31 bits. Related: https://github.com/ruby/yarp/issues/872#issuecomment-1572518200 and https://github.com/ruby/yarp/issues/1407#issuecomment-1708042576 Also https://github.com/ruby/yarp/pull/1532 for stats on the savings to not have location fields, although that is with varint "compression" so not directly related to the footprint for C structs.

I think it'd be good to store locaitons in C as uint32_t start_offset; uint32_t length, but that is a massive change with huge conflict potential, so I think we'd need @kddnewton to agree to not merge anything from when it's being worked on until it's merged (or abandoned). It's quite a big change because yarp.c in many places use the fact the start is a char* and not just an offset.

kddnewton commented 1 year ago

I assume we're already limited in very non-obvious ways, so I'm not particularly worried about switching to 32-bits. I think this is a good idea.

I disagree with @eregon here I don't think we can halt merging entirely in service of getting this over the finish line, so I'd like to explore other ways to do this incrementally.

I think one way to start would be to modify a single location field on a single node. You would do this by creating a new kind of field called OffsetsField (as opposed to LocationField). Presumably this could be done in the various yp_*_node_create functions where it would subtract from the beginning of the parser. That way we could test out the implications of a single node's fields while still being able to keep the codebase healthy.

kddnewton commented 5 months ago

Officially got confirmation that we can make this happen! Yay! https://bugs.ruby-lang.org/issues/20488#note-1