ruby / prism

Prism Ruby parser
https://ruby.github.io/prism/
MIT License
790 stars 134 forks source link

Lots and lots of things #2924

Closed kddnewton closed 5 days ago

kddnewton commented 5 days ago

This PR does a lot of stuff, because it was all interconnected and it was easier to do as a group.

eregon commented 5 days ago

The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.

Great :tada: (I remember adapting those tests when reordering fields wasn't fun)

Expose flags on every node type — every node can have common flags (newline or static_literal).

Mmh, this will increase the serialized size probably significantly. Although it probably won't increase the Java node objects size (which already had a boolean for newline).

https://github.com/ruby/prism/issues/1531 has pretty limited value for Java, I was mostly thinking of it as a Ruby-specific convenience (since only Ruby defines these #value methods).

OTOH it's likely nice to have newline in Java, but not sure how it will play with the existing support there. This newline flag means "potential newline" but can occur multiple times per line, right? If so it still needs some kind of post-processing, so it might be better to give a more explicit name to avoid incorrect usages.

node_id is now exposed in every API for every node.

Mmh, I wonder if we need it for Java and in the serialized output.

However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example implicit ordering).

eregon commented 5 days ago

Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes.

Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

A quick repro to show what I mean:

#include <stdio.h>

struct A {
  int a;
  // int nodeID;
};

struct B {
  struct A a;
  int b;
  int* ptr;
};

int main(int argc, char const *argv[]) {
  printf("%ld\n", sizeof(struct B));
  return 0;
}

So without nodeID, B is 16 bytes, and with nodeID B is 24 bytes. (adding int* p1; at the start of B still gives a 8 bytes increase for nodeID)

kddnewton commented 5 days ago

Mmh, this will increase the serialized size probably significantly.

I'm not sure this is true. For most files it will probably add 1 byte for every node that didn't already have flags on it. But calls, strings, integers, parameters, etc. all already had flags on them, and now they're part of the same integer.

This newline flag means "potential newline" but can occur multiple times per line, right?

Yes, it matches the FL_NEWLINE flag from CRuby. It indicates that the node could potentially emit a :line event in tracepoint, depending on the order the compiler assigns the lines.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example https://github.com/ruby/prism/issues/2383#issuecomment-1934920712).

It's a combination of a lot of things. Here is the code in Rails that's relevant: https://github.com/rails/rails/blob/4fa56814f18fd3da49c83931fa773caa727d8096/actionview/lib/action_view/template.rb#L232-L248.

Every instruction in YARV keeps around an associated node_id. So I don't want to go the route of making another pass through the AST in order to assign them, because that would mean an entire other pass before we get to compile. If we instead did it while compiling, then the node id order is entirely dependent on the compiler, and we'd need to replicate that some how in order to find the next node.

I don't love the feature, but at this point it's one of the last things blocking us getting merged into CRuby, so I'd like to get this moving.

Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

Yeah it's the case for all nodes. All nodes have 16 bytes for type, 16 bytes for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Mmh, I wonder if we need it for Java and in the serialized output.

Because we're exposing node_id in the Ruby API, we have to have it in the serialization API because of our FFI backend, otherwise it wouldn't work through the gem on JRuby/TruffleRuby. So it's got to be there either way. I'm fine removing it from the Java nodes though if it's not going to be used.

I feel the node_id stuff in CRuby was done in a rather rushed way, I think it would be worth to gather the requirements, design a proper API for this in Prism, and clarify what information the Ruby implementations need to keep and provide.

I think it would be good to go through this exercise, but I'd really like to get prism merged sooner rather than later, so I'd like to do this after. I don't think it will be a problem to refactor this later, as it's basically a continuation of what is already there.

For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

eregon commented 5 days ago

Thank you for 23f3a1dd17e031b5efa77909c60c23b3b2c74ef0, I think it's the best trade-off for now. I understand for now you need node_id for CRuby and I wouldn't want to delay that either.

I will try to take a look at the Rails use-case to understand it better and whether it could be done another way.

eregon commented 5 days ago

Could you clarify this part:

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

What do you mean by moving? Is it that ERB adds some prelude? Or that ERB doesn't always compile an (unchanged) template to the same generated Ruby code? Or that the template changes and then it can still works if the modification is very minor?

eregon commented 5 days ago

Yeah it's the case for all nodes. All nodes have 16 bytes for type, 16 bytes for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Ah right, https://github.com/ruby/prism/blob/ad5efb260889af3ee2245e63142db2652d182f53/templates/include/prism/ast.h.erb#L126-L150 so there are pointers inside pm_node_t (via pm_location_t) and I forgot that C makes the whole size of the struct a multiple of its widest scalar member (i.e. a pointer here, so multiple of 8 bytes and there was 4 byte padding before this change).

If the location becomes a pair of uint32_t then the footprint increase would be visible, and I could repro that in https://gist.github.com/eregon/e6cd83df15af5e0338982b46d26a7be7 where pm_constant_read_node goes from 16 to 20 bytes. But that's not the current situation and indeed printing a few nodes, the sizeof() seems the same before & after this change, nice.