Lots and lots of things

kddnewton commented 5 days ago

This PR does a lot of stuff, because it was all interconnected and it was easier to do as a group.

Reconfigure error tests
- Previously we were creating syntax trees, comparing them to the recovered tree, and comparing error messages. This was very verbose, and required a lot of manual work.
- The error tests also made heavy use of the DSL module, which I wanted to change for a while. The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.
- The new error tests are all in their own files, and we check them by parsing them again, formatting the errors inline, and asserting that nothing changed. I also added a bin/prism error utility for creating new files of this kind to make it slightly more manageable.
Move location in node initializers — now that the DSL has been changed, it opens up the possibility to reconfigure out node initializers. The location parameter has been moved to the front (after source and node_id) so keep the common fields at the beginning. The actual node initializers should be seen as private API for all intents and purposes, because they will contain the most churn whenever something changes.
Expose flags on every node type — every node can have common flags (newline or static_literal). This was already present in the C and Rust APIs, but wasn't exposed to the Ruby, JS, or Java APIs. This is now present everywhere. This addresses a long-standing issue request (fixes https://github.com/ruby/prism/issues/1531). Now every node has a common static_literal? method that can be used to check if something is a static literal. These flags are now exposed in the inspect output as well.
A utility method Node#breadth_first_search is added for convenience, I've written this out a bunch of times before.
node_id is now exposed in every API for every node. I was very hesitant to add this functionality to prism, because I was worried about memory usage and performance. However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id. I was getting away with it for CRuby by using (line, column) but for Rails it changes based on the ERB being compiled and I don't see how to get it otherwise. Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes. This fixes https://github.com/ruby/prism/issues/2383.

eregon commented 5 days ago

The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.

Great :tada: (I remember adapting those tests when reordering fields wasn't fun)

Expose flags on every node type — every node can have common flags (newline or static_literal).

Mmh, this will increase the serialized size probably significantly. Although it probably won't increase the Java node objects size (which already had a boolean for newline).

https://github.com/ruby/prism/issues/1531 has pretty limited value for Java, I was mostly thinking of it as a Ruby-specific convenience (since only Ruby defines these #value methods).

OTOH it's likely nice to have newline in Java, but not sure how it will play with the existing support there. This newline flag means "potential newline" but can occur multiple times per line, right? If so it still needs some kind of post-processing, so it might be better to give a more explicit name to avoid incorrect usages.

node_id is now exposed in every API for every node.

Mmh, I wonder if we need it for Java and in the serialized output.

However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example implicit ordering).

eregon commented 5 days ago

Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes.

Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

A quick repro to show what I mean:

#include <stdio.h>

struct A {
  int a;
  // int nodeID;
};

struct B {
  struct A a;
  int b;
  int* ptr;
};

int main(int argc, char const *argv[]) {
  printf("%ld\n", sizeof(struct B));
  return 0;
}

So without nodeID, B is 16 bytes, and with nodeID B is 24 bytes. (adding int* p1; at the start of B still gives a 8 bytes increase for nodeID)

kddnewton commented 5 days ago

Mmh, this will increase the serialized size probably significantly.

I'm not sure this is true. For most files it will probably add 1 byte for every node that didn't already have flags on it. But calls, strings, integers, parameters, etc. all already had flags on them, and now they're part of the same integer.

This newline flag means "potential newline" but can occur multiple times per line, right?

Yes, it matches the FL_NEWLINE flag from CRuby. It indicates that the node could potentially emit a :line event in tracepoint, depending on the order the compiler assigns the lines.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example https://github.com/ruby/prism/issues/2383#issuecomment-1934920712).

It's a combination of a lot of things. Here is the code in Rails that's relevant: https://github.com/rails/rails/blob/4fa56814f18fd3da49c83931fa773caa727d8096/actionview/lib/action_view/template.rb#L232-L248.

Every instruction in YARV keeps around an associated node_id. So I don't want to go the route of making another pass through the AST in order to assign them, because that would mean an entire other pass before we get to compile. If we instead did it while compiling, then the node id order is entirely dependent on the compiler, and we'd need to replicate that some how in order to find the next node.

I don't love the feature, but at this point it's one of the last things blocking us getting merged into CRuby, so I'd like to get this moving.

Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

Yeah it's the case for all nodes. All nodes have 16 bytes for type, 16 bytes for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Mmh, I wonder if we need it for Java and in the serialized output.

Because we're exposing node_id in the Ruby API, we have to have it in the serialization API because of our FFI backend, otherwise it wouldn't work through the gem on JRuby/TruffleRuby. So it's got to be there either way. I'm fine removing it from the Java nodes though if it's not going to be used.

I feel the node_id stuff in CRuby was done in a rather rushed way, I think it would be worth to gather the requirements, design a proper API for this in Prism, and clarify what information the Ruby implementations need to keep and provide.

I think it would be good to go through this exercise, but I'd really like to get prism merged sooner rather than later, so I'd like to do this after. I don't think it will be a problem to refactor this later, as it's basically a continuation of what is already there.

For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

eregon commented 5 days ago

Thank you for 23f3a1dd17e031b5efa77909c60c23b3b2c74ef0, I think it's the best trade-off for now. I understand for now you need node_id for CRuby and I wouldn't want to delay that either.

I will try to take a look at the Rails use-case to understand it better and whether it could be done another way.

eregon commented 5 days ago

Could you clarify this part:

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

What do you mean by moving? Is it that ERB adds some prelude? Or that ERB doesn't always compile an (unchanged) template to the same generated Ruby code? Or that the template changes and then it can still works if the modification is very minor?

eregon commented 5 days ago

Yeah it's the case for all nodes. All nodes have 16 bytes for type, 16 bytes for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Ah right, https://github.com/ruby/prism/blob/ad5efb260889af3ee2245e63142db2652d182f53/templates/include/prism/ast.h.erb#L126-L150 so there are pointers inside pm_node_t (via pm_location_t) and I forgot that C makes the whole size of the struct a multiple of its widest scalar member (i.e. a pointer here, so multiple of 8 bytes and there was 4 byte padding before this change).

If the location becomes a pair of uint32_t then the footprint increase would be visible, and I could repro that in https://gist.github.com/eregon/e6cd83df15af5e0338982b46d26a7be7 where pm_constant_read_node goes from 16 to 20 bytes. But that's not the current situation and indeed printing a few nodes, the sizeof() seems the same before & after this change, nice.

ruby / prism

Lots and lots of things #2924