Feature proposal: annotate source with token/character counts for functions

sparr commented 1 year ago

I want to add per-function token and character counts, pre and post minification, to https://github.com/sparr/pico8lib. My initial idea was to use a script to parse the functions out of a source file, then feed them separately to shrinko8 -c, parse the output, prepend it as a comment to the function, then reassemble the source file. However, it occurs to me that a lot of the necessary functionality to do this better already exists in this project.

My proposal is a mode for shrinko8 that does all of this directly. Parse a cart, identify all the functions, count their tokens and characters, minify them, count those tokens and characters, insert a comment into the original cart above the function (possibly before existing comments, and overwriting a previous comment from this process) and emit the annotated original cart.

I'm willing to do most of the work to make this happen, but I find the code here a bit daunting. I've made it to process_code where it calls tokenize then parse, and I can look through Nodes under root, but I am not sure how to get back from a Node to tokens that count_tokens wants, or how to turn a Node back into a cart after I've modified some of its grandchildren. Generally moving back and forth between the source, tokens, and nodes, in a way that keeps them mapped to each other for modifying one based on the other, is a mystery to me so far. I would appreciate any pointers you can share before I get started on this. Even if this isn't a welcome feature for the main repo, I'd still like to pursue it as a fork or patch so I can use it myself.

thisismypassport commented 1 year ago

Currently, the minification steps modify both the node tree and tokens array. I'm not too happy about this either, and will probably improve this as part of a branch I'm (potentially) working on.

Either way, you can traverse over the tokens in a node tree via traverse_tokens and add them to your own array, then count that.

I'm not sure how easy it is to split the code and minify each function separately - especially if you want to do renaming, as that's an inherently global operation. It might be better to minify the whole thing at once.

Currently, minify_code converts a node tree back to source at the end - using two approaches depending on whether minify_wspace was requested. Splitting this logic into a separate function would definitely be nice. (I may also do this in my branch)

However, it sounds like you don't want to output minified code, just lightly modified code. The logic to do so would likely be somewhat similar (in structure) to the bottom else case of minify_code (minus the complications) - traverse over tokens, append their contents and the wspace between them, except with changes to the comments as needed. (The comments are part of the wspace between tokens and aren't captured explicitly in the node tree or the token list)

sparr commented 1 year ago

Below are a simple cart I'm testing with and a jsonpickled representation of root. I can see the token that represents the function keyword, but I can't see any obvious way to capture or manipulate the whole function definition (from function to end). It feels like I'm missing some layer of the puzzle here.

pico-8 cartridge // http://www.pico-8.com
version 18
__lua__
-- comment
function funcname()
 return true
end

{
  "children": [ {
    "body": { "py/id": 25 },
    "children": [
      {
        "children": { "py/tuple": [] },
        "endidx": 19,
        "idx": 11,
        "modified": false,
        "parent": { "py/id": 2 },
        "py/object": "pico_tokenize.Token",
        "source": {
          "cart": {
            "code": "-- comment\nfunction funcname()\n return true\nend",
            "code_map": [ {
              "py/newargs": { "py/tuple": [ { "py/tuple": [ 0, "test.p8", "-- comment\nfunction funcname()\n return true\nend", 0, 3 ] } ] },
              "py/object": "pico_cart.CodeMapping",
              "py/seq": [ 0, "test.p8", "-- comment\nfunction funcname()\n return true\nend", 0, 3 ]
            } ],
            "code_rom": null,
            "meta": { "default_factory": { "py/type": "builtins.list" }, "py/object": "collections.defaultdict" },
            "name": "test.p8",
            "platform": "l",
            "py/object": "pico_cart.Cart",
            "rom": { "py/object": "pico_defs.Memory" },
            "screenshot": null,
            "version_id": 18,
            "version_tuple": { "py/tuple": [ 0, 1, 23, 0 ] }
          },
          "mappings": { "py/id": 9 },
          "py/object": "pico_process.CartSource"
        },
        "type": { "py/object": "pico_tokenize.TokenType", "value": 3 },
        "value": "function",
        "vline": 1
      }, {
        "children": [ {
          "children": { "py/tuple": [] },
          "endidx": 28,
          "idx": 20,
          "modified": false,
          "parent": { "py/id": 12 },
          "py/object": "pico_tokenize.Token",
          "source": { "py/id": 6 },
          "type": { "py/object": "pico_tokenize.TokenType", "value": 2 },
          "value": "funcname",
          "vline": 1
        } ],
        "endidx": 28,
        "idx": 20,
        "kind": { "py/object": "pico_parse.VarKind", "value": 1 },
        "name": "funcname",
        "new": false,
        "parent": { "py/id": 2 },
        "parent_scope": {
          "depth": -1,
          "funcdepth": 0,
          "items": { "_ENV": { "implicit": true, "keys_kind": null, "name": "_ENV", "py/object": "pico_parse.Local", "scope": { "py/id": 19 } } },
          "parent": null,
          "py/object": "pico_parse.Scope"
        },
        "py/object": "pico_parse.Node",
        "scopespec": null,
        "source": { "py/id": 6 },
        "type": { "py/object": "pico_parse.NodeType", "value": 0 },
        "value": null,
        "var": { "keys_kind": null, "name": "funcname", "py/object": "pico_parse.Global" },
        "var_kind": null
      }, {
        "children": { "py/tuple": [] },
        "endidx": 29,
        "idx": 28,
        "modified": false,
        "parent": { "py/id": 2 },
        "py/object": "pico_tokenize.Token",
        "source": { "py/id": 6 },
        "type": { "py/object": "pico_tokenize.TokenType", "value": 4 },
        "value": "(",
        "vline": 1
      }, {
        "children": { "py/tuple": [] },
        "endidx": 30,
        "idx": 29,
        "modified": false,
        "parent": { "py/id": 2 },
        "py/object": "pico_tokenize.Token",
        "source": { "py/id": 6 },
        "type": { "py/id": 23 },
        "value": ")",
        "vline": 1
      }, {
        "children": [ {
          "children": [
            {
              "children": { "py/tuple": [] },
              "endidx": 38,
              "idx": 32,
              "modified": false,
              "parent": { "py/id": 27 },
              "py/object": "pico_tokenize.Token",
              "source": { "py/id": 6 },
              "type": { "py/id": 5 },
              "value": "return",
              "vline": 2
            }, {
              "children": [ {
                "children": { "py/tuple": [] },
                "endidx": 43,
                "idx": 39,
                "modified": false,
                "parent": { "py/id": 30 },
                "py/object": "pico_tokenize.Token",
                "source": { "py/id": 6 },
                "type": { "py/id": 5 },
                "value": "true",
                "vline": 2
              } ],
              "endidx": 43,
              "idx": 39,
              "parent": { "py/id": 27 },
              "py/object": "pico_parse.Node",
              "scopespec": null,
              "source": { "py/id": 6 },
              "token": { "py/id": 32 },
              "type": { "py/object": "pico_parse.NodeType", "value": 3 },
              "value": null
            }
          ],
          "endidx": 43,
          "idx": 32,
          "items": [ { "py/id": 30 } ],
          "parent": { "py/id": 25 },
          "py/object": "pico_parse.Node",
          "scopespec": null,
          "source": { "py/id": 6 },
          "type": { "py/object": "pico_parse.NodeType", "value": 24 },
          "value": null
        } ],
        "endidx": 43,
        "idx": 32,
        "parent": { "py/id": 2 },
        "py/object": "pico_parse.Node",
        "scopespec": { "py/tuple": [ false, [] ] },
        "source": { "py/id": 6 },
        "stmts": [ { "py/id": 27 } ],
        "type": { "py/object": "pico_parse.NodeType", "value": 29 },
        "value": null
      }, {
        "children": { "py/tuple": [] },
        "endidx": 47,
        "idx": 44,
        "modified": false,
        "parent": { "py/id": 2 },
        "py/object": "pico_tokenize.Token",
        "source": { "py/id": 6 },
        "type": { "py/id": 5 },
        "value": "end",
        "vline": 3
      }
    ],
    "endidx": 47,
    "idx": 11,
    "kind": null,
    "name": "funcname",
    "params": [],
    "parent": { "py/id": 0 },
    "py/object": "pico_parse.Node",
    "scopespec": { "depth": 1, "funcdepth": 1, "items": {}, "parent": { "py/id": 19 }, "py/object": "pico_parse.Scope" },
    "source": { "py/id": 6 },
    "target": { "py/id": 12 },
    "type": { "py/object": "pico_parse.NodeType", "value": 15 },
    "value": null
  } ],
  "endidx": 47,
  "globals": { "__dict__": {}, "funcname": { "py/id": 18 }, "py/object": "utils.LazyDict" },
  "idx": 11,
  "parent": null,
  "py/object": "pico_parse.Node",
  "scopespec": { "py/tuple": [ false, [] ] },
  "source": { "py/id": 6 },
  "stmts": [ { "py/id": 2 } ],
  "type": { "py/id": 36 },
  "value": null
}

thisismypassport commented 1 year ago

Assuming I understood what you're asking for correctly, the idx and endidx attributes specify the range of the function's tokens within the source (e.g. node.source.text[node.idx : node.endidx])

sparr commented 1 year ago

OK, so, the Token with "value":"function" is the literal function in the source, spanning idx 11-19.

Its parent is this Node:

{
    "body": { "py/id": 25 },
    "children": [ "CUT" ],
    "endidx": 47,
    "idx": 11,
    "kind": null,
    "name": "funcname",
    "params": [],
    "parent": { "py/id": 0 },
    "py/object": "pico_parse.Node",
    "scopespec": { "depth": 1, "funcdepth": 1, "items": {}, "parent": { "py/id": 19 }, "py/object": "pico_parse.Scope" },
    "source": { "py/id": 6 },
    "target": { "py/id": 12 },
    "type": { "py/object": "pico_parse.NodeType", "value": 15 },
    "value": null
  }

which spans idx 11-47. That seems like everything from function to end. That Node has "kind": null and "name": "funcname" (because it's a function with that name, not to be confused with the sibling of that earlier token that has "value": "funcname" and is the literal funcname from the source). It also has "scopespec": { "depth": 1, "funcdepth": 1, ... which seems to describe some stuff about the function. Is there any way to directly identify that this Node represents a function, or do I need to duck type it based on the presence of .scopespec.funcdepth or similar?

The parent of that Node is another Node which has the same idx/endidx but different properties. I think it might represent the whole source, minus any leading or trailing comments, although I'll need to experiment more to figure that out.

sparr commented 1 year ago

Aha! "type": { "py/object": "pico_parse.NodeType", "value": 15 },

pico_parse.NodeType.values[15] is "function"!

thisismypassport commented 1 year ago

Yeah, node.type == NodeType.function

Yeah, the parent of the function is the root node, it indeed spans the whole source minus any leading/trailing whitespace/comments

thisismypassport / shrinko8

Feature proposal: annotate source with token/character counts for functions #12