pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0

JSON output #63

Closed by Aloso 1 year ago

Aloso commented 1 year ago

Is your feature request related to a problem? Please describe.

Tools such as IDE plugins need Pomsky's output in a machine-readable format. There are multiple ways to achieve this:

  1. Publish a native library that other tools can dynamically link to
  2. Let tool authors create the bindings they need themselves by using the pomsky crate and distribute it as part of their tool
  3. Make the CLI more powerful, so it can be used by tools more effectively

The solution I'm proposing is number 3 since I think it offers the best experience both for users and tool authors: Users need to have the CLI installed, but they don't have to download anything else, and tool authors don't need to update their tools every time a new Pomsky version is released.

Describe the solution you'd like

Add a --json flag to print all compilation results as JSON rather than plain text. The JSON is written to stdout on a single line. It's an object with the following structure:

{
  version: "1"  // schema version
  success: bool  // true if no errors occurred
  output?: string  // compiled regex, only present if the compilation was successful
  diagnostics: object[]  // array of errors and warnings
    severity: "error" | "warning"
    kind: string  // e.g. "parse" for parse errors or "compat" for compatibility warnings
    code?: string  // a 4-digit error code starting with "P", e.g. "P0105"
    spans: object[]  // source code locations that should be underlined
      // initially this array will always contain exactly one object
      start: int  // start byte offset, inclusive
      end: int  // end byte offset, exclusive
      label?: string  // optional additional information about this span
    description: string  // explanation of the error or warning
    help: string[]  // additional information to help the user fix the issue
    fixes: object[]  // "quick fixes", automatic source code transformations; may be displayed as a light bulb
      // since this feature isn't implemented in Pomsky yet, this array will always be empty initially
      description: string  // text to display 
      replacements: object[]  // array of source code locations that need to be modified
        start: int  // start byte offset, inclusive
        end: int  // end byte offset, exclusive
        insert: string  // text to replace the source code location with
  timings: object
    all: number  // microseconds; time it took to compile, not including argument parsing and disk I/O
}
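For tool authors, the structure above could be mapped onto typed objects. The following is a minimal Python sketch, not part of Pomsky: the class names, the `parse_result` helper and the `compile_micros` field name are hypothetical, and the `fixes` array is left out because it is always empty for now.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    start: int  # start byte offset, inclusive
    end: int  # end byte offset, exclusive
    label: Optional[str] = None


@dataclass
class Diagnostic:
    severity: str  # "error" or "warning"
    kind: str
    spans: List[Span]
    description: str
    help: List[str] = field(default_factory=list)
    code: Optional[str] = None


@dataclass
class CompileResult:
    version: str
    success: bool
    diagnostics: List[Diagnostic]
    output: Optional[str] = None  # only present on success
    compile_micros: Optional[float] = None  # hypothetical name for timings.all


def parse_result(line: str) -> CompileResult:
    """Parse one line of --json output into typed objects."""
    raw = json.loads(line)
    if raw["version"] != "1":
        raise ValueError(f"unsupported schema version: {raw['version']!r}")
    diagnostics = [
        Diagnostic(
            severity=d["severity"],
            kind=d["kind"],
            code=d.get("code"),
            spans=[Span(s["start"], s["end"], s.get("label")) for s in d["spans"]],
            description=d["description"],
            help=list(d.get("help", [])),
        )
        for d in raw["diagnostics"]
    ]
    return CompileResult(
        version=raw["version"],
        success=raw["success"],
        output=raw.get("output"),
        diagnostics=diagnostics,
        compile_micros=raw.get("timings", {}).get("all"),
    )
```

Checking `version` first lets a tool fail loudly if the schema changes, instead of misreading fields.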

Example output when running pomsky '\b' --json (pretty-printed here for readability; the actual output is on a single line):

{
  "version": "1",
  "success": false,
  "diagnostics": [
    {
      "severity": "error",
      "kind": "syntax",
      "code": "P0003",
      "spans": [
        { "start": 0, "end": 2 }
      ],
      "description": "Backslash escapes are not supported",
      "help": ["Replace `\\b` with `%` to match a word boundary"],
      "fixes": []
    }
  ],
  "timings": { "all": 40 }
}
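To illustrate how a tool might consume such a result, here is a rough Python sketch that turns the diagnostics into a text report with the span underlined. The `underline` and `render` helpers and the `SAMPLE` value are made up for this illustration. The sketch assumes a single-line source, assumes one terminal column per code point, and follows the schema in treating `help` as an array of strings.

```python
import json


def underline(source: str, start: int, end: int) -> str:
    """Return the source followed by a caret line marking bytes [start, end)."""
    data = source.encode("utf-8")
    prefix = data[:start].decode("utf-8")  # text before the span
    marked = data[start:end].decode("utf-8")  # text inside the span
    return source + "\n" + " " * len(prefix) + "^" * max(len(marked), 1)


def render(source: str, json_line: str) -> str:
    """Format every diagnostic in one line of --json output as plain text."""
    result = json.loads(json_line)
    lines = []
    for d in result["diagnostics"]:
        code = f"[{d['code']}] " if "code" in d else ""
        lines.append(f"{d['severity']} {code}({d['kind']}): {d['description']}")
        for span in d["spans"]:
            lines.append(underline(source, span["start"], span["end"]))
        for hint in d["help"]:
            lines.append(f"help: {hint}")
    return "\n".join(lines)


# Sample result mirroring the example above, built with json.dumps so that
# the backslash is escaped correctly.
SAMPLE = json.dumps({
    "version": "1",
    "success": False,
    "diagnostics": [{
        "severity": "error",
        "kind": "syntax",
        "code": "P0003",
        "spans": [{"start": 0, "end": 2}],
        "description": "Backslash escapes are not supported",
        "help": ["Replace `\\b` with `%` to match a word boundary"],
        "fixes": [],
    }],
    "timings": {"all": 40},
})
```

Calling `print(render(r"\b", SAMPLE))` prints the error line, the source `\b` with `^^` beneath it, and the help text.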

Byte offsets are zero-indexed, e.g. the first byte has the span { "start": 0, "end": 1 }.

The input is always treated as UTF-8, so the characters in the string a💩ø have the spans { "start": 0, "end": 1 }, { "start": 1, "end": 5 } and { "start": 5, "end": 7 }.
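The spans above can be reproduced with a few lines of Python. This is only an illustration of how UTF-8 byte offsets line up with characters; `char_byte_spans` is a hypothetical helper, not something Pomsky provides.

```python
def char_byte_spans(text: str):
    """Return the UTF-8 byte span of each code point in text."""
    spans = []
    offset = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
        spans.append({"start": offset, "end": offset + width})
        offset += width
    return spans


print(char_byte_spans("a💩ø"))
# [{'start': 0, 'end': 1}, {'start': 1, 'end': 5}, {'start': 5, 'end': 7}]
```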

Describe alternatives you've considered

There are other formats than JSON, like XML and MessagePack, but they're less widely supported, and JSON should be fast enough.

Instead of reporting byte offsets, Pomsky could return row and column positions. However, that requires more effort and may not even suit every tool, since different IDEs have different plugin interfaces. Furthermore, UTF-8 has an exact definition, whereas rows and columns aren't well defined in the context of Unicode.
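A tool that does need rows and columns can derive them from a byte offset itself. The Python sketch below uses one possible definition (zero-based rows split on `\n`, columns counted in code points); other tools may count UTF-16 code units or grapheme clusters instead, which is exactly the ambiguity described above. `byte_offset_to_row_col` is a hypothetical helper name.

```python
def byte_offset_to_row_col(source: str, byte_offset: int):
    """Map a UTF-8 byte offset to a zero-based (row, column) pair.

    Rows are separated by "\n" and columns count code points; this is an
    assumption of this sketch, not something Pomsky defines. An offset that
    falls inside a multi-byte character raises UnicodeDecodeError.
    """
    prefix = source.encode("utf-8")[:byte_offset].decode("utf-8")
    row = prefix.count("\n")
    # rfind returns -1 when there is no newline, so +1 gives 0 for the first row.
    col = len(prefix) - (prefix.rfind("\n") + 1)
    return row, col
```

For the source `"a💩\nøb"`, byte offset 8 (the start of `b`) maps to row 1, column 1.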

Future possibilities

We can keep the CLI running in the background, so tools can write to stdin and get a response at stdout, without having to spawn a process each time.

We can add a --check flag that returns diagnostics but doesn't actually compile the expression, which should be faster. This could then be used to report errors on every keystroke.

lppedd commented 1 year ago

Will have to check if IntelliJ supports the way you plan on reporting offsets.
I'll probably need to find a way to map the byte offsets to editor offsets.

Why not use a UTF-8 char offset instead of byte offset?

Aloso commented 1 year ago

Because it's expensive to convert, and I'm not even sure this would be more useful for everyone. Some tools using the JSON output might need UTF-16 code units instead, or grapheme clusters, or rows and columns. The conversion to code point indices would be wasted when users have to convert them to something else again.
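For reference, converting on the consumer side is short. This Python sketch maps a byte offset to a code point index or a UTF-16 code unit index, whichever the tool needs (platforms with UTF-16 strings, such as the JVM, would want the latter). Both function names are hypothetical, and each call re-encodes the source, so a real plugin would probably cache a lookup table instead.

```python
def byte_to_codepoint_offset(source: str, byte_offset: int) -> int:
    """Number of code points before the given UTF-8 byte offset."""
    return len(source.encode("utf-8")[:byte_offset].decode("utf-8"))


def byte_to_utf16_offset(source: str, byte_offset: int) -> int:
    """Number of UTF-16 code units before the given UTF-8 byte offset."""
    prefix = source.encode("utf-8")[:byte_offset].decode("utf-8")
    # Every UTF-16 code unit is two bytes; astral characters take two units.
    return len(prefix.encode("utf-16-le")) // 2
```

With the string `a💩ø`, byte offset 5 corresponds to code point index 2 but UTF-16 index 3, because 💩 needs a surrogate pair.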

Aloso commented 1 year ago

@lppedd this is now implemented on the master branch. You can try it out by installing Rust and running cargo install --path pomsky-bin. I won't release a new version right away; I first need to write tests for the new functionality.