Optional language annotation for multiline strings

stelcodes commented 1 year ago

Is your feature request related to a problem? Please describe. Within Nix code, there is often large chunks of text consisting of embedded code in multiline strings. These strings are very difficult to syntax highlight properly by a text editor, resulting in large blocks of code with no syntax highlighting and no possibility of other helpful editor features such as linting and autocompletion. This makes writing embedded code tedious. Text editors like Neovim can provide excellent language support for embedded code snippets when it can determine the code type (for example, fenced code blocks in Markdown documents). Having an optional language annotation for multiline strings would make editing embedded code significantly easier.

Describe the solution you'd like An optional language info string after the initial '' for multiline strings like this:

let neovim_config = ''lua
  -- Turn on incsearch
  vim.opt.incsearch = true
'';

Describe alternatives you've considered I don't think there's any other obvious way of implementing this feature.

Additional context Inspiration coming from Markdown's fenced code block's info string which allows embedded programming languages to be properly syntax highlighted.

bew commented 1 year ago

In case you didn't know, the vim-nix plugin has an alternative for Nix with this PR (never merged but used here and there): https://github.com/LnL7/vim-nix/pull/28 This approach feels a bit out of place but does not need language modification, using an inline comment just before the string:

let neovim_config = /* lua */ ''
  -- Turn on incsearch
  vim.opt.incsearch = true
''

stelcodes commented 1 year ago

@bew Oh very cool! Looks like there is a lot of interest and hopefully that gets merged soon. I still think a native language annotation would be a killer feature for Nickel.

yannham commented 1 year ago

For the record, this has been on the back of our head since the introduction of multi-line strings, and that's why we used prefixes like nix-s prefix for symbolic strings instead of just nix, to avoid confusing "this is a Nix symbolic string" with "this is a standard string with Nix code inside" (the latter being what you are proposing here).

At this point, I think the only thing left to do is to bikeshed the prefix of such strings. As for symbolic strings, I would vote for a prefix format which both the lexer and humans can immediately classify as "oh this is a string which contains code to be highlighted as XXX" without even knowing the language XXX, such as xxx-lang (bash-lang or lua-lang for example, instead of just bash or lua). The -lang might not be the best suggestion, it's just to give an example. Using simply lua or bash would be more problematic, because of possible confusion with symbolic strings or other string prefixes (in the future we might want to add for example raw strings with the raw prefix, but what if there's a raw language?).

Your example would become something like:

let neovim_config = lua-lang%"
  -- Turn on incsearch
  vim.opt.incsearch = true
"%

matthew-healy commented 1 year ago

I guess we should probably open a bikeshedding issue for this, but I personally quite like something like lang:{name} as a prefix. So e.g.:

let neovim_config = lang:lua%"
  vim.opt.incsearch = true
"% 
in ...

My reasoning is mostly:

this keeps the information in the tag ordered from most general to most specific,
Using the : separator here rather than - also makes the separation clearer for language names that need to use - for spaces, e.g. lang:obj-c.

The downside to this is that it would also suggest we rethink the current symbolic string prefix format, as using, e.g. nix-s for symbolic Nix strings, but lang:nix for embedded Nix code feels weird. I guess if we decided to do this we could also support a similar s:nix format for tagged symbolic strings, while maintaining support for nix-s until Nickel 2.0.

matklad commented 1 year ago

I guess this could be a contract?

let neovim_config = ''
  -- Turn on incsearch
  vim.opt.incsearch = true
'' | lang "lua";

that way, you can even attach contract to a record field, and all call-sites could get syntax highlighting automatically, given a sufficiently smart language server and editor (I think today sadly only IJ supports type-driven injections)?

matthew-healy commented 1 year ago

@matklad Interesting suggestion! Currently we have a distinction between contracts - which generally result in runtime assertions against values - and metadata - which annotate values/fields/etc. with extra info which the LSP has access to. (This distinction is probably not clear enough in the documentation right now).

This could likely work as a metadata annotation, though that opens us up to the possibility that the annotation gets attached to non-String values. E.g. we need to handle the case { neovim_config | lang "lua" = 1 }. One benefit of using String prefixes for this is that they can only ever prefix a String, so we avoid having to handle these additional errors.

(Aside: the syntax for multiline strings in Nickel is m%"..."% and let expressions are always of the form let $ident = $expr in $expr, so the '' syntax and trailing ; that are showing up in some of these examples aren't valid code, though I appreciate that's not really important for this discussion.)

matklad commented 1 year ago

🤔 if this is a contract, we presumably can express that it “inherits” from String contract, so that { neovim_config | lang "lua" = 1 } gets an apparent type error.

And I guess we could even implement optional syntax checking in a contract (not implying that we should). This could be fun to verify that, e.g., syntax is valid after string interpolation.

ISTM there are three bits of data here, which we usually want to apply in bunch, but sometimes might want to separate into separate annotations:

String type
lang metadata (purely for highlighting)
lang contract (actual runtime syntax checking)

matthew-healy commented 1 year ago

I'm going to try and share my thoughts here, but we're reaching the limits of my personal knowledge of how metadata annotations work currently, so @yannham (who is currently on vacation) may want to correct me on some of the details later.

It's true that if it's a contract we can express that the input value is a String. My point was mostly that we don't have any current examples of contracts which pass metadata to the LSP, and doing so is non-trivial. Ultimately contracts are just functions Label -> Value -> Value, which may or may not terminate the program. That's not to say this could never happen, but that doing this feels to me like a pretty sizable change.

It might be feasible to have a metadata annotation which generates a contract at runtime, but it's worth noting that there are differences in where contracts & metadata can be applied (see RFC005). That means that if we used a metadata annotation, regardless of whether or not it also applied a contract, we'd lose the ability to annotate both arbitrary values and identifiers. I.e., your initial example:

let neovim_config = m%"
  -- Turn on incsearch
  vim.opt.incsearch = true
"% | lang "lua"
in ...

would no longer work, as the lang annotation would be restricted to record fields. Again, that's not to say this could never happen, but rather that the barrier to making something like this work is higher than the proposed string literal prefix solution, which is already partially implemented due to the "symbolic strings" functionality.

I do really like the idea of being able to check whether embedded code is valid, but it seems to me that'd require some sort of effects/plugins system to work, which is definitely something that's been discussed, but not something that's on the roadmap for the near future. I think it makes sense to focus on an approach for syntax highlighting first and then we can discuss these sort of extensions once that's done.

stelcodes commented 1 year ago

Wow it's great to see interest in a feature like this! While metadata and contracts seem like a cool feature ideas, I was imagining more of a basic, pure-syntax implementation that has no effect on the language runtime. This would be much, much easier to parse and extract with regex (or something a little more advanced like treesitter) which most editors are limited to for syntax highlighting functionality.

After getting a bit more familiar with Nickel syntax, these are the changes I would make to the language to accommodate for this feature and simplify string syntax:

(Please note, I'm aware these would require massive breaking changes and are a bit ridiculous. I've probably spent 1/10000th of the time thinking about Nickel and language design as y'all have. I just want to share!):

Remove m prefix for multiline strings, use immediate newline character after last % to determine if multiline indentation should be applied.
```
let a = %"
The two spaces before "The" get removed.
"%
```
```
let b = %"  The two spaces before "The" are not removed."%
```
Symbolic strings identified by non-alpha symbol like a bang !:
```
let x = !%"
some symbolic %{string}
"%
```

Language annotations can optionally be used as a prefix before a string like so:

let y = lua%"
local t = { 'some', 'lua', 'code' }
"%

let z = lua!%"
local t = { 'some', 'lua', 'code', 'with', %{symbolic-expressions} }
"%

milahu commented 1 year ago

ideally allow passing compiler options for intellisense

python?version=2
typescript?jsx=react&target=ES2020
javascript?env=node
cc?std=c++20

yannham commented 1 year ago

I mostly agree with what @matthew-healy said. Having contracts for checking external languages syntax would be awesome, but it requires much more machinery to come to life. On paper, we can already implement custom contracts, and thus could in theory write a Ruby parser (or whatever) in pure Nickel. However, it would probably be a difficult experience, and be quite inefficient: while simple parsers can (and should!) be expressible, parsing a full fledged programming language isn't really what Nickel was made for.

Another solution would be to have those contracts be builtin, or at least leveraging external code (in Rust) or even an external tool (like tree-sitter). This would correspond to the plugin/effect approach, and would be an interesting avenue to explore. I think CUE, for example, which has much more limited expressive power, permits to offload some validation or computations to arbitrary external Go code. Last time I spoke with CUE's creator, he said we wanted to move to WASM. While much more things can be done in pure Nickel, as it is a turing complete language, maybe we don't want to re-implement a hundred parsers in a language that isn't exactly general-purpose either. But this requires to work out a plugin/effect system first.

For now, it seems a simple string prefix would already come a long way. We can always add contracts later (although it might become a bit redundant then to have both a contract and a prefix?). With respect to the LSP, I also would like to point out that we are much less limited than with the interpreter. We can do additional work, heuristics, or change the behavior without it being a language breaking change. We could very much detect the usage of a hypothetical new builtin contract std.contract.Lang 'Ruby and automatically check the corresponding String, I don't think this is that difficult. Fancier contract application could go under the radar, but we don't need to be always 100% accurate, as long as we're not wrong and cover most of the basic usages.

With respect to the prefix, I agree that in retrospective, the -s isn't optimal. That being said, we can always decide to convert names to snake_case, or just accept dashes as well, such that Objective C would be obj-c-lang or obj_c-lang. I agree that the lang: syntax is prettier, but honestly this doesn't sound very important either. I would rather not already make a breaking change and bump to 2.0 for that, wait for 2.0 to add this feature, or introduce inconsistencies in string prefixes by having lang:c but nix-s. Introducing different ways of doing the same thing also doesn't sound great to me, while we've just released 1.0.

But, symbolic strings are currently probably only used within Nixel. Maybe we can afford to switch the syntax to s: (or sym), still supporting -s until 2.0 while secretly hoping that once we've changed every occurrence in Nixel and in the documentation, virtually no one will still be using the -s syntax?

Radvendii commented 1 year ago

One advantage of the | lang "foo" annotation idea is that it doesn't have to be specified at point of use, it can be specified when the contract is defined. For instance, in nixpkgs' mkDerivation, buildPhase, etc. could specify that they're bash strings and you'd get syntax highlighting for free, without specifying it every time.

As for using it on non-strings... can we have | lang "foo" imply | String? Some kind of desugaring step.

Or it could just be a no-op on non-strings, and leave it up to a linter to complain, kind of like how (IIRC) you can specify | doc "foo" on a let-binding even though that's useless.

An argument against the string prefix is that it implies intuitively to me that there's something semantic going on, especially because m%, nix-s%, raw% (when it drops) do change how things are parsed, interpreted, etc.

tweag / nickel

Optional language annotation for multiline strings #1310