tweag / nickel

Better configuration for less
https://nickel-lang.org/
MIT License

Symbolic strings (Nix string contexts-like) #948

Closed. yannham closed this issue 1 year ago.

yannham commented 1 year ago

Is your feature request related to a problem? Please describe.

Working on Nickel-nix and on Nix integration in general (#693), we've found ourselves needing something like Nix string contexts.

A string context is a way of implicitly and automatically attaching metadata to string values and combining it (in the case of Nix, the dependencies that must be built before the paths appearing inside the string become valid). When a string with context is interpolated inside another string, all the dependencies (the contexts) are combined. This feature is really useful for avoiding having to specify obvious dependencies explicitly (e.g. source files).

However, we don't want to implement Nix string contexts as-is, because they are pretty ad-hoc and Nix-specific. We would rather have a more general mechanism, of which string contexts would just be one instance, that could be used for other domains (Terraform, Kubernetes, etc.) or for different use-cases within Nix (IFD/recursive Nix-like).

Fundamentally, Nix string contexts are an overloading of string interpolation (and other string operations) to work on richer values than plain strings. Very schematically, Nix strings are rather {ctxt : Array Deps, value: Str}.

We've discussed the possibilities many times. Having a general ad-hoc overloading mechanism would be possible but pretty heavyweight (think traits/typeclasses, or even a very restricted form just for strings), with the usual problems of coherence, complexity for new users, etc.

In some way, Nix string contexts could be implemented using effects (#85), e.g. if we allowed performing effects at string interpolation. However, such an effect system has yet to be properly designed for Nickel, and effect handlers would be implemented in Rust, as interpreter plugins, which makes them rather heavy to implement and distribute. For something like Nix string contexts, that could be OK, as we would only have to do it once per target tool. But it's still a long way to get there.

This issue makes a simple and lighter proposal that could achieve the same effect while relying only on one (very small) language feature and otherwise on pure Nickel library code. It also seems to be forward-compatible with performing effects at string interpolation.

Describe the solution you'd like

We propose to introduce a new form of strings, let's call them symbolic strings, and write them using the delimiters s%" and "%s. Normal strings with interpolation are parsed as a list of chunks, where one chunk is either a string literal or an interpolated expression. For example, "foo %{bar} baz" is represented as (something like) [Chunk::Literal("foo "), Chunk::Expr(..), Chunk::Literal(" baz")]. String chunks are then evaluated at runtime, when first encountered, and turned into an actual string.

Symbolic strings would be almost the same, but they would return the chunks as a normal Nickel expression, and wouldn't evaluate them further. For example:

s%"foo %{bar} baz"%s would just be equivalent to writing:

{
  tag = `StrChunks,
  chunks = [
    {tag = `Literal, value = "foo "},
    {tag = `Expression, value = bar},
    {tag = `Literal, value = " baz"}
  ]
}

(the shape of chunks is just an example, and up to discussion)

Then, the library consuming such a string, or even just the contract attached to the field, would be in charge of doing whatever it wants with it. Typically, Nickel-Nix already has a nix_string_hack function that can process this kind of list and produce an AST that is re-interpreted on the Nix side, reconstructing the contexts, and thus giving the same automatic and implicit dependency management as in Nix. But it uses normal function calls and arrays, which is arguably not very nice to read. Here is an example of how it is used:

args = [ "-c",
  ([inputs.gcc, "/bin/gcc ", inputs.hello, " -o hello\n"]
   @ [ inputs.coreutils, "/bin/mkdir -p $out/bin\n"]
   @ [ inputs.coreutils, "/bin/cp hello $out/bin/hello"])
   |> nix.lib.nix_string_hack

Symbolic strings would just be an alternative, better syntax for this expression, allowing us to write:

args = [ "-c", s%"
  %{inputs.gcc}/bin/gcc %{inputs.hello} -o hello
  %{inputs.coreutils}/bin/mkdir -p $out/bin
  %{inputs.coreutils}/bin/cp hello $out/bin/hello
"%s, ]

This is really no different from what you would write in Nix today.

The change on the language side is really minimal (interpolated strings are already parsed as chunks; we just need to transform them into a Nickel value). Because symbolic strings are just composite Nickel values, the only string operation that is natively supported is interpolation (for example, you can't call string.length ~~or ++~~ on them). That being said, interpolation seems to be what you use 99% of the time, and string operations don't even make sense in some cases (such as taking the length of a Terraform computed value like an IP). The library writers providing an "interpreter" for those strings may then export additional string manipulation functions where they make sense (in the case of Nix, we can know the path at evaluation time, so we may define and export more string primitives in the library).
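To make this concrete, here is a rough sketch of what such a library-side "interpreter" could look like, assuming the example chunk shape above. Everything here is illustrative, not a fixed API: the `render` argument stands for whatever library function turns an interpolated value into a string, and the stdlib names (`array.fold`, `++`) are those of the current Nickel stdlib.

```nickel
# Hypothetical sketch: fold a symbolic string's chunks back into a plain
# string, given a library-provided `render` function that turns an
# interpolated value (e.g. a Nix input) into a string.
let interpret = fun render sym =>
  sym.chunks
  |> array.fold
    (fun chunk acc =>
      (if chunk.tag == `Literal then chunk.value else render chunk.value)
      ++ acc)
    ""
in interpret
```

A Nix-oriented library could instantiate `render` to map an input to its store path (while also collecting the dependency context as a side product), whereas a Terraform-oriented one could emit a placeholder to be resolved later.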

Related approaches

In fact, this idea is very close to the quasiquote/unquote/unquote-splice mechanism of Lisp. Or, even more specifically, to the G-expressions of Guix, but with a more idiomatic Nickel string syntax (and probably a few unimportant differences: in this proposal, interpolating would probably be more like unquote than unquote-splice, that is, we wouldn't automatically "flatten" the AST but would leave that to the library code).

aspiwack commented 1 year ago

It looks sensible. Regarding supported operations, while length is indeed not meaningful, a symbolic string is fundamentally an array, so you can have some equivalent of ++ on it (it's not unlikely that you'll want to).

A thing to note is that symbolic strings will probably not support contracts in a meaningful way in the cases (like Nix, if I understand correctly) where the symbolic chunks are meant to be evaluated outside of Nickel.

Finally, I know that s stands for “symbolic”, but I'm a little scared of s% because of how s kind of reads like “string”. And I see this as potentially quite confusing.

Radvendii commented 1 year ago

+1 to everything @aspiwack said.

In addition:

yannham commented 1 year ago

Regarding supported operations, while length is indeed not meaningful, a symbolic string is fundamentally an array, so you can have some equivalent of ++ on it (it's not unlikely that you'll want to).

True, if we have interpolation, we do have concatenation.
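For example, concatenation could be recovered from interpolation alone. A hypothetical sketch using the proposed syntax (the name `concat_sym` is made up for illustration):

```nickel
# Hypothetical: concatenating two symbolic strings by interpolating both,
# relying on the library interpreter to combine the underlying chunks.
let concat_sym = fun a b => s%"%{a}%{b}"%s
in concat_sym
```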

Finally, I know that s stands for “symbolic”, but I'm a little scared of s% because of how s kind of reads like “string”. And I see this as potentially quite confusing.

Oh for sure, I didn't want to think too hard about it and just write the issue down, but it's an awful name.

yannham commented 1 year ago

Is there a reason to have s%" "%s rather than just s%" "? Or even just %" "?

Nickel already has multiline strings written m%" / "%m. It's thus consistent to reserve xxx%" delimiters in the future for all kinds of special strings (we can think of raw strings, for example, or strings with language highlighting support in the editor, such as cpp-lang%", which would be like markdown's ```cpp). That being said, if your tag is more than one character, it starts to be annoying to have to repeat it at the end. We may be better off with xxx%" / "% (I think this is in line with what Rust does with # signs for raw strings).

You do want a different delimiter than just " usually, at least for multiline strings and raw strings, in order to be able to write normal double quotes unescaped inside.

In your example, presumably we can drop all the .outputPaths, right? Because the library code gets to decide how to use the results of evaluating the chunks, so it can use a derivation just like it can use a path.

You're totally right, I've updated the issue.

This seems like it's really just an array. Do we even need to tag the parts as `Literal or `Expression? The only difference seems to be that if it's tagged, you can have an `Expression that's really a string. But in that case, would you ever want to do something with it besides splice it into the string directly?

This is a good question. I don't know. The proposed approach is more general, but I don't have an obvious example right now that could make use of that. Your proposal also enforces that s%"%{"foo"}"%s has to be the same as s%"foo"%s, which is arguably a natural thing to expect, while currently the custom string interpreter could decide differently.

aspiwack commented 1 year ago

Also, if the delimiter is more than one character, do you repeat it at the end, or do you reverse it :smiling_imp:

Radvendii commented 1 year ago

You do want a different delimiter than just " usually, at least for multiline strings and raw strings, in order to be able to write normal double quotes unescaped inside.

Oh, of course. Good call.

That being said, if your tag is more than one character, it starts to be annoying to having to repeat it at the end.

Also, if the delimiter is more than one character, do you repeat it at the end, or do you reverse it :smiling_imp:

Lol, yeah. I think just "% for closing might be a good call. Though my guess is that "% might show up in strings sometimes, and it's not a sequence people expect to have to escape. This shows up with '' and ${ in Nix, too. This is getting into more general syntax bikeshedding though, so it's probably best to continue at another time / in another thread.

The proposed approach is more general, but I don't have right now an obvious example that could make use of that. Your proposition also enforces that s%"%{"foo"}"%s has to be the same as s%"foo"%s, which is arguably a natural thing to expect, while currently, the custom string interpreter could decide differently.

Yeah, that's what I was thinking. I don't know exactly what the right solution is. More general vs. enforcing that constraint. How easy would it be to change it out from under people once it's implemented? There would end up being a bunch of code depending on whatever format we choose. But also, it wouldn't be that hard to convert from one to the other.

yannham commented 1 year ago

Lol, yeah. I think just "% for closing might be a good call. Though my guess is that "% might show up in strings sometimes, and it's not a sequence people expect to have to escape. This shows up with '' and ${ in Nix, too. This is getting into more general syntax bikeshedding though, so it's probably best to continue at another time / in another thread.

That's OK: you can do as in Rust raw strings (or C++, I think), which is to repeat the % arbitrarily. If you write m%%" as an opening delimiter today, then "%%m is your closing delimiter, and m%" / "%m don't need to be escaped inside.

matthew-healy commented 1 year ago

I'm in favor of including tags in the resulting Nickel values. It seems feasible to me that library authors might want to handle a literal value and the equivalent interpolated string differently, and this enables that at a very small overall cost. Removing the tags makes the feature less flexible for, as far as I can see, no real benefit.

If the desired behaviour of a specific implementation is for { tag = `Expression, value = "x" } to behave exactly like { tag = `Literal, value = "x" } then that's easily implemented with a simple type_of/is_string check against the value.
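A minimal sketch of such a check, assuming the chunk shape from the issue and a builtin.is_str-style predicate from the stdlib:

```nickel
# Hypothetical: collapse interpolated string values into literal chunks,
# so that `Expression chunks holding a string behave exactly like `Literal.
let normalize_chunk = fun chunk =>
  if chunk.tag == `Expression && builtin.is_str chunk.value then
    { tag = `Literal, value = chunk.value }
  else
    chunk
in normalize_chunk
```

A library that wants the "interpolated string = literal string" law would map this over the chunks before interpreting them; a library that wants to distinguish the two simply wouldn't.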

matthew-healy commented 1 year ago

For the opening identifier: I agree that s%"..."% is a little vague. I think it's fine for a first pass at the implementation (and it's probably what I'll use initially), but we should definitely bikeshed alternatives before this feature becomes generally available.

As a related point in the design space: I like how Scala (3) exposes its StringContext custom interpolation feature. The example in the documentation gives a pretty good overview (here, under "Advanced Usage"). In particular, it's the library author who chooses the opening identifier for the interpolated string, and it's done at the same time as specifying how to interpret it.

Imagining a hypothetical future where something like this was possible, and we could write nix%"..."% or json%"..."%, I really like that:

  1. It's immediately clear what this string is for and how it will be interpreted.
  2. It's not possible to accidentally pass a symbolic string to the wrong parsing function.
  3. End-users get a simpler abstraction ("the nix library requires special nix%" strings" vs "some libraries require special s%" strings").
  4. It becomes the job of library authors to educate users on what magic strings their particular library requires & why.

yannham commented 1 year ago

Imagining a hypothetical future where something* like this was possible, and we could write nix%"..."% or json%"..."% I really like that:

It's a small detail, but someone suggested that we allow special strings for code highlighting in the editor, a bit like in markdown. In this case we'd want to disambiguate whether e.g. nix%" represents Nix code or a Nix interpolated string. Maybe with a specific prefix/naming scheme.

A second point is that we may, in a distant future, want to mix both Terraform and Nix interpolation in the same string, like:

s%"
  %{inputs.lighttpd}/bin/server -p 80 -i %{resources.foobar.ip}
"%s

I imagine that in any case you'd need a combined parsing function, but I wonder about the prefix for such a string. Maybe it's fine to have nix-tf and any such combination, because we aren't anticipating a lot of possible mixed interpolations.

Besides that, I like the idea. I just wonder how to select the parsing function, because there's currently absolutely no notion of program-wide declarations (somehow, all bindings are local) or name resolution in the language. Maybe the contract could be in charge of specifying that, and fail if the prefix doesn't match the declared function. In that case, the parser would interpret anything like foobar%" as a symbolic string if foobar isn't already a builtin prefix, and pass the name as an additional argument to the parsing function, which then has the additional possibility of failing if the name is not the expected one.

Doing so isn't fully satisfying though:

  1. Nothing prevents the same library from having two different parsing functions accepting the same kind of string: the parser would be declared very locally.
  2. The parsing function is in charge of checking the name, but it might just as well ignore it, allowing nix%, terraform% and blorg% strings to all be accepted and treated exactly the same way.
yannham commented 1 year ago

If the desired behaviour of a specific implementation is for { tag = `Expression, value = "x" } to behave exactly like { tag = `Literal, value = "x" } then that's easily implemented with a simple type_of/is_string check against the value.

I don't have a strong opinion, but I think the question is whether we should enforce that the law s%"%{"foo"}"% ~ s%"foo"% holds, because it's natural to expect (like, say, a monad law). Meaning that a pure string value should never carry a context.

Radvendii commented 1 year ago

(a) I think the possibility of someone accidentally violating that law is non-trivial. (b) It simplifies the representation: you don't first check the tag and then check the type to decide what to do with a chunk, you just check the type. Maybe you could do that anyway, and just ignore the tag.

matthew-healy commented 1 year ago

I spent some more time thinking about this today. I'm still broadly of the opinion that differentiating between literal strings and interpolated values (regardless of their type) on a purely syntactic basis is my preferred solution. The mental model is obvious, and it exactly matches the string as it's written. It's also how the StringContext abstraction works in Scala.

I agree this opens the feature up to potentially unexpected implementations, but I think it's the responsibility of library authors using this feature to handle that. (Much in the same way that in Haskell you can implement Monad however you like, but checking that your instance obeys the monad laws is your own responsibility.) I also don't think we can definitively say right now that it will never be useful to handle a symbolic string in a way that treats interpolated strings differently from literal ones, though I'll concede that I can't actually think of a use case. :sweat_smile:

Whatever we go with, it will be possible to build safer APIs on top, too, e.g. a stdlib contract like strings.Symbolic a which ensures that every value is either a Str or matches the contract a. In fact, we should probably do something like this whichever solution we choose, because it feels likely to be the most common use case for symbolic strings.
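Such a contract might look roughly like this. Purely a sketch: strings.Symbolic doesn't exist yet, the tagged chunk shape follows the proposal above, and builtin.is_str / contract.apply are the stdlib names assumed here.

```nickel
# Hypothetical contract: every interpolated value in a symbolic string is
# either a Str or satisfies the contract `a`.
let Symbolic = fun a => fun label sym =>
  {
    tag = sym.tag,
    chunks = sym.chunks
      |> array.map (fun chunk =>
        if chunk.tag == `Literal || builtin.is_str chunk.value then
          chunk
        else
          # Delegate checking of non-string interpolated values to `a`.
          { tag = chunk.tag, value = contract.apply a label chunk.value }),
  }
in Symbolic
```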


Another thing I think it's worth mentioning is that the question of how to represent strings is somewhat coupled to the question of whether or not to tag chunks. For example:

let lib = import "./lib.ncl" in
{
  # ...
  some_config_val = s%"[...] %{lib.constants.a_string} [...]"%, 
  # ...
}

If we do tag values, but we also want to treat strings specially, to enforce that interpolating a string is always the same as writing the literal, then we need to evaluate lib.constants.a_string to determine its type in order to know how to tag the chunk.

This problem goes away if we just return Array Dyn, since we don't need to calculate anything based on the term and can just closurize the term into the resulting array's environment.


I'm not quite at the point of implementation where this decision needs to be made, but I'm also not far off. I'll probably just go with whatever's simplest for my initial implementation, but it shouldn't be too hard to change it if we decide to do so.

yannham commented 1 year ago

Much in the same way that in Haskell you can implement Monad however you like, but checking your instance obeys the monad laws is your own responsibility

I'm not sure this is a feature, so much as it's just that enforcing the monad laws is undecidable (to do automatically) and impractical to prove in today's Haskell. If Haskell had full-fledged dependent types, I think it wouldn't be out of the question for Monad instances to be required to come with a proof of the monad laws. An unlawful monad is probably never what you want (you may want something Monad-like that doesn't respect the monad laws, but then you should probably not call it Monad).

Once again, I think the question is really: should such a law hold, because it's natural, because it's what people would expect, or for good theoretical reasons (e.g. breaking this law would break important properties of the language)? If we decide so, I think we can enforce it easily.

If we do tag values, but we also want to treat strings differently to enforce that interpolating a string is always the same as writing the literal, then we need to evaluate lib.constants.a_string to decide its type in order to know how to tag the chunk.

True. I think @Radvendii's suggestion is to not tag at all, meaning that the library function has no way to differentiate between literal strings, interpolated literal strings, and computed interpolated strings (we can still perform "effects" on pure strings, though; it's just that we have to treat them all the same). Then everything would probably have the type Array Dyn, indeed.

matthew-healy commented 1 year ago

I'm going to close this issue, as a version of symbolic strings has already been merged. There are definitely further discussions to be had about the shape of the feature, but it seems to make sense for those to happen in their own issues.