Implement raw string sigils for CLI field customization.

tweag / nickel

Better configuration for less

https://nickel-lang.org/

MIT License

2.43k stars 93 forks source link

Implement raw string sigils for CLI field customization. #2068

Open vi opened 1 month ago

vi commented 1 month ago

^ to use literal string without interpretation
@ to read string from file
% to read string from environment variable

Example:

nickel-lang-cli  export q.ncl  -- big_text_field=@myfile.txt path_envvar=%PATH just_string=^qwerty

Related to #2043.

Open questions:

Are sigils the right way? They may be not very scalable, e.g. if we want to specify json files it won't be as straightforward.
- If yes, which symbols to use for what (and which ones to reserve for further additions)?
- If no, what to use instead?
Maybe specifying Nickel expression (expect of just numbers) should have heavier syntax compared to raw strings and other passive content?
- Seemingly trivial nickel-lang-cli export ... -- the_number=123 may lead inaccurate user to the_number=$UNTRUSTED_USER_INPUT, where the user would assume that it would just fail if it is not a number.
Shall they be parsed from grammar.lalrpop, just like regular assignments? If so, I am uncertain how to design it properly.
Shall discriminating between those types be done earlier the result be encoded in FieldOverride (that may gain something like FieldOverrideValue enum)?
Shall files read by @ go though the cache?
What to do with errors? Just reuse IOError or extend the main Error type?
Direct number

github-actions[bot] commented 1 month ago

Bencher Report

Branch	2068/merge
Testbed	ubuntu-latest

⚠️ WARNING: The following Measure does not have a Threshold. Without a Threshold, no Alerts will ever be generated!
Latency
Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds CLI flag.

Click to view all benchmark results

Benchmark	Latency	nanoseconds (ns)
fibonacci 10	📈 view plot ⚠️ NO THRESHOLD	488,400.00
foldl arrays 50	📈 view plot ⚠️ NO THRESHOLD	1,821,900.00
foldl arrays 500	📈 view plot ⚠️ NO THRESHOLD	23,810,000.00
foldr strings 50	📈 view plot ⚠️ NO THRESHOLD	7,016,800.00
foldr strings 500	📈 view plot ⚠️ NO THRESHOLD	61,943,000.00
generate normal 250	📈 view plot ⚠️ NO THRESHOLD	40,043,000.00
generate normal 50	📈 view plot ⚠️ NO THRESHOLD	1,806,400.00
generate normal unchecked 1000	📈 view plot ⚠️ NO THRESHOLD	58,861,000.00
generate normal unchecked 200	📈 view plot ⚠️ NO THRESHOLD	2,890,200.00
pidigits 100	📈 view plot ⚠️ NO THRESHOLD	3,152,900.00
pipe normal 20	📈 view plot ⚠️ NO THRESHOLD	1,484,500.00
pipe normal 200	📈 view plot ⚠️ NO THRESHOLD	12,034,000.00
product 30	📈 view plot ⚠️ NO THRESHOLD	836,050.00
scalar 10	📈 view plot ⚠️ NO THRESHOLD	1,508,200.00
sum 30	📈 view plot ⚠️ NO THRESHOLD	809,790.00

🐰 View full continuous benchmarking report in Bencher

yannham commented 1 month ago

I personally think sigils are a good way forward; they're simple, easy to add, and there's precedent of their use on the CLI. My personal preference would to not duplicate them too much, at the risk of having a cryptic CLI that resembles a Perl expression: for example we could just use @ everywhere, but have a kind of URL/protocol specification that is maybe slightly less succinct but easier to understand. For example: @file:foo.txt, @env:ENV_VAR, etc.

We could have sensible but simplistic default: if nothing is specified, it's a file, and we use the extension to deduce the format. We probably also want a way to specify the format explicitly if needed, so that we can write @file:text:foo.txt. This also solves the issue of how to get a Number directly instead of a string: if an environment variable store a number that we want to interpret as such,, we could do @env:nickel:ENV_VAR. The default would be text, but this can be discussed.

Shall they be parsed from grammar.lalrpop, just like regular assignments? If so, I am uncertain how to design it properly.

This isn't required, if the parsing is really trivial. I think field paths on the CLI as in foo.bar.baz in nickel eval config.ncl -- foo.bar.baz=5 used to be parsed separately. But field paths are a bit more involved - you can have escaping, so stuff like foo."bar.baz".balrg, the code was already existing the grammar as Nickel needs to parse such field definitions as well. So it made sense to avoid duplication and do everything in the grammar.

This is not the case here though, as the syntax we propose isn't valid Nickel. So as long as we know exactly what characters we expect and we don't have any kind of stack to maintain like for well-bracketed expressions, parsing the sigil manually is most probably ok.

For the rest of your questions, I think @foo.txt should behave exactly as sugar for foo='import "foo.txt"'. We can even make it actual desugaring: that way, we don't have to duplicate any logic or think too hard about each detail. We can't really do that for environment variables, but this can be a guide: if for any question, the answer is probably how would that behave if we put the same content in a file and imported it instead.

vi commented 1 month ago

if nothing is specified, it's a file, and we use the extension to deduce the format

if an environment variable store a number that we want to interpret as such

What is security standpoint for Nickel in general? Should be in general protect against mixing code and data?

Shortcuts like @$MY_FILE meaning @file:text:$MY_FILE can lead to surprising injections where MY_FILE variable can be bent to contain a prefix like literal:nickel:evil_code instead of a file name.

Should Nickel strive to make secure way the easiest and avoid easy to code, superficially working, but unsound pitfalls? Filename extension sniffing with a fallback from passive json/yaml/text to active Nickel code is already a step in the wrong direction if it does.

exactly as sugar for foo='import "foo.txt"'

It would be tricky to just transform @foo.txt to import "foo.txt" reliably, as foo.txt may contain embedded " or %{}s. Escaping, then unescaping looks ugly and can also be a security issue if done poorly.

Proper way also opens a path to support non-Unicode filenames, to allow any files to be specified e.g. on Linux, like -- my_field=@file:text:$'ABC\xFF\xFE\xFF.txt'. Such filenames are cumbersome to round trip through a Nickel code fragment.

vi commented 1 month ago

But field paths are a bit more involved - you can have escaping, so stuff like foo."bar.baz".balrg, the code was already existing the grammar as Nickel needs to parse such field definitions as well.

Currently implementation does reuse proper parser for the part on the left side of = character.

It does assume that = cannot happen in the field name part. Is it the case?

vi commented 1 month ago

Do I correctly assume that current way of specifying fields on CLI is not stable, so we break backwards compatibility and e.g. require some additional prefix to specify a nontrivial Nickel code as a field value?

yannham commented 1 month ago

Shortcuts like @$MY_FILE meaning @file:text:$MY_FILE can lead to surprising injections where MY_FILE variable can be bent to contain a prefix like literal:nickel:evil_code instead of a file name.

It's important to note that Nickel is entirely pure at the time, and can't dynamically load files. So all an evil file or Nickel snippet could do is...to produce a value (well, or loop indefinitely, there's that). While it's a valid point, code injection isn't the same for Nickel as for say for python.

On one hand I'm tempted to say that it's the user's job to use @file:text:$MY_FILE if $MY_FILE is untrusted content. On the other hand you're right that we should make the default, simplest approach the safest. One possibility is to always require the file part explicitly. Or to chose a syntax where the short one can't be a prefix of the longest one, in some sense. Anyway the @file:text was just a suggestion: I just want to vouch for readable syntax rather than having 5 ad-hoc Nickel-specific sigils. I hope we can find a syntax that satisfies both readability and avoid potential injections.

It would be tricky to just transform @foo.txt to import "foo.txt" reliably, as foo.txt may contain embedded " or %{}s. Escaping, then unescaping looks ugly and can also be a security issue if done poorly.

I'm not sure to understand this part. Are you talking about the filename that can have embedded " or %{}s, or the content of the file? Note we don't have to go through a round trip of actually printing and parsing Nickel code. We're in the interpreter, so we can generate an Import AST node directly without caring about escaping. What I mean is that the semantics should be the same as if we did the round trip, not that we have to actually do it exactly as text substitution. That should still answer all the questions like "should this go to the import cache" and whatnot.

Currently implementation does reuse proper parser for the part on the left side of = character. It does assume that = cannot happen in the field name part. Is it the case?

Ah, this is a good point. I was thinking about parsing just the sigil part (the value) but we still have to find the value part and it can be non-trivial: indeed fields can contain = as in foo."bar=baz?".value=true. So maybe it's safest to do it in the parser. Or maybe we can just take the last =?

Do I correctly assume that current way of specifying fields on CLI is not stable, so we break backwards compatibility and e.g. require some additional prefix to specify a nontrivial Nickel code as a field value?

In theory, yes. In practice, this means like we need to prefix stuff like true, 5, etc. So we have to weight each possibility and see if it's really worth it. Even if it's unstable, people are still probably using it, so breakage must be justified.

vi commented 1 month ago

So all an evil file or Nickel snippet could do is...to produce a value

It can include arbitrary file from a filesystem, which may be unsandboxed. That value may unexpectedly contain secrets.

Are you talking about the filename that can have embedded " or %{}s, or the content of the file?

The filename. Only \x00 and / cannot happen in a filename on Linux.

Or maybe we can just take the last =?

What if the filename or a literal string contains a =?

Maybe CliFieldAssignment should not parse both Vec<LocIdent> (the left-hand part) and RichTerm (today's right hand part) in one go, but do it two steps: firstly just Vec<LocIdent> with = and some arbitrary string chunk after =, then (with the appropriate prefix to indicate arbitrary Nickel expression) it would be a RichTerm?

require some additional prefix to specify a nontrivial Nickel code as a field value

In practice, this means like we need to prefix stuff like true, 5, etc.

5 and true counts as a trivial Nickel code. Nontrivial means there can be imports, lets, string interpolation, etc.

If we go the grammar.lalrpop way then we can discriminate between trivial terms and complex ones.

yannham commented 1 month ago

What if the filename or a literal string contains a =?

Maybe CliFieldAssignment should not parse both Vec (the left-hand part) and RichTerm (today's right hand part) in one go, but do it two steps: firstly just Vec with = and some arbitrary string chunk after =, then (with the appropriate prefix to indicate arbitrary Nickel expression) it would be a RichTerm?

Hmm. Parsing arbitrary stuff is annoying to do in the current parser infrastructure, because the lexer still produces tokens that are tailored to the Nickel language. It's not impossible to do, but might prove annoying - requiring something like a new lexer mode that is toggled by the parser. Maybe the simplest is to add lexer rules for stuff like @.... instead.

In general I don't see how a prefix for complex expressions is going to help though, because if you do foo=$VAR then $VAR could just include this prefix. You probably want the converse - have a prefix to restrict what comes after, but then it becomes the user's responsibility to add it, which is a problem, because the natural foo=$VAR becomes dangerous (beside files, as you mention, I reckon it could also contain something like @env:PATH or @env:SECRET).

Maybe what we should do is the following: by default, restrict everything that can be given as a field value on the CLI. That is, we accept string literals, numbers, bools etc - all literals, and maybe arrays and records of literals - and that's it. It's easy to check after parsing, we can just validate the right hand side.

Then for anything more involved, like foo.bar='import "other.json"', Nickel would require the usage of additional flags or permissions - like --allow import, and --allow expression for more complex expressions. There are small design questions around if you want those flags to apply to one field only, or all, or flags for both cases.

For the sigils, maybe not having shortcuts is the way forward: you always need to write @file: and @env:, which isn't too bad. And we can have a different separator for the format so that it can't be injected, for example @file/json:foo.txt. So when you do @file:$MYVAR, MYVAR can't change the protocol at all.

@ is used for array concatenation in Nickel, but if we chose a different character, we can also put the protocol before, which would solve the injection issue similarly. Say we use ~ as a sigil (we don't want to, but for the sake of the argument). Then the syntax could be file~foo.txt, file:json~foo.txt, and ~$MY_VAR would be a safe shortcut.

vi commented 1 month ago

foo."bar=baz?".value=true

How important are those field names containing =? Maybe it can be worth to just limit CLI to simpler, reasonable field names and require a workaround (e.g. a small Nickel file that copies one field to another) to set tricky names?

yannham commented 1 month ago

How important are those field names containing =? Maybe it can be worth to just limit CLI to simpler, reasonable field names and require a workaround (e.g. a small Nickel file that copies one field to another) to set tricky names?

I don't expect that to happen a lot. But I would still be a bit disappointed that an implementation detail (it's annoying to parse) leaks to the interface and someone somewhere on day will get a failure because they use a = in their field name and won't understand why.

Thinking about this again, in fact we could find the right = by leveraging the lexer. The lexer is indeed able to handle strings properly, so we could lex the whole thing and extract the first normal = token that we find (which will provide a span). This = must be the one of the assignment, as strings are lexed as separate tokens.

vi commented 1 month ago

Thinking about this again, in fact we could find the right = by leveraging the lexer.

I also though about it.

the whole thing

Can lalrpop parse prefix of a string; not expect EOF, just stop parsing when a specified token is found?

This can be used even to handle non-Unicode filenames that cannot be put in a String - remember proper byte position of the =, then apply it to original OsString argument to get the filename reliably. Should I try implementing it (introduces complexity) or avoid handling this for now to keep things simpler?

yannham commented 1 month ago

Can lalrpop parse prefix of a string; not expect EOF, just stop parsing when a specified token is found?

Not that I know. LALRPOP just pops tokens from the lexer until it can't parse stuff anymore, to the best of my knowledge. But honestly the more I think about it, the more I think the lexer is the easiest: it's cheap (lexing should be pretty fast), it's simple (take until =), it doesn't need to modify the grammar or to do strange LALRPOP hacks...and if there is no sigil, we can just fallback to parsing it again entirely in a normal way. There is no way the cost of lexing the field path two times is even measurable, I suspect.

And if we think that the sigils are the exception and that normal field assignment should be the fast path, we can do the converse: try to parse, if there is an error, restart a lexer and try to sigil-parse, and if both fails report the original parse error. But I'm not sure this will make any difference, honestly.