JSON comments-- again - Githubissues

mewalig commented 6 years ago

I know this issue has been discussed before and the view of the jq team is that this should be in an external tool. I've written a poor external tool to show that some of this can be done externally, but it runs into limits that I don't believe can be overcome as an external tool (without re-implementing all of jq which isn't really an external tool then). So, I'm opening a new issue to specifically discuss:

types of comment support
the pros/cons of an external tool
an example of an external tool that you can find at https://github.com/mewalig/json-comments
limitations of an external tool and potential solutions

The two reasons I wrote the external tool were a) to provide a quick and dirty hack for limited use cases and b) to explore the limits of what can be done using an external tool.

I understand there are differing views on whether JSON should support comments at all. This discussion is targeting use cases where we have already decided that having comments in JSON would be useful and are now evaluating what options are available to support them.

There are many characteristics and flavors of comment support, including:

comment syntax: C-style, C++-style, other. If we can support one, it should be trivial to support others
comment placement: separate line, same line, etc. I think the main question here is, given an original ugly format that contains comments, where to place the comments when the format is prettied
comment existence: if the json is filtered, whether / how comments should be output

Regarding comment placement and existence, I would propose that the comment is "attached" to its immediately-preceding token in the input JSON. If/when that corresponding token is output, the comment is output with it. This would apply when the JSON output is filtered as well.

As for pros/cons, I think the most important "pro" is to keep jq focused strictly to the JSON spec (and therefore keep the code base etc streamlined), and the most important "cons" are:

it would be difficult (maybe impossible) to use jq as a filtering tool and still properly handle comments
jq is very popular already, but even so, I suspect that adding comment-parsing capabilities would make it significantly more compelling for many of its existing and would-be users

I have posted a partial proof-of-concept tool at https://github.com/mewalig/json-comments. All it does is strip comments out, save them to a file, and then add them back in. In between those steps, jq can be used to pretty-print the output, so the end result can be pretty-printed output with comments. Because it is proof-of-concept, it only supports ASCII and C++-style comments, but those limitations are easy enough to remove and what I'd classify as trivial.

The most important limitation is that it only adds comments back into equivalent JSON. In other words, it can't be used with a jq filter other than e.g. "."-- obviously this is a big limitation.

Which brings us to the final topic: limitations of an external tool and potential solutions. Some potential solutions:

build filtering capabilities into the external tool. This is a bad idea as it would just recreate what jq already does so well.
create some "add-on" functionality for jq to work together with the external tool. For example, jq and or the external tool could create some data files or communication channels that jq and/or tool use that ultimately result in some filtered output that contains comments. The problem with this approach is that it's likely to be more work than building into jq, because there would need the additional work of exchanging data between jq and the external program.
build the capabilities entirely into jq. I actually don't think this would be that hard. It could work as follows:
- struct jv is augmented to support comment information associated with that value
- whenever a comment is read from input, it is saved with most recently-read jv structure
- whenever that jv structure is output, the comment is also output. to keep it simple, the output could be C-style so that it can come in the middle of a line and e.g. before any trailing comma

It has been proposed in other discussions to use an external tool, but I don't see how that could possibly work (or be less work) given the above issues, unless we are content to give up the combination of filtering + comment support.

Thoughts?

nicowilliams commented 6 years ago

Well, if I absolutely had to do this, I guess I'd:

add support for multiple input formats
add a "hidden" object/array to every object/array where comments are stored
add syntax for getting the "hidden" object/array of an object/array

Something like:

$ jq -c --input-format JSON-with-comments '. as $dot | comments as $comments | keys_unsorted[] as $key | {key:$key,value:($dot[$key]),comment:($comments[$key])}' <<EOF
{"foo":"bar" /* comment #1 */, /* comment #2 */ "bar":"baz"}
EOF
{"key":"foo","value":"bar","comment":" comment #1 "}
{"key":"bar","value":"baz","comment":" comment #2 "}

You'll notice right away that this isn't good enough. What about an input like {"foo":/*a comment*/ /*another comment*/ "bar" /*more commentary*/ /*more still*/, /*moar*/ "bar":"baz"}?!

Ok, so the "hidden" comments object/array needs to store arrays of comments. Oof.

I mean, it's getting complicated.

Can we simplify? Do we even have a specification for what JSON-with-comments is? Are we sure that whatever that spec is, no one will come along and ask for some variation we didn't think of? Where do we stop?

Next issue: what if we need to select an input format per-input file? Do we want to derive the format from a file extension? Now things are really getting complicated. It's kind of a morass.

If you want to flesh this out and send a design and PR, we can look at it and maybe compromise. But I have no energy and no desire to do any of this work at this time, so unless you're willing to put in the work knowing that we might ultimately reject it, well, then I think jq is just not the tool for you unless your external tool can get good enough to deal with it.

Can you just... bug people who are using JSON-with-comments and get them to stop and adopt a schema that embeds commentary more like:

{ "foo":"bar", ".comment.foo":"a comment on the 'foo' key" }

nicowilliams commented 6 years ago

Moar crazy inputs:

$ cat file
// a comment
/* another */
[/* what is this? */ /* and this? */ null]

mewalig commented 6 years ago

Will respond re technical question separately. Re the question of whether we can bug people to incorporate comments into "proper" JSON: I understand that not including comments in the JSON spec is not a jq decision. However, to instead bug everyone to use the workaround, that's kind of like saying to all your programmers: "our compiler doesn't support comments, because the people who designed C++ decided not to, so when you write that code please put all comments into dummy string variables." Totally doable, and reasonable given the spec-- but sometimes the specs leave a lot of value on the table. Just like compilers often implement extensions that are not required by the corresponding programming language's formal spec (and which often subsequently then get incorporated into the spec), I'd think there's much value in having jq offering a "comment" extension even though it's not part of the JSON spec.

For example, consider the below-- which is a poor example that I only gave 30 seconds of thought to but hopefully is illustrative enough-- and how much the use of the fake comment approach severely waters down the value of comments.

This commented version:

{
    "protocol": "http",          // use "http" or "https"
    "host": "example.com",
    "whitelist": [               // specify one or more IP addresses here
       "111.222.333.444",        // can use use ip4v
       "2001:0db8:85a3:0000:0000:8a2e:0370:7334"  // can also use ipv6
    ]
}

is just a WHOLE lot more presentable and understandable than:

{
    "protocol": "http",
    "protocol_comment": "use \"http\" or \"https\"",
    "host": "example.com",
    "whitelist": [
       "111.222.333.444"
    ],
    "whitelist_comment": "specify one or more IP addresses here",
    "whitelist_111.222.333.444_comment": "can use use ip4v",
    "whitelist_2001:0db8:85a3:0000:0000:8a2e:0370:7334_comment": "can also use ipv6"
}

One of these (the former) is clear enough to put into technical documentation, verbatim. The other is not, and you can't really say to your technical doc reader, "well the reason we do this weird comment thing is that the JSON spec doesn't support any other way, so the onus unfortunately gets passed on to you to suffer the consequences".

If we forget about decisions of the past and ask ourselves: when we want to communicate to another human being what our JSON is doing, would there be a better way to do so than using JSON with some comments in it? Maybe yes in many cases, but for many many cases, personally, I would say the answer is no, and I would guess that many would agree, for the same reason that we see (and that best practices recommend) code snippets everywhere that have embedded comments.

nicowilliams commented 6 years ago

@mewalig I'm sorry but that response is not helpful. JSON is a standard. There's NO standard for JSON-with-comments to my knowledge. A helpful response would point to such a standard.

Second, people who make up their own JSON extensions and then rely on them are making a terrible mistake if they don't provide a way to get valid, unextended JSON out.

Third, JSON is NOT a document format. I would not advise editing JSON-encoded data by hand. I don't really care how ugly it looks to express comments without extensions -- it's the UI that matters. If JSON is your (or some third party app's) UI, well, I guess that's a problem, but why should be everyone else's problem?

I'm still open to considering something here, but nothing without a spec, full stop.

mewalig commented 6 years ago

Hold your horses please, I said at the top of that response that I would respond to the technical aspect separately

mewalig commented 6 years ago

Regarding technical comments:

First: "can we simplify"? absolutely.

I think even the following fairly narrow scope would be immensely useful:

Only C++ and C style comments. Or if we wanted to further simplify: only C++ style comments.
Each comment will be assumed to be "attached" to its immediately-preceding token, where a token would be anything that jq would typically put on a new line when it pretty-prints-- that is, a square or curly bracket, or any (non-key-name) terminal value
Comments are also supported before the first token
Comments are always output C-style (perhaps if output is pretty-printed, could default to C++ style)
No back-to-back comment support*
No support for comments immediately following a dictionary key, and preceding the related value*

*At least initially, though it wouldn't be that hard to add support for this if the rest of the proposed scope was already working

Note that what I am proposing is that comments can be attached not just to dictionaries or arrays, but also to terminal values. Furthermore, comments do not attach directly to a dictionary or array, but rather, attach to the opening or closing bracket of the dictionary or array, or to one of the items in the dictionary or array. Generally speaking, the guiding principle I have in mind is that the comments would attach in a 1-1 fashion (to the extent a corresponding comment exists) with each line in a pretty-printed output.

It would be conceivable to allow a comment to attach to a dictionary key name as well, but I don't think that's necessary-- certainly not for an initial iteration at least.

So using above examples:

echo '{"foo":"bar" /* comment #1 */, /* comment #2 */ "bar":"baz"}' | jq '.'

would be unacceptable because "comment 2" and "comment 1" are back-to-back (just to clarify-- I don't think we can't support this later-- only that for the initial, narrow scope, we make a decision not to)

However,

echo '{"foo":"bar" /* comment #1 */, "bar":"baz" /* comment #2 */ }' | jq '.'

would yield:

{
  "foo": "bar", /* comment #1. Note that it doesn't matter that it originally came before the comma */
  "bar": "baz" /* comment #2 */
}

Similarly,

// a comment
/* another */
[/* what is this? */ /* and this? */ null]

would be unacceptable input because of the back-to-back comments, but

jq '.' <<EOF
// a comment
[/* what is this? */  null]
EOF

would yield:

/* a comment */
[ /* what is this? */
  null
]

In addition, {"foo": /* comment */ "bar" } would be unsupported since it is between a key and its value.

I may be able to make these changes to the jq code base, but I'd sure prefer to know before putting that time in that there is at least some sign of possibility of it being merged into the official base. At the least it would not be hard to describe the above in a formal grammar.

mewalig commented 6 years ago

Re your response:

"A helpful response would point to such a standard".

Please let me know if you think this point is not already addressed. I've provided a description of a standard and would be happy to provide a formal grammar for it if it looks like there is a reasonable possibility of it being used.

"Second, people who make up their own JSON extensions and then rely on them are making a terrible mistake if they don't provide a way to get valid, unextended JSON out."

I am not suggesting that, but rather that jq could parse C/C++-style comments and then have an option to include or not include them in the output

"I don't really care how ugly it looks to express comments without extensions -- it's the UI that matters. If JSON is your (or some third party app's) UI, well, I guess that's a problem, but why should be everyone else's problem?"

I am not suggesting we use JSON as a UI, but rather that it would be valuable to support comment parsing for the same reasons that, though we don't use C as a UI, it's still useful to support comments in C, and same for HTML. Furthermore, a UI is different from technical documentation and code snippets. As an example, for the same reason code snippets that contain comments have value, JSON snippets that contain comments have value, and are more valuable when the comments do not need to be manually removed in order to use the snippet. Sure we can use Swagger or OpenAPI etc to describe JSON structure but it isn't the same as seeing some samples with comments.

Just to reiterate: I am just stating my opinion on how jq could be more useful to a wider range of circumstances and users, and how that might be implemented. Please do not take these suggestions as anything more than a desire to be helpful.

nicowilliams commented 6 years ago

Ok, thanks for addressing the technical issues.

I'm not interested in attaching comments to tokens. The problem is that that makes the internal representation of all values "fatter", whereas for objects and arrays it's not so bad because they already have a lot of overhead. So comments have to only be attachable to "slots" in objects and arrays, and perhaps the whole object/array itself. This means: no top-level comments. That's the furthest I'd be willing to go. But this is not very general, and so it's just ugly, so I think "never mind".

How about this: an JSON-with-comments transformer that lifts the input schema to make room for comments and produces an equivalent output in a new schema with comments as proper JSON values.

E.g.,

$ comments2json <<EOF
/* a comment */
false // another comment
true
[/*fooarr*/ "foo","bar" /*baz*/]
{/*fooobj*/ "foo":"bar" /*baz*}
EOF
[" a comment "]
[false, " another comment"]
[true, null]
[["foo","bar"],[null,"baz"],"fooarr"]
[{"foo":"bar"},{"foo":"baz"},"fooobj"]
$

or maybe use objects:

$ comments2json <<EOF
/* a comment */
false // another comment
true
[/*fooarr*/ "foo","bar" /*baz*/]
{/*fooobj*/ "foo":"bar" /*baz*}
EOF
{comment:" a comment "}
{"value": false, "comment":" another comment"}
{"value":true}
{"value":["foo","bar"],"comments":[null,"baz"],"comment":"fooarr"}
{"value":{"foo":"bar"},"comments":{"foo":"baz"},"comment":"fooobj"}
$

mewalig commented 6 years ago

I think that could be useful if it could convert both ways e.g. A-with-C-comments => B-with-JSON-comments can be reversed back e.g. B-with-JSON-comments => A-with-C-comments (or a filtered version of A).

If we only supported comments attached to objects, then, is there an efficient way to attach a comment to any individual element at least of a dictionary?

For arrays, it's a bit less of an issue because it's common for each element of an array to be the same type, so a single comment would apply equally to every element-- but the same isn't true of dictionaries.

Another approach that might have the benefit of not fattening any structures, not impacting performance except when the option is used and lots of comments are present, and might also work with individual tokens, might be to store comments in a separate heap, with a 1-way reference to the token it attaches to, and whenever a token is output, the comment store is checked for a match.

Obviously, doing a lookup on every token print scales terribly; however, I think we get lucky here because the use cases where one wants to process JSON that contains comments, versus one where one wants to process large quantities of JSON that contain large quantities of commments, are probably almost always mutually exclusive, and the exceptions (I would be surprised if any exist in the entire world) are simply not what this would be intended to address. So it would be entirely reasonable limit the target scenario to small data sets, and if implemented as a rbtree the performance would probably be unnoticeable for any real-world use cases. Meanwhile, the impact would be nil for JSON that does not contain comments (or when the use-comments option is turned off).

Also, it could be a pretty clean implementation, and none of the existing data structures need to be touched-- would just add the comment to the global store when parsed, and look for the corresponding token in the global store when printing. But if multiple filters run on separate threads or otherwise would cause issues in addressing shared memory, I can see how this could be problematic

nicowilliams commented 6 years ago

jq does not keep track of "tokens" once a jv value is produced internally from parsing. And values can be constructed at run-time without parsing.

The idea for hiding comments in an object was to have them in a separate object hidden inside the real one, and accessible with some special function. It's very tricky though. Lots of generality gets lost. E.g., the ability to find paths to values would not extend to comments, nor would the ability to use assignment operators. It'd be much to easy to lose track of comments too: jq '{foo:,foo}' would lose all comments in input objects -- this is really not good and would yield lots of support issues.

I think I'd rather have a filter (possibly as a parser inside jq!) that converts JSON-with-comments to JSON with lifted-schema as shown earlier. Such a filter would, of course, be reversible. It would complicate jq programs dealing with such inputs, but I don't mind that, and the complication would not be a big deal, and no generality would be lost.

mewalig commented 6 years ago

ok, I think I understand generally but the specifics in terms of how it would affect input/output are hazy (due to my own limitations). So jq would output all comments as its own isolated object, and if you wanted to re-incorporate them, you would save that output as a variable, and then use another jq filter to re-combine the comments with the output?

Rather than me try to understand every aspect of what happens in-between, maybe easiest to describe a few input/output scenarios. For the moment I'll use --input-with-comments and --output-with-comments as shorthand corresponding to some jq command(s) that i) ingest input (both comments and data) and output some intermediate form in one or more valid JSON outputs, and ii) do the inverse i.e. ingest the outputs from i) and output some JSON with C/C++-style comments-- with the understanding that what is represented by --input-with-comments might actually be a series of filters/commands that extend or prefix the other filters, and not just be a simple flag.

would the below (or equivalent C-style instead of C++ style) be supportable input?

{
"protocol": "http",          // use "http" or "https"
"host": "example.com",
"whitelist": [               // specify one or more IP addresses here
   "111.222.333.444",        // can use use ip4v
   "2001:0db8:85a3:0000:0000:8a2e:0370:7334"  // can also use ipv6
]
}

If so, could the input be de-constructed and then re-constructed? Let's say the above is saved in X-with-comments.json:

cat X-with-comments.json | jq --input-with-comments '.' --output-with-comments
# should spit out something like the below, though presumably, retaining exact whitespace would be out of scope):
{
"protocol": "http",          // use "http" or "https"
"host": "example.com",
"whitelist": [               // specify one or more IP addresses here
   "111.222.333.444",        // can use use ip4v
   "2001:0db8:85a3:0000:0000:8a2e:0370:7334"  // can also use ipv6
]
}

Can we filter the object and retain the related comment? E.g.:

cat X-with-comments.json | jq --input-with-comments '.protocol' --output-with-comments

to yield:

"http"   // use "http" or "https"

Same question but for array? E.g.:

cat X-with-comments.json | jq --input-with-comments '.whitelist | .[1]' --output-with-comments

to yield:

"2001:0db8:85a3:0000:0000:8a2e:0370:7334" // can also use ipv6

Would this be achievable with the approach you have in mind?

nicowilliams commented 6 years ago

With the schema-lifting format converter you could put comments just about anywhere and yet manage to represent where they were in the input (though maybe not line number and column number).

And you could print out the same way.

One way to use this might be:

$ # parse JSON with comments into lifted schema form:
$ jq -nR 'parse_and_lift_json_with_comments(inputs)'

$ # Identity filter, with reformatting as expected
$ jq -nR 'parse_and_lift_json_with_comments(inputs) | tojson_with_comments'

These examples assume no new command-line options to jq, just jq functions to do parsing and formatting of alternative encodings than JSON. This would extend to YAML, XML, ... Each non-JSON format would have to have a corresponding schema to represent the external data.

For example, XML would parse into arrays where each element is an object with the following keys: name (the element name, canonicalized), attributes with a value of null or an object with the attributes, children with a null or array value of child nodes. Text nodes would be strings. The whole thing would be wrapped in a top-level object that would also collect things like processing instructions and other such metadata.

For JSON with comments the schema would be roughly as described above, with every value in the original wrapped in an object that contains the value in a value key and also contains a comment key for comments associated with the value itself and a comments key for values associated with keys in the object/array. (Actually, if we can associate a comment with only values, not keys, then we don't need a comments key at all.)

mewalig commented 6 years ago

I see, that sounds interesting... so the schema-lifting functionality would be some sort of limited parser generator that, based on the input grammar (schema), generates a parser on the fly to then parse a non-JSON (presumably UTF-8) input into a JSON structure, and could apply equally well to commented JSON as well as XML YAML etc?

nicowilliams commented 6 years ago

Yes.

I think that's cleaner. It would be nice to be able to process XML in jq -- jq as a pithier XSLT/XPath, what's not to like?

mewalig commented 6 years ago

That sounds very cool. Out of curiosity, do you already have a syntax in mind for defining the schema or an existing parser generator that you would use?

nicowilliams commented 6 years ago

@mewalig I don't yet have a way to express schema designs, not at this time anyways.

nicowilliams commented 6 years ago

@mewalig Suggestions as to how to express such a schema would be welcomed!

mewalig commented 6 years ago

I initially had a very long response to this. But my thoughts kept changing as I tried to translate them into pseudo-code. In the end, I think that the best way to really figure this out would be to take 2-4 target use cases, such as pure JSON, XML, YAML, and commented-C, and flesh out exactly how they might be represented, and compare / contrast based on those examples. It is too hard (at least for me) to consider all of the potential issues and solutions in my head-- I have to get it out in writing with actual examples to arrive at an opinion.

Let's start with pure JSON, which definitely should be within scope. At the least, this should flush out the minimal set of parse rule related actions that would need to be supported. For no other reason than that I personally like bison grammar, I have used something like that below.

Note that I have left out the lex rules for STR and NUMBER. As you must know given that jq has to parse UTF-8 and JSON number syntax, the rules aren't complex but they also can't be expressed in a single rule. I'm not sure what the best approach would be for this, but one simplification that I think is reasonable would be to only support grammars that parse UTF-8 input.

json: value INPUT_END    { $$ = $1; }
;

value: object        { $$ = $1; }
value: array         { $$ = $1; }
value: terminal      { $$ = $1; }
;

object: '{' '}'    { $$ = jv_new_object(); }
object: '{' members '}'  { $$ = jv_new_object($2); }
;

members: member       { $$ = jv_new_memberlist($1); }
members: members ',' member  { $$ = $1; $$.append($3); }
;

member: string ':' value { $$ = jv_new_member($1, $3); }
;

array:  '[' ']'          { $$ = jv_new_array(); }
array:  '[' items ']'    { $$ = jv_new_array($2); }
;

items: value             { $$ = jv_new_itemlist($1); }
items: items ',' value   { $$ = $1; $$.append($3); }
;

terminal: string         { $$ = jv_new_string($1); }
terminal: NUMBER     { $$ = jv_new_number($1); }
terminal: 'true'          { $$ = jv_new_bool(1); }
terminal: 'false'         { $$ = jv_new_bool(0); }
terminal: 'null'           { $$ = jv_new_null(); }
;

string: STR          { $$ = jv_new_null($1, 0); }
;

perlun commented 4 years ago

@nicowilliams

There's NO standard for JSON-with-comments to my knowledge. A helpful response would point to such a standard.

Not really a "standard", but some useful links:

NPM package supporting parsing jsonc: https://www.npmjs.com/package/jsonc-parser (implemented by Microsoft)
VS Code documentation about this mode: https://code.visualstudio.com/docs/languages/json#_json-with-comments

This format is widely used within VS Code, which is quite useful since you can annotate your config file with comments. Indicated by this GitHub linguist PR, it's also used by other tools like .babelrc.json and Sublime Text configs. Perhaps others as well?

(I can live with it not being first-class-handled by jq, hadn't bothered me until today when trying to parse a json-with-comments file for the first time. In HTTP responses, I'd say it's much less common so it depends on what the typical use case for jq is aiming for - parsing local files (which could very well be jsonc) or parsing content coming from some remote URL (curl https://foo/bar | jq .))

leonid-s-usov commented 4 years ago

Why can't jq just accept and ignore comments for starters? Maybe with a --strict mode available (potentially, ON by default)

MikeTaylor commented 4 years ago

Just leaving this comment so I get notified of further discussion, so that when the inevitable comment-supporting fork of jq happens I get to hear about it.

MrKepzie commented 2 years ago

cc me

kevingilbert100 commented 1 year ago

+1

liquidaty commented 1 year ago

Got tired of waiting. Here's a pull request that simply strips comments of the form #, // and /*.

I'm 99% sure I missed something because otherwise that was too easy.

https://github.com/stedolan/jq/pull/2548

echo '{ "a": 1 /* hello there
} */ }' | jq

yields

{
  "a": 1
}

Or this:

echo '{ "a": 1 /* hello there
} */ , "b": 2 // hi
, "c": 3 /*
hi
there
*/,
"d": 4 # howdy
}' | jq

yields

{
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4
}

liquidaty commented 1 year ago

@MikeTaylor there it is

pkoppstein commented 1 year ago

@liquidaty wrote:

... otherwise that was too easy.

Without meaning to imply anything about the prospects of your contribution being accepted(*), one thing to consider is whether there should be an option to turn this functionality on or off. Backward compatibility and perhaps other considerations would argue for the default behavior not to change. Another issue is what the flag should be called. "--strict" is tempting, but given that your contribution does not address the known strictness issues, "--strict" might be premature.

-- (*) The jq maintainers seem to have gone AWOL. For this and other reasons, you might like to consider asking @itchyny whether this functionality might be incorporated into gojq in some way.

liquidaty commented 1 year ago

@pkoppstein I totally agree that it would be better to make this an option, and I think it would be easy to do so and would be happy to make that further enhancement. I just wanted to first get a sense of whether there was in fact any chance of it being accepted-- in either form-- before going that extra step.

Personally, I think --allow-comments or something like that would be better than --strict, so the behavior remains unchanged from before, unless opted in.

Agreed re maintenance being lacking. We already use our own a jq fork because we need utf8 upper/lower functions. Maybe everyone who is looking for a fork that can more easily accept updates can band together and create a new fork for that purpose? I'd be up for it. I see this is already under discussion at #2305

pkoppstein commented 1 year ago

@liquidaty - I'll confine this post to the proposed "--allow-comments" enhancement.

Without some kind of positive signal from @stedolan or one of the maintainers, I'd be quite pessimistic about any enhancement of the type envisioned here being incorporated into the official jq repository in the near term. However, I would suggest continuing the discussion with a view to an enhancement being incorporated into a fork or gojq.

One point of discussion is consistency between jq's two JSON parsers (i.e. the regular one, and the "streaming" parser). Is this already guaranteed in the PR?

Another point of discussion concerns comments in JSON within a jq program. Currently, I believe it's the case that any JSON that is accepted as input to a jq program will also be accepted as a jq program, or as a component of a jq program. It might be tricky to preserve consistency (between all three parsers) if only because any changes to parser.y can be tricky. In fact, allowing '//'-style comments within such embedded JSON is problematic in principle, because of the inherent ambiguity this introduces in expressions like:

  null // 1 // comment

(This could be interpreted as being equivalent to the jq expression (null // 1) # comment or as null # 1 // comment.)

Agreed, it might be best to circumvent the issue by simply leaving the existing jq parser as-is; in this case, I believe one should nevertheless ensure that the three parsers handle "JSON-with-#-comments" in the same way. But perhaps that is already the case?

pkoppstein commented 1 year ago

@liquidaty - Thanks for offering to help support a well-maintained "fork" of jq.

The last "commit" to the jq repository by a maintainer appears to have been in May 2022, and to my knowledge, the maintainers have not given any encouragement about maintaining jq for quite some time. It seems to me that unless such encouragement is offered in the near term, we've just about reached the point where it would indeed be reasonable to create a "successor fork", that would only incorporate changes that are made with a view to having them incorporated in the stedolan repository. Ideally, such changes would also either reflect changes that have already been adopted in gojq, or would be acceptable to @itchyny for potential adoption in gojq.

Obviously, maintaining jq at a high level presents a wide variety of challenges. Perhaps the place to start is to ensure that the initial team of official maintainers has all the required technical skills?

cc: @stedolan @nicowilliams @wtlangford @dtolnay @leonid-s-usov @wader @tst2005

liquidaty commented 1 year ago

@pkoppstein thanks for your comments. This response is to the first of your last two posts.

One point of discussion is consistency between jq's two JSON parsers (i.e. the regular one, and the "streaming" parser). Is this already guaranteed in the PR?

I'm not that familiar with the jq internals but from what I can tell, both cases are handled properly. The modification, which is made in scan(), applies equally to both. My trivial tests for that, which pass fine, are the aforementioned ones, plus the same ones duplicated but with --stream added.

All that said, if the PR was accepted, I think it would be extremely important for at least some basic tests to be added to verify it works as expected (and for those tests to be part of CI)

Another point of discussion concerns comments in JSON within a jq program [...] or as a component of a jq program

TL DR now this is handled.

It's not clear to me whether this should be treated as a separate issue, but I suppose it would be confusing otherwise so let's just agree it's one and the same. I've updated the PR with a tiny (4-line) change to lexer.l that adds C-style comment stripping (obviously, this could also be subject to an opt-in option) (the updated PR also includes the related derived changes to lexer.c and lexer.l).

As for C++-style comments, it seems to me that the JQ command syntax is fundamentally incompatible. So, we could i) just never strip them, ii) only strip them outside of a JQ command, or iii) maybe only strip them if inside a JSON component of a JQ comment. My own thoughts are that ii) is preferable and this is what the PR reflects, though this subtlety isn't as important to me as the overall topic of "let's support some form of JSON comment stripping" and I'd defer to whatever the more popular view was on this subtlety.

mcandre commented 8 months ago

Perhaps we could compromise with a CLI flag to specify the precise format. We continue defaulting to parsing with the classic, rudimentary JSON AST. If the user adds a flag like --json5, then we use a more flexible AST that allows for comments and so on.

Unfortunately, there don't appear to be many JSON5 transformation CLI or formatting tools. jq could be extended to be that tool.

olix0r commented 8 months ago

@mcandre I got tired of waiting for this and wrote a minimal json5 to json transformer a while ago. YMMV https://github.com/olix0r/j5j

liquidaty commented 8 months ago

@olix0r , @mcandre , @nicowilliams , @pkoppstein, [update: adding @itchyny]

Can we take a first step forward on this and commit to having something in 1.7? #2548 for example is a fully-QC-test-passing (at least as of a few months ago), super light / non-invasive update that moves the ball forward at least on a portion of this collection of issues that is very cut, dry and non-controversial.

MikeTaylor commented 8 months ago

Yes! Please!

nicowilliams commented 8 months ago

@liquidaty can you rebase that PR?

metaory commented 1 month ago

sharing my basic version,

https://gist.github.com/metaory/70f15c242ca5c423638b6fcea843171d

[!Caution] full spec is NOT covered covering the very basic that I needed

Basic JSONC Parser

[!Note] it removes

Comments to end of line

Single line block comments

Empty lines

[!Caution] its NOT covering

Full spec

Trailing Commas

comments in values

Multi line block comments

TL;DR

`s/\/\/.$//g;s/\/\.*\\///g;/^[[:space:]]$/d`

basic-jsonc.sed

:comments-to-endofline
s/\/\/.*$//g

:comment-blocks
s/\/\*.*\*\///g

:empty-lines
/^[[:space:]]*$/d

sed -f basic-jsonc.sed test-file.jsonc

# pipe to jq
sed -f basic-jsonc.sed test-file.jsonc | jq '.'

# replace in place
sed -i -f basic-jsonc.sed test-file.jsonc

Example JSONC

//test-file.jsonc
{
  // hoge
  "foo": "bar", // here
  "no": "comment in values are NOT ok",
  "@umm": "guys is /* block */ fine too?",
  /* are you fine? */
  "hmm": /* even here? */ "comment block in values not ok",
  "test": "yup some good",
  "comment at end of value?": "nope",
  "indent": "hey covered tab and nested?",
  "ind": {
    "hoge": /* just saying */ [
            "x11", /* behind me are tabs! */
      "yup u okk",

      "mixed multiline empty lines above",
      "HOUSTON WE HAVE A PROBLEM",
      "watt?",
      "TRAILING COMMAS",
      "i knw, i'm too lazy, deal with it!",
      "i can avoid them, or write more regex",
      "comments are enough for nw"
    ], // okkk
    "res": "ok"
  }
  // zxzzz
}

inline pattern and replace in place

sed 's/\/\/.*$//g;s/\/\*.*\*\///g;/^[[:space:]]*$/d' test-file.jsonc

liquidaty commented 1 month ago

@metaory I do not believe a regex can ever fully work to parse comments. Using yours for example, this fails: { "fails": /* x */ 1 /* y */ }.

Anyway a long time ago I submitted a fully-functional PR with tests and pretty much everything else needed to be drop-in and after multiple rounds of making every requested adjustment to have it merged in (other than the last one which I'm about to respond to), it hasn't gone anywhere so good luck.

liquidaty commented 1 month ago

@liquidaty can you rebase that PR?

@nicowilliams Thanks for asking but no, I'm not going to, and here's why:

I already created a complete, fully-functional, fully-tested, cross-platform compatible PR that resolved every issue anyone raised (within the scope of that PR which was focused solely on ignoring comments in input)
I repeatedly asked to get this high-demand feature included in 1.7. I was given several indications that it would
It was ignored for 1.7 and no one ever told me why, and no one gave me or anyone else any opportunity to address whatever unspoken objection existed to getting it in 1.7
I already did a bunch of git commands to get the PR in an acceptable state, and I have no idea why that was not already sufficient
I have no idea what litmus test is being used to determine whether the PR has been properly rebased, so if I was try try to do this, I would have no way to determine whether it is satisfying whatever unstated condition this task is being asked to satisfy
To summarize, I do not have confidence that this would not be a complete waste of my time and that at the end of it, the PR would not still be sitting there, unmerged, doing nothing for anyone
Whoever is deciding what gets merged and what does not-- that person is in the best position to do whatever git rebase is necessary

To be clear, I am more than happy to help as much as I can if a decision-maker is driving the process. Alternatively, if a core maintainer is willing to commit in writing that "Yes, if you do this, the PR will be merged into the next release"-- then I will reconsider. Until then, I've grown weary of jumping through hoops with no end in sight.

metaory commented 1 month ago

@metaory I do not believe a regex can ever fully work to parse comments. Using yours for example, this fails: { "fails": /* x */ 1 /* y */ }.

Anyway a long time ago I submitted a fully-functional PR with tests and pretty much everything else needed to be drop-in and after multiple rounds of making every requested adjustment to have it merged in (other than the last one which I'm about to respond to), it hasn't gone anywhere so good luck.

I rather a solution that can be a single line and covers my requirements over something that's +100 lines and covers the full spec

I already stated it doesn't cover the full spec,

in most cases we just want single line comments,

metaory commented 1 month ago

@liquidaty

@nicowilliams

I'm not sure if something that can cover most cases in a single line sed needs to be part of jq,

metaory commented 1 month ago

@metaory I do not believe a regex can ever fully work to parse comments. Using yours for example, this fails: { "fails": /* x */ 1 /* y */ }.

thanks I've updated the readme,

honestly most cases can be a single line sed,

for anything more accurate and spec compliant, pipe it in from one of many well known spec compliant parsers

tianon commented 1 month ago

Arguably, many uses cases of jq itself can be covered by a single line of sed (I used to use sed/grep/awk to parse JSON before finally committing to using jq) -- the benefit to using jq instead is that it's much more expressive and more importantly, much more correct, so the edge cases don't bite as hard as they tend to with sed/grep/awk-based solutions.

metaory commented 1 month ago

converting something that is NOT JSON to JSON should be on jq as well?

this is exactly that.

not far from yaml, toml, and so on

jqlang / jq

JSON comments-- again #1571

Basic JSONC Parser

TL;DR

`s/\/\/.$//g;s/\/\.*\\///g;/^[[:space:]]$/d`

basic-jsonc.sed

Example JSONC

inline pattern and replace in place