[Feature] Comment Preservation Proposal

tracker1 commented 3 years ago

In addition to ideas from #22 I wanted to make some specific suggestions about comments support.

const result = toml.parse('...', {
  preserveComments [boolean] = true,  // default: preserve comments
  enumerableComments: [boolean] = false // default: comments property is not enumerable
});

preserveComments when not truthy will exclude the ___COMMENTS___ property altogether
enumerableComments (See Object.defineProperty.) would allow for the ___COMMENTS___ property to be enumerable. This would allow for preservation via JSON.stringify, default would be false. toml.stringify would explicitly check for values and include them if non-whitespace comments are present.

The property for comments would literally be result.___COMMENTS___; (three underscores before and after)

The format for comments would be:

result,___COMMENTS___.${entry}.BEFORE - entries and document
result,___COMMENTS___.${entry}.INLINE - only for entries
result,___COMMENTS___.${entry}.AFTER

The entry key for the document is result.__COMMENTS__.___DOCUMENT___

What would be returned, should be the same object structure as before with an optionally enumerable (false by default) ___COMMENTS___ property, three underscores, all caps.

As mentioned in @scriptcoded 's comment Multiple comments together should be merged with \n.

# Top of file
# Multiline?
      << must have an empty line to break top of file and before names, otherwise will be before names
# before names
[names]
# before foo
"foo" = "Mr. Foo" # inline foo
# after foo
   << blank line, otherwise comment above would be attached to names.bar
# before bar
"bar" = "The big Bar"
  << blank line, anything below attached to next section
# before baz
"bop" = "bopper"
# before boom
"boom" = "boomer"
# after boom
  << blank line...
# bottom or file

Edge cases... For multiple interspersed comments/lines, the first block with a blank line after, and no blank after the node will belong to the previous node's AFTER, all additional nodes will be part of the next node's BEFORE, The document nodes will be before/after as such.

# comment1
# comment 2
   << blank
# comment 3
  << blank
# comment 4
[first]

Becomes:

{
  ...
  ___COMMENTS___: {
     ...
    ___DOCUMENT___: {
      BEFORE: " comment1\n comment2\n\n comment3"
    }
  }
}

NOTE: spaces after # preserved, no whitespace lines preserved except for the case of \n \n in between comment blocks, but any whitespace on the original, otherwise blank lines are removed.

With two \n\n together the output blank line stays blank... if there was a block between, even if no whitespace, should parse with \n \n space between the \n and will output # with no space on the blank line, even if the original had one. This may not align with the original input, but is probably the best option for dealing with this edge case.

Comment sections beginning with \n mean there are blank lines at the top of said block, similarly for the end, any dangling \n indicate blank lines.

# block
     << has whitespace, no other comments

becomes block\n whitespace on blank line removed, but dangling \n is blank line.

\n\n comment

becomes:


# comment

empty line(s) at start.

Carriage returns (\r) are dropped.

# start
#    
# more

The middle line (#) with whitespace, will result in start\n \n more where the extra whitespace is dropped to a single space, similar for otherwise blank lines being replaced with nothing, but having extra \n

tracker1 commented 3 years ago

Made several edits to first post for clarification and some simplification... hope this makes sense and would help for defining the expected behavior.

scriptcoded commented 3 years ago

What a nice notification to receive a WFH friday morning :)

This is a great suggestion, and a lot more fleshed out than the few words I posted. I have a few points that I'd like to bring up:

If this API is intended to be public, and I think it should be, might we want to use a more developer friendly name for the ___COMMENTS___ property? Generally speaking I don't seem to see triple underscores being used a lot in the JS community, and my first thought was that it was a double underscore. A common convention is to use a single underscore to indicate a private property (_comments), but I can see this causing a naming conflict. Perhaps the middleground, double underscores?

I guess the safest way of doing it would be with symbols (myToml[TOML.Comments]), but that would pose issues with both serializing and supporting older versions of Node.
For multiple interspersed comments/lines, the first block with a blank line after, and no blank after the node will belong to the previous node's AFTER, all additional nodes will be part of the next node's BEFORE, The document nodes will be before/after as such.

How would the following TOML be handled in that case?
```
# Comment 1

# Comment 3
[foo]
bar = 123

# Comment 4

# Comment 5

# Comment 6
```
I'm specifically concerned with the last comments (5 and 6). Since there's no next node, will they be part of bar's AFTER, or instead be part of ___DOCUMENT___'s AFTER?
Continuing on my last point, is there a risk that the AFTER property will cause confusion? Take the following TOML:
```
[foo]
# This section contains fooy things

amount = 2
```
The comment obviously belongs to the section [foo] and will en up in foo.AFTER. Now, what about the following TOML:
```
[foo]

# This section contains fooy things

amount = 2
```
If I've understood your proposal correctly the comment will now end up inside amount.BEFORE instead.

I see a potential of a lot of these edge cases coming up, and many of them will likely be because us humans can take a decision based on context, while a piece of code can't.
1. This last point is about how we will handle multiline strings. I just want to make sure that I understand how it would work.
  
  In the case of this string: Hi,\n \nThis is a new line the following TOML would be produced:
```
# Hi,
# 
# This is a new line
```
  And in the case of this string: Hi,\n\nThis is a new line the following TOML would be produced:
```
# Hi,

# This is a new line
```
  Would that be correct? If so, then this string: A\n\nB\n\nC would result in this:
```
# A

# B

# C
```
  But the issue then would be that it would be deserialized into two separate strings: A and B\n\nC, right? I might be all wrong here, just want to make it clear.

That turned into more than I initially thoughts. Anyways, I'm looking forward to hearing what you have to say :smile:

tracker1 commented 3 years ago

By default it will be hidden as a property with enumerable: false so it won't serialize via JSON.stringify or the like by default (same for loggers)... the override option is only for changing this behavior. The triple undersores is to expressly avoid potentially conflicting.
the blank line after bar would leave 4-6 as part of document's after
3a: foo after, 3b foo.amount before
yes, you are understanding correctly. In the case of your A, B, C ... the only way they'd get separated is if one was adjacent as an "after" block, without an empty line between.

And yes, the detection could be flawed, that said, should be able to serialize/deserialze the comments relatively easily... modifying comments programatically, a bit harder.

For me the biggest issue is being able to load, edit a value, and save with comments, so my config files wouldn't lose instructional comments.

colelawrence commented 3 years ago

I see a lot of value in this, even if there are some edge cases that are strange. Compared to having all comments removed, having some level of comment preservation (even if lossy), would be hugely valuable!

I can see the appeal of trying to make it lossless in both directions, but that doesn't seem to be a goal of this project in the first place as it doesn't maintain whitespaces between parse & stringify either.

So, is there any appetite for doing this work?

I am fairly confident that @tracker1's suggestions around comment preservation are sound, and if you were to have any concern about backwards compat, I'd encourage adding parseWithComments as a separate function, to avoid breaking any existing code.

LongTengDao commented 2 years ago

(I have a stringify solution discussion https://github.com/toml-lang/toml/issues/854, which includes comment:)

Comment

Another pain is comment. Obviously we don't want a configuration file that contains comments lose all comment information after being modified by programs.

In JavaScript (I don't know whether it's possible for other languages), [commentFor(key)] as key in tables (this gives you a symbol as key, and the value should be the comment content string), so that the comment is after the value belong the key in the final serialization.
const TOML = require('@ltd/j-toml');

const { commentFor } = TOML;

TOML.stringify({
    [commentFor('key')]: ' this is a key/value pair',
    key: 'value',
    dotted: {
        [commentFor('key')]: ' this is a dotted key/value pair',
        key: 'value',
    },
    table: {
        [commentFor('header')]: ' this is a table header (but it cannot be a table in an array of tables)',
        header: Section({
        }),
    },
});
key = 'value' # this is a key/value pair
dotted.key = 'value' # this is a dotted key/value pair

[table.header] # this is a table header (but it cannot be a table in an array of tables)

tracker1 commented 2 years ago

@LongTengDao I do like the appearance, for the manual access formatting... that said, that type of property declaration would wind up in the object model, and parse/stringify would get noisy going to/from JSON...

Perhaps...

const obj = TOML.parse(...);
TOML.commentBefore(obj, "dotted.key", "This is before the dotted key/value pair");

As a convenience access method... where the third parameter if set, sets the value... if it's only the first two, it gets the value... and to clear/remove a comment the third parameter is an empty string.

tracker1 commented 2 years ago

Alternatively...

const comment = TOML.commentBefore`${obj}.dotted.key`(); //get
TOML.commentBefore`${obj}.dotted.key`("new value..."); // set
TOML.commentBefore`${obj}.dotted.key`(""); // clear

Maybe just TOML.comment and TOML.commentAfter as the method names... wether using processing syntax or not.

iarna / iarna-toml

[Feature] Comment Preservation Proposal #42

Comment