Open rxlecky opened 1 week ago
@rxlecky , thanks for reporting this issue. It should not be too hard to address from my understanding. It just means stripping all starting and ending '
or "
when they are both present and then using that remove duplicates. Then at the end it will need to use the setting for the default escape character to determine which escape character to add back if any. This sounds reasonable enough to add back, so I will see about adding it when I get time.
Thanks for taking the time to review and respond @pjkaufman! I don't have much experience with JS development but this seems like a good entry-level issue. Since you already took the time to analyse and break down what needs to be done, I'm happy to take up writing a PR to fix this issue.
@rxlecky that works for me. Really, I think this just needs happen in getUniqueArray.
The following would likely need to change into a for loop that adds to a set:
return [...new Set(arr)];
Would become something like
const set = new Set<string>();
for (let i = 0; i < arr.length; i++) {
// check for starting and ending with either ' or "
// if so strip that from the value for the set check
// check if it is in the set, if so skip
// if a value was escaped, we will need to add back the escaping when we add it to the array
}
You may find isValueAlreadyEscaped and escapeStringIfNecessaryAndPossible helpful.
We may need a way to know if we should be using escaped strings when handling duplicates (i.e. a setting or something passed into the rule that lets the rule know to escape the value). But we could just add to an array alongside the set or maybe swap to a map that goes from the value to the index in the array, so we can update the value if we find that one of the values is escaped, so we can make sure that the value is escaped when we dedupe. I am not really sure on the best way to handle this.
The contributing guidelines may also be helpful for getting started with contributing to fix this issue: https://platers.github.io/obsidian-linter/contributing/bug-fix/
Feel free to reach out if you encounter an issue.
Nice, I had a look the other day and I also identified the getUniqueArray
method as the best place to implement the fix, so it's good we're on the same page.
I was going over the YAML specs to figure out what are all the edge cases related to escaping to get the idea for scope of this fix and I found it will be a bit more work than I initially anticipated.
Here's the breakdown of required work according to my findings:
Correctly detecting duplicates and squishing them into a single value:
There are simple cases, like the ones that I listed in my original bug report, which would be solved by the simple trimming of escape quotes as you suggested. But then there are also a bit more intricate cases, such as when using escape sequences, in which case the simple trimming would not do the trick. Here are some examples of triplets of values that should be recognised as three instances of the same value:
val
, "val"
, 'val'
quote'
, "quote'"
, 'quote'''
backslash\
, "backslash\\"
, 'backslash\'
For this reason I would suggest using the jsyaml.load()
method to parse the values to take care of all the escape sequences resolution, and only then do the deduplication pass ourselves on top of that.
Detecting when the value needs to be escaped:
There are certain characters and character sequences that are not valid in plain scalar format - i.e. non-escaped value format - and the values containing them need to be escaped to be considered valid YAML values. Additionally, the invalid characters and character sequences differ, based on whether the array is written in single-line or multi-line format. Based on my testing so far, the formatYamlArrayValue
method and the escapeStringIfNecessaryAndPossible
method that it's using under the hood don't handle escaping based on illlegal characters and character sequences at all. The only escaping it appears to be doing is for numeric values. So these functions would need to be updated with this functionality.
Choosing the correct escape character:
As you already said, we need to choose a strategy of how values are escaped after deduping. I suggest adding a new option to the rule, escapeStrategy
, with two possible values: escapeOnlyIfNecessary
and preserveEscapeCharacters
. These would function as follows:
escapeOnlyIfNecessary
: Regardless of what escape style the values were originally using, only escape the values if necessary, otherwise use plain scalar format. If values need escaping, respect the configured default escape character, regardless of which character was used in the original array.preserveEscapeCharacters
: If values in the original array were escaped, respect the user's choice of escape character, regardless of whether the values strictly need escaping or not. If the original array contains variants of the same value escaped with both single-quotes and double-quotes then the configured default escape character is used out of the two.Here we would also have to make sure that the force-yaml-escape
rule overrides this, and always escapes the selected keys, regardless of the selected general escape strategy.
Let me know if this sounds good to you and if you have any additional remarks. I will start slowly working my way through the list and address your feedback as I go. Also, let me know if you feel this work is beyond the extent of this issue and would prefer me to break it down into multiple PRs.
Hey @pjkaufman, ho here's a little update from me. I started writing tests to cover the necessary edge cases that I mentioned but I quickly came across multiple bugs in the custom YAML parsing code, in particular the convertYAMLStringToArray
method. This made me think whether it wouldn't be better to use a YAML parsing library instead of rolling our own parsing solution. I tried rewriting the method using the js-yaml library that's already in use but its capabilities are limited to just conversion between YAML and JS objects. It doesn't support parsing into some intermediate parse tree format which would be much more useful for us since we're trying to preserve as much of the original format as possible which is all lost in the conversion from/to JS objects. I went to have a look at some alternatives and found the yaml package. It is significantly more feature rich - including parsing into syntax tree representation -, it is the officially endorsed library by the YAML project, fully conforming to the YAML specs, and it's in active development, unlike js-yaml which hasn't seen an update in over two years.
Now I realise that this is beyond the scope of a small bugfix that this originally started as but I think it would be worth switching to a more robust YAML package and leveraging it to simplify our own YAML handling code, rewriting it as a thin wrapper around the library code where possible. Not only would it help eliminate the existing bugs, but it would also help prevent more in the future as doing the manual YAML parsing is rather error-prone, and frankly, it's just not worth reinventing the wheel. It may also help developing new, more intricate features in the future. I saw that switching to this library was already discussed once in #617 but it seems it was never followed through. Would you be okay with me testing the waters and trying implementing some functionality with it? If it goes well and if it can replace most of our custom parsing code we could eventually phase out js-yaml entirely.
If the yaml package doesn't support all the necessary functionality, I'm not necessarily set on using it in particular and we can have a look for alterantives. But I think it's clear that js-yaml is not quite cutting it anymore and we'll need a replacement sooner or later. It would be good to hear @mnaoumov's opinion as well, since he had some concerns regarding its capability to do roundtrip parsing in the above mentioned issue.
Let me know your thoughts on this @pjkaufman, I don't want to dive too deep without first consulting next steps with you.
@rxlecky , I would definitely be open to swapping to use that package. I actually was working on some changes around that which is hitting some bugs in that parser around a specific scenario. However, I still think swapping to that library is probably the route to go. I had just started making the change on the branch here. It is on my own fork of the repo, but it does add a base for the change when sorting YAML. Now I can say it does seem to have an issue with dealing with multi-line arrays that are empty, but that may just be the one case in question.
I am definitely open to making the YAML experience more robust. I am just not sure on all cases. But if there is a good way with yaml
to preserve the existing format while copying the value, I say go for it.
I believe my above branch does completely phase out the use of js-yaml
, but it needs some typings added and properly positioned to make it work without the need to ts-ignore or do some of the hacky things that it is.
Awesome, that's great news! Should I wait for your change to get merged before I start dabbling with the yaml
pacakge?
You can definitely start dabbling with it now. Feel free to yank some of the changes from the existing branch where needed. Right now I am not sure when it will be ready due to my availability.
Describe the Bug
The
Dedupe YAML Array Values
rule fails to detect differently escaped variations of the same value as duplicates.How to Reproduce
Here's an example markdown file with a duplicate value in its array property, only differing in how the value variations are escaped:
Expected Behavior
When running linter with
dedupe-yaml-array-values
rule turned on, I would expect the three instances of the valuea
to be collapsed into one. Instead, the property is left unchanged.Config
Logs
Additional Context
Add any other context about the problem here.