platers / obsidian-linter

An Obsidian plugin that formats and styles your notes with a focus on configurability and extensibility.
https://platers.github.io/obsidian-linter/
MIT License
1.18k stars 80 forks source link

Bug: YAML modified timestamp / order of operations #1162

Open marblewraith opened 1 week ago

marblewraith commented 1 week ago

Describe the Bug

The order of operations is wrong, leading to unnecessary / excessive "modified" YAML property updates.

How to Reproduce

  1. Create a note with a YAML "modified" timestamp property set to be auto updated by Linter.

  2. Fill the note with typical markdown (text, lists, etc).

  3. Make change(s) A, that conflict with any other configured Linter rule. Examples: double blank line, a new bullet point with no text, a double space between words.

On save, Linter will make the correction(s). Yet despite the note content being precisely the same as before change A, the YAML timestamp of the designated "modified" property will still be updated.

Expected Behavior

Linter should be executing all the linting operations first, then doing a diff of the file to determine whether the content has actually changed. Then finally updating the YAML "modified" property if necessary.

Additional Context

Reasoning

Human error is inevitable. The entire design and purpose of linters/formatters, is acknowledging this fact to preempt it.

But in this specific case, Linters solution is ignoring human error and plastering over the modified property to "make it pretty".

I realize this proposal will lead to inconsistency with meta-data tracked by the file system. But aside from it maybe triggering my OCD a little, so what?

It's not like when the commit date / merge date in git doesn't line up precisely with the modified file attribute(s), devs are like: "Welp it must be invalid, i gotta mod the commit / revert..."

Why should this be any different?

I'd much rather have the modified timestamp be accurate and meaningful in relation to the content itself, over "the formality" of displaying filesystem / OS related data.

And in the scenario users want "the formality", it's not as though there aren't other options in Obsidian:

Side benefit

Disclaimer: I haven't looked deeply into Linters source, so from here it might just be me waffling.

The simplest way to diff the file contents is to store an additional copy of it in memory ie. File as first opened in an editor vs file after lint ops are complete.

There is also another aspect of the codebase that uses this functionality. Debugging and Testing.

pjkaufman commented 1 week ago

Hey @marblewraith , thanks for the feedback. The Linter currently does pretty much what you stated above with some slight differences if I read correctly.

The Linter compares the file contents and if they have been updated, the the value of the modified field will be updated. However, that field is also updated if the value in the YAML is more than 5 seconds different from what is stored in the file metadata or if the desired format or localization format is no longer the same. For a more in depth explanation, see https://platers.github.io/obsidian-linter/settings/yaml-rules/#yaml-timestamp.

pjkaufman commented 1 week ago

It does sound like you prefer that the content be the source of truth for the modified field which I missed in my initial reading. However, it is not possible in the Linter to do what you are saying. Here is why:
How does the Linter know when the file was last updated by the user?
It has no way to know this for 2 reasons. The first is file edits can happen outside of Obsidian. If that edit happens, the user would now be responsible for handling keeping the modified field up to date. The second issue is that the Linter does not know when a user edits a file in Obsidian. There are requests to have the Linter run when the current files contents change after a certain amount of time. However right now that is very difficult at a technical level. Right now I am attempting to get changes in place that make that change feasible. However it is not currently feasible with how the Linter works because it would cause an infinite loop of updates (there is no good way to currently weed out the Linter modifying a file from the user doing so).

It sounds like this has been mentioned because of an issue with manually syncing notes between filesystems. The Linter's date modified value is not designed for such syncing at this time, but other plugins are designed to address that issue.

TLDR
Essentially, the file system is currently the only reliable source of date created and date modified information the Linter can use. Thus if this is caused by syncing multiple devices, this is currently an unsupported use case.

I am open to PRs that would alleviate this issue (i.e. a setting and functionality that allow the user to specify that they only care about edits made in Obsidian by the user and the logic to properly handle watching file edits).

pjkaufman commented 1 week ago

Do note, I have been thinking about this and there is an already open issue here. If all you want is the yaml timestamp rule to run after a user edits the editor and no other rule runs then I can possibly make that work. I am currently trying to see if users actually want the Linter to lint a file after the user edits a file or whether they want just the timestamp rule to run after a user edits a file. If it is the latter, it is feasible to implement this with the current way the Linter works, but I may have been misunderstanding what users want.

marblewraith commented 1 week ago

It does sound like you prefer that the content be the source of truth for the modified field which I missed in my initial reading.

Correct, the content of the note should be the source of truth of the note.

However, it is not possible in the Linter to do what you are saying. Here is why: How does the Linter know when the file was last updated by the user?

It has no way to know this for 2 reasons. The first is file edits can happen outside of Obsidian. If that edit happens, the user would now be responsible for handling keeping the modified field up to date.

Which isn't a problem?... I mean yes it is a problem from a consistency standpoint, but it's not the responsibility of Linter.

Linter is only dealing with note files within the scope of Obsidian. It's unreasonable to expect it to do things outside of that.

This reads like a non-sequitur: "What-if files are updated outside Obsidian?"... Well what-if a user has non-ECC memory and a cosmic ray triggers a bit flip? Linter going to account for that too? 🤔🤣

I suppose if this scenario of "outside interference" needs to be accounted for, the modified file attribute could still be eval'd as a check?

If the modified file attribute is further into the future then the timestamp stored in the frontmatter properties, it would indicate outside interference, thus Linter can update the property to match the file attribute.

Not perfect, but the only option (i think?).

The second issue is that the Linter does not know when a user edits a file in Obsidian.

There are a few ways to do this. But as stated, the way I'd choose would be to keep a copy of the opened file content in memory. That is:

  1. When a note is opened in a editor tab, take a snapshot (copy) of the state of its contents.

  2. When Linter has been executed on the file itself, take the Linted file output, and compare it to the snapshot (diff).

  3. If the two are exactly the same do not update the "modified" timestamp, else update.

Granted there's an associated performance penalty (keeping both source and a copy in memory). But given there is typically an "upper-bound" for both the size of a note, and the max number of notes a user has open simultaneously, it's probably negligible.

I am open to PRs that would alleviate this issue (i.e. a setting and functionality that allow the user to specify that they only care about edits made in Obsidian by the user and the logic to properly handle watching file edits).

Excellent. I'm a bit tied up right now, work likes to frontload things in the latter half of the year to preempt holidays 😩, but i'll try and allocate some time ASAP.

If what you say is true and it's too difficult to change how the existing "modified" property works in Linter, perhaps there's a way to build out an entirely new feature that "treats the content as the source of truth".

This issue can be left open if you intend to look further into it, otherwise close it. I'll reopen it when i come back with a pull request, or who knows, i might dev a standalone plugin to handle it.

In any case, i appreciate the punctual response 💯

pjkaufman commented 1 week ago

I just wanted to add that the best way I see this working with the least overhead is as follows:

Add an option to the yaml timestamp rule to run a configurable amount of seconds after a user edits a file (this logic would need to be disabled when a file/folder/vault is linted). Thus after a user edits a file, it can run with the minimal amount of logic to allow for updating the YAML Timestamp. Also, there would be an option to have the file metadata be the source of truth or only respect changes made in Obsidian for modified date time changes. Note that a downside is that it assumes that a change in the editor is meant to update the timestamp no matter what, so even if the change is immediately discarded. I could take a stab at this kind of implementation this weekend if this the route we want to go. But if we wanted to go the route of running all rules, then that is a much larger concern and not as simple.

Storing files in memory while feasible in some cases, gets unreasonable for more users than I would like to admit who have several megabytes files and some are long books and essays which makes things worse. Some of the issues reported on the Linter have spawned from this and they cause performance issues (big performance issues).

marblewraith commented 1 week ago

I could take a stab at this kind of implementation this weekend if this the route we want to go. But if we wanted to go the route of running all rules, then that is a much larger concern and not as simple.

This is effectively what i want tho'... After all linting has been run on a file, is the file contents exactly the same as before linter was executed?

If so, modified timestamp in frontmatter properties stays the same, even if the modified attribute in the file system metadata changes.

Storing files in memory while feasible in some cases, gets unreasonable for more users than I would like to admit who have several megabytes files and some are long books and essays which makes things worse. Some of the issues reported on the Linter have spawned from this and they cause performance issues (big performance issues).

Interesting. If you have the time, i'm curious to know what they are / if you could link to them?

I realize there are performance issues with malloc in V8 chromium engine, but i didn't think they'd be so bad as to prevent half a dozen files from being open / functioning correctly, even with copies.

Just going off intuition even in those "extreme cases", we're still only talking Megabytes. In today's device market, even at the budget end, devices have Gigabytes in terms of RAM.

I had a few more thoughts.


Thought 1: FCIM

I suggested keeping a copy of content in RAM because it's the easiest way to allow future implementation of other Lint rules.

Pursuing this idea of a full copy in-memory (FCIM). Obsidian has 3 views: source, live preview, reading.

Just a cursory look at the DOM tells me live preview is different then reading. Doesn't that mean reading view is it's own copy?

If that's true, is there a way to freeze it? By freezing reading view and preventing it from being updated in real time (maintains initial state when opened), if it can be "mapped" (compared) to source mode / live preview, functionally it serves the same purpose as FCIM.

Thought 2: Hashing

Disregarding future lint rules, and focusing on the ability itself ie. modified timestamp updating if content changed (even after lint), Git came to mind.

Rather then storing complete copies of files, IIRC it does things with hashing and zlib compression. Maybe it's possible to do something similar here?

So instead of doing FCIM:

  1. Whenever a file is opened in an editor read the content, compress / hash it to string A. Store in-memory, even in the frontmatter of the file itself, or a file cache somewhere accessible to Obsidian.

  2. After Linter is executed, do the same process to get the new hash string B.

  3. Simple string comparison of hashes A and B. If not the same update modified timestamp and the hash, else do nothing.

Essentially this is trading memory for compute power, and probably more difficult to implement lint rules for in future, but maybe it works?

pjkaufman commented 3 days ago

I could take a stab at this kind of implementation this weekend if this the route we want to go. But if we wanted to go the route of running all rules, then that is a much larger concern and not as simple.

This is effectively what i want tho'... After all linting has been run on a file, is the file contents exactly the same as before linter was executed?

If so, modified timestamp in frontmatter properties stays the same, even if the modified attribute in the file system metadata changes.

To clarify, are you saying that it would be fine to add the following two settings and it would encompass what you are looking for?

Storing files in memory while feasible in some cases, gets unreasonable for more users than I would like to admit who have several megabytes files and some are long books and essays which makes things worse. Some of the issues reported on the Linter have spawned from this and they cause performance issues (big performance issues).

Interesting. If you have the time, i'm curious to know what they are / if you could link to them?

I realize there are performance issues with malloc in V8 chromium engine, but i didn't think they'd be so bad as to prevent half a dozen files from being open / functioning correctly, even with copies.

Just going off intuition even in those "extreme cases", we're still only talking Megabytes. In today's device market, even at the budget end, devices have Gigabytes in terms of RAM.

So, for clarification, the issue is not so much with RAM per se. It is more that there are issues with the parser in question, so my objections may not be as warranted as I thought when I made the statement. The main issues are long files that are essentially treated as a single paragraph or many really large paragraphs as that causes Obsidian to load the whole file at once and the Linter which uses the remark parser often takes close to 10 seconds to run against or it outright stalls.

Two previous examples of this were

I have since fixed the performance issue with table identification by creating my own table identification logic which is much faster. But there are still issues/reports of the Linter causing the app to freeze (see https://github.com/platers/obsidian-linter/issues/872).

Because of this, I don't want the Linter to accidentally run and trigger itself to run again on all rules because it will almost definitely cause the UI to lock. Using an in memory cache of the file's contents could actually handle this scenario, but I much prefer the idea of keeping a hash of the contents since it should be smaller in size (if my understanding is correct).

I had a few more thoughts.

Thought 1: FCIM

I suggested keeping a copy of content in RAM because it's the easiest way to allow future implementation of other Lint rules.

Pursuing this idea of a full copy in-memory (FCIM). Obsidian has 3 views: source, live preview, reading.

Just a cursory look at the DOM tells me live preview is different then reading. Doesn't that mean reading view is it's own copy?

If that's true, is there a way to freeze it? By freezing reading view and preventing it from being updated in real time (maintains initial state when opened), if it can be "mapped" (compared) to source mode / live preview, functionally it serves the same purpose as FCIM.

I am not too sure the current view matters that much. Changes really should not happen in reading view though you can trigger them by linting the file. But beyond that, the sources of each is supposed to be the same regardless of which view is currently in place. I am not opposed to keeping the contents of open files in an in memory cache, but I am not sure how this works when you lint all files in the vault or in a folder. Could you elaborate on what you would think would happen in these scenarios?

Thought 2: Hashing

Disregarding future lint rules, and focusing on the ability itself ie. modified timestamp updating if content changed (even after lint), Git came to mind.

Rather then storing complete copies of files, IIRC it does things with hashing and zlib compression. Maybe it's possible to do something similar here?

So instead of doing FCIM:

  1. Whenever a file is opened in an editor read the content, compress / hash it to string A. Store in-memory, even in the frontmatter of the file itself, or a file cache somewhere accessible to Obsidian.
  2. After Linter is executed, do the same process to get the new hash string B.
  3. Simple string comparison of hashes A and B. If not the same update modified timestamp and the hash, else do nothing.

Essentially this is trading memory for compute power, and probably more difficult to implement lint rules for in future, but maybe it works?

Lint rules as well as hashing itself would hardly be used in any rules themselves as I see it. Right now, we check if the text has changed at all when linting a file happens. If it has, the date modified will be updated. However it sounds like all you want is there to be an ability to specify that the value from the file system be ignored when setting the date modified (i.e. let the Linter making an update to a file be the sole source of truth).

I see hashing more as a way to prevent linting a file that was already linted with the same rules. It is more in the case of running all linting rules on a file x seconds after the user last made a modification to a specific file. But that may not be necessary if all we need is the two settings I asked about above.

Feel free to give your thoughts on this. I know mine can seem quite jumbled, so thanks for bearing with me!