jupyter-server / jupyter_server_fileid

An extension that maintains file IDs for documents in a running Jupyter Server
https://jupyter-server-fileid.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
3 stars 12 forks source link

remove mtime fallback #41

Closed dlqqq closed 1 year ago

dlqqq commented 1 year ago

cc @ellisonbg @davidbrochart @kevin-bates @afshin

Problem

Right now, if the filesystem does not support creation timestamps, we default to looking at the modified timestamps to verify the file is really the same file. We do this because relying on an inode number alone doesn't provide a strong guarantee of identity. For example, let's say we index the file foo with an ID of 1 (not using UUIDs here for brevity).

rm foo
touch foo

This sequence of out-of-band creates and deletes generally preserves the inode number, but in reality the new file is completely different from the old file. Thus, the current implementation verifies that both the inode number and the modified timestamps are identical to the one recorded previously before assigning the same file ID to a given path. So here, file_id_manager.get_path(1) correctly returns None. We term this behavior mtime fallback.

However, one issue with mtime fallback is that out-of-band edits result in the file no longer being associated with its file ID. For example, editing foo via vim results in file_id_manager.get_path(1) returns None, even though foo is still the same file. Furthermore, the file ID 1 gets completely dropped from the database and never used again.

Originally, the initial design was to make get_path() as accurate as possible, because I viewed that incorrectly associating a file ID with a different file would be a bad outcome. However, as we are using file ID more, we are noticing that incorrectly dropping a file ID is an even worse outcome, as it can lead to catastrophic loss of data associated with a file ID. As @kevin-bates puts it:

You're choosing this "weird edge case" (that also requires the new file to have the same inode as the old file!) over simple (everyday) modifications. Could you forgo the timestamp checks against modification time when creation time is not available and you know the inodes are equal? All references should not be orphaned because a file was modified and the user happened to be on some filesystem that doesn't track crtime.

While #40 would eliminate this problem for ext4, ext3 and other filesystems without creation timestamps would still have this issue.

Proposed Solution

Add a configurable boolean trait called mtime_fallback on LocalFileIdManager, defaulting to False. Only perform mtime_fallback if this trait is True.

ellisonbg commented 1 year ago

Sounds good to me, I do think that having more universal support for actual creation times will help.

On Fri, Oct 28, 2022 at 3:41 PM david qiu @.***> wrote:

cc @ellisonbg https://github.com/ellisonbg @davidbrochart https://github.com/davidbrochart @kevin-bates https://github.com/kevin-bates @afshin https://github.com/afshin Problem

Right now, if the filesystem does not support creation timestamps, we default to looking at the modified timestamps to verify the file is really the same file. We do this because relying on an inode number alone doesn't provide a strong guarantee of identity. For example, let's say we index the file foo with an ID of 1 (not using UUIDs here for brevity).

rm foo touch foo

This sequence of out-of-band creates and deletes generally preserves the inode number, but in reality the new file is completely different from the old file. Thus, the current implementation verifies that both the inode number and the modified timestamps are identical to the one recorded previously before assigning the same file ID to a given path. So here, file_id_manager.get_path(1) correctly returns None. We term this behavior mtime fallback.

However, one issue with mtime fallback is that out-of-band edits result in the file no longer being associated with its file ID. For example, editing foo via vim results in file_id_manager.get_path(1) returns None, even though foo is still the same file. Furthermore, the file ID 1 gets completely dropped from the database and never used again.

Originally, the initial design was to make get_path() as accurate as possible, because we viewed that incorrectly associating a file ID with a different file would be a bad outcome. However, as we are using file ID more, we are noticing that incorrectly dropping a file ID is an even worse outcome, as it can lead to catastrophic loss of data associated with a file ID. As @kevin-bates https://github.com/kevin-bates puts it:

You're choosing this "weird edge case" (that also requires the new file to have the same inode as the old file!) over simple (everyday) modifications. Could you forgo the timestamp checks against modification time when creation time is not available and you know the inodes are equal? All references should not be orphaned because a file was modified and the user happened to be on some filesystem that doesn't track crtime.

While #40 https://github.com/jupyter-server/jupyter_server_fileid/issues/40 would eliminate this problem for ext4, ext3 and other filesystems without creation timestamps would still have this issue. Proposed Solution

Add a configurable boolean trait called mtime_fallback on LocalFileIdManager, defaulting to False. Only perform mtime_fallback if this trait is True.

— Reply to this email directly, view it on GitHub https://github.com/jupyter-server/jupyter_server_fileid/issues/41, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAGXUCP7SHFTQFG6ODN6UTWFRJDBANCNFSM6AAAAAARRPGONE . You are receiving this because you were mentioned.Message ID: @.***>

-- Brian E. Granger

Senior Principal Technologist, AWS AI/ML @.***) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub

kevin-bates commented 1 year ago

If the default is to only use creation times (when available) and given the expense of adding a configurable trait[*] that can likely introduce wrong behavior, should we defer adding the trait until we know it's necessary?

For example, to enable this trait, the user is saying to treat an out-of-band modified file as a NEW file. For the case in which the file had been deleted and recreated with the SAME name, and that the inode was unchanged, one could easily argue that the user's intention was to preserve the meaning of the file (by retaining the name), and they're just "starting over". But, simply modifying the file, should never resort in the orphaning of all external references.

I suspect there will be other edge cases that will be more severe (e.g., hard and soft links, other filesystem types) that can be focused on for now.

[*]: Adding a configurable trait is expensive. It requires documentation, is one more "knob" a user needs to fully understand (especially in this case) and requires a deprecation cycle to be removed.

dlqqq commented 1 year ago

should we defer adding the trait until we know it's necessary?

You're right. It doesn't seem like there's any motivation to prefer handling the extreme edge case of out-of-band delete + create at the same path in favor of handling the more common edge case of editing a file out-of-band.

Let's only add configurability when we see that people actually want it.

dlqqq commented 1 year ago

This will include a regression since I believe there may be a few tests that cover this, and those will need to be removed as well.