Closed dlqqq closed 1 year ago
Sounds good to me, I do think that having more universal support for actual creation times will help.
On Fri, Oct 28, 2022 at 3:41 PM david qiu @.***> wrote:
cc @ellisonbg https://github.com/ellisonbg @davidbrochart https://github.com/davidbrochart @kevin-bates https://github.com/kevin-bates @afshin https://github.com/afshin Problem
Right now, if the filesystem does not support creation timestamps, we default to looking at the modified timestamps to verify the file is really the same file. We do this because relying on an inode number alone doesn't provide a strong guarantee of identity. For example, let's say we index the file foo with an ID of 1 (not using UUIDs here for brevity).
rm foo touch foo
This sequence of out-of-band creates and deletes generally preserves the inode number, but in reality the new file is completely different from the old file. Thus, the current implementation verifies that both the inode number and the modified timestamps are identical to the one recorded previously before assigning the same file ID to a given path. So here, file_id_manager.get_path(1) correctly returns None. We term this behavior mtime fallback.
However, one issue with mtime fallback is that out-of-band edits result in the file no longer being associated with its file ID. For example, editing foo via vim results in file_id_manager.get_path(1) returns None, even though foo is still the same file. Furthermore, the file ID 1 gets completely dropped from the database and never used again.
Originally, the initial design was to make get_path() as accurate as possible, because we viewed that incorrectly associating a file ID with a different file would be a bad outcome. However, as we are using file ID more, we are noticing that incorrectly dropping a file ID is an even worse outcome, as it can lead to catastrophic loss of data associated with a file ID. As @kevin-bates https://github.com/kevin-bates puts it:
You're choosing this "weird edge case" (that also requires the new file to have the same inode as the old file!) over simple (everyday) modifications. Could you forgo the timestamp checks against modification time when creation time is not available and you know the inodes are equal? All references should not be orphaned because a file was modified and the user happened to be on some filesystem that doesn't track crtime.
While #40 https://github.com/jupyter-server/jupyter_server_fileid/issues/40 would eliminate this problem for ext4, ext3 and other filesystems without creation timestamps would still have this issue. Proposed Solution
Add a configurable boolean trait called mtime_fallback on LocalFileIdManager, defaulting to False. Only perform mtime_fallback if this trait is True.
— Reply to this email directly, view it on GitHub https://github.com/jupyter-server/jupyter_server_fileid/issues/41, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAGXUCP7SHFTQFG6ODN6UTWFRJDBANCNFSM6AAAAAARRPGONE . You are receiving this because you were mentioned.Message ID: @.***>
-- Brian E. Granger
Senior Principal Technologist, AWS AI/ML @.***) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub
If the default is to only use creation times (when available) and given the expense of adding a configurable trait[*] that can likely introduce wrong behavior, should we defer adding the trait until we know it's necessary?
For example, to enable this trait, the user is saying to treat an out-of-band modified file as a NEW file. For the case in which the file had been deleted and recreated with the SAME name, and that the inode was unchanged, one could easily argue that the user's intention was to preserve the meaning of the file (by retaining the name), and they're just "starting over". But, simply modifying the file, should never resort in the orphaning of all external references.
I suspect there will be other edge cases that will be more severe (e.g., hard and soft links, other filesystem types) that can be focused on for now.
[*]: Adding a configurable trait is expensive. It requires documentation, is one more "knob" a user needs to fully understand (especially in this case) and requires a deprecation cycle to be removed.
should we defer adding the trait until we know it's necessary?
You're right. It doesn't seem like there's any motivation to prefer handling the extreme edge case of out-of-band delete + create at the same path in favor of handling the more common edge case of editing a file out-of-band.
Let's only add configurability when we see that people actually want it.
This will include a regression since I believe there may be a few tests that cover this, and those will need to be removed as well.
cc @ellisonbg @davidbrochart @kevin-bates @afshin
Problem
Right now, if the filesystem does not support creation timestamps, we default to looking at the modified timestamps to verify the file is really the same file. We do this because relying on an inode number alone doesn't provide a strong guarantee of identity. For example, let's say we index the file
foo
with an ID of1
(not using UUIDs here for brevity).This sequence of out-of-band creates and deletes generally preserves the inode number, but in reality the new file is completely different from the old file. Thus, the current implementation verifies that both the inode number and the modified timestamps are identical to the one recorded previously before assigning the same file ID to a given path. So here,
file_id_manager.get_path(1)
correctly returnsNone
. We term this behavior mtime fallback.However, one issue with mtime fallback is that out-of-band edits result in the file no longer being associated with its file ID. For example, editing
foo
viavim
results infile_id_manager.get_path(1)
returnsNone
, even thoughfoo
is still the same file. Furthermore, the file ID1
gets completely dropped from the database and never used again.Originally, the initial design was to make
get_path()
as accurate as possible, because I viewed that incorrectly associating a file ID with a different file would be a bad outcome. However, as we are using file ID more, we are noticing that incorrectly dropping a file ID is an even worse outcome, as it can lead to catastrophic loss of data associated with a file ID. As @kevin-bates puts it:While #40 would eliminate this problem for
ext4
,ext3
and other filesystems without creation timestamps would still have this issue.Proposed Solution
Add a configurable boolean trait called
mtime_fallback
onLocalFileIdManager
, defaulting toFalse
. Only performmtime_fallback
if this trait isTrue
.