jacebrowning / datafiles

A file-based ORM for Python dataclasses.
https://datafiles.readthedocs.io
MIT License
198 stars 18 forks source link

Allow synchronization of filename attributes #303

Open tbabej opened 1 year ago

tbabej commented 1 year ago

Currently, datafiles handles any attributes that are part of the filepath as not being stored in the file itself. Personally I consider this a good design, as duplicating the value would potentially cause issues if the "filename" and "file content" values did not agree.

However, it seems we are unable to preserve any changes to the attributes that are part of the filename, effectively rendering them read-only. Consider a following datafile called JohnDoe.yml:

age: 42

and an associated short script:

@datafiles.datafile("{self.name}.yml", defaults=True)
class Person:
    name: str
    age: int

person = list(Person.objects.all())[0]
print(person)

As expected, this will produce the following output:

Person(name='JohnDoe', age=42)

However, if we continue to modify the model instance:

person.name="BobBuilder"
person.age=24

we can observe on the filesystem that the file has been modified:

age: 24

however, the filename itself has not been changed, it is still JohnDoe.yml, resulting in the following data:

Person(name='JohnDoe', age=24)

which does not align itself with the expected output from the viewpoint of the user. I would argue we should support one of the two options: a) the "filename" attributes should be frozen and not allowed to be modified b) or we should allow their modification, but rename the files, effectively preserving their modified value in the filename

If renaming is too drastic for it to be the default, we could perhaps have (a) as the default behaviour and allow (b) via an optional rename=True flag (similar how we handle defaults=True).

jacebrowning commented 1 year ago

Thanks for the clear bug report!

b) or we should allow their modification, but rename the files, effectively preserving their modified value in the filename

I'd worry that having Datafiles rename files will cause all sorts of unforeseen problems. For example, what if two processes have the file open? If manual=True is set then the attribute and filename will potentially disagree.

a) the "filename" attributes should be frozen and not allowed to be modified

On first pass, I think this is preferable and filename attributes should always be frozen.

If a filename attribute is modified, what should happen? Ignore that attribute's changed value or raise an exception? This feels potentially related to https://github.com/jacebrowning/datafiles/pull/294.

tbabej commented 1 year ago

In general I agree that (a) is preferable as the default option. In particular, your suggested solution of ignoring the change and raising an exception sounds like the right solution here.

That said, I believe there are valid use cases for changing the "filename" attributes and having them reflect in the filename - especially if the behaviour is opt-in via a rename=True flag.

The concern about two processes accessing the same file is valid, but I never felt like that was a fully supported use case - I suspect the users can already mangle the data by having the processes overwriting each the same datafiles (i.e. if using manual=True for performance reasons). We could avoid this issue and the problem with renaming by introducing a filesystem-level locking mechanism via something like fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB), which could prevent both the overwriting and renaming.

If all of this sounds acceptable, I would volunteer to implement the required changes. I already have a preliminary version of (b) implemented, and I can also contribute (a) and potentially file locking if that's what it would take to get the changes merged upstream :slightly_smiling_face:

jacebrowning commented 1 year ago

I already have a preliminary version of (b) implemented

I'd like to see what that looks like in a PR!

Given that the defaults=True setting combined with modifying filename attributes is already a fairly uncommon scenario, perhaps my concerns over file contention are exaggerated. 😄

tbabej commented 1 year ago

@jacebrowning I put up the preliminary changes, have a look at #304 :slightly_smiling_face:

Given that the defaults=True setting combined with modifying filename attributes is already a fairly uncommon scenario, perhaps my concerns over file contention are exaggerated.

I wouldn't discount the possibility, I really do think this is a valid concern. But as I mentioned, we currently lack the mechanism to prevent data corruption in the scenario of multiple processes accessing the same data files. I haven't tested it, but I am fairly certain that with sufficiently high amount of simultaneous processes reading and modifying the same data files, even the current state of things would lead to data corruption. But that's perhaps a discussion for a different issue :slightly_smiling_face: