jacebrowning / datafiles

A file-based ORM for Python dataclasses.
https://datafiles.readthedocs.io
MIT License

Best way to map existing files to Model instances #236

Open kgpayne opened 2 years ago

kgpayne commented 2 years ago

Is there a way to manually map existing files that share a schema, but do not follow a repeatable pattern on disk, to datafiles Model instances? The use case is config files spread across arbitrary-depth subfolders below a top-level project directory. Using glob I can find the files I am interested in mapping, but I am not having much success creating mapped instances of those discovered files.

I have tried:

However, in both cases this results in odd behaviour: every instance with nested attributes points to the most recently loaded file's nested object rather than its own 🤦‍♂️

Is this a completely unsupported use-case, or is there another way to use datafiles to map files discovered outside of the supported 'pattern' construct to instances of a datafiles Model? Thank you!

jacebrowning commented 2 years ago

There are ways to make this work but it's not well documented.

Internally, datafiles replaces /*/ with /**/ in patterns for searching arbitrary depths with iglob():

https://github.com/jacebrowning/datafiles/blob/124ee315edeb14d90eafd6efd7de9fa594984283/datafiles/manager.py#L80-L86
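The effect of that substitution can be sketched with the standard library alone. This is not the datafiles source; `find_at_any_depth` and `demo_depth_search` are hypothetical names, and the directory layout is made up for illustration. It only shows how turning `/*/` into `/**/` and globbing with `recursive=True` lets one pattern match files at any depth.

```python
import os
import tempfile
from glob import iglob
from typing import List


def find_at_any_depth(pattern: str) -> List[str]:
    """Replace a single-level '/*/' with '/**/' and glob recursively."""
    sep = os.sep
    splatted = pattern.replace(f"{sep}*{sep}", f"{sep}**{sep}")
    return sorted(iglob(splatted, recursive=True))


def demo_depth_search() -> List[str]:
    """Create config files at three depths and search with one pattern."""
    with tempfile.TemporaryDirectory() as root:
        layouts = (
            ("config.yaml",),
            ("team_one", "config.yaml"),
            ("team_one", "nested", "config.yaml"),
        )
        for parts in layouts:
            path = os.path.join(root, *parts)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as handle:
                handle.write("version: 1\n")
        # Without the substitution this pattern would match depth one only.
        pattern = os.path.join(root, "*", "config.yaml")
        return [os.path.relpath(p, root) for p in find_at_any_depth(pattern)]


matches = demo_depth_search()
```

Because `**` also matches zero directories, the file directly under the root is included alongside the nested ones.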

So, if you include part of the path in the pattern as an attribute with a default value of *, then model.objects.all() should find all matching config files and set that partial path attribute on the loaded instances. Here's an example of me doing that in another library that uses datafiles:

https://github.com/jacebrowning/pomace/blob/6511f04e502c5980f1172504ee5dc35224524c79/pomace/models.py#L294-L301


Let me know if that works for you! I think the feature needs to be made more explicit and documented.

kgpayne commented 2 years ago

Thanks for getting back to me! I have a really basic implementation working (using a simple my_project/{self.name}.yaml pattern) but adding a /*/ to my pattern didn't work 🤔 Still, to complicate matters there are multiple kinds of YAML config in the folder hierarchy. We are trying to allow our users to break up one large project.yaml file into a parent project.yaml and an arbitrary number of child configs referenced as glob patterns under a key in the parent project.yaml. Here is a pared-down example:

# my_project/project.yaml
include_paths:
  - '**.yaml'  # list of discovered files will always exclude the statically-located project.yaml at the root of the project to avoid duplication
plugins:
  extractors:
    - name: project-tap-1
      variant: meltano

# my_project/team_one/subfile_1.yaml
plugins:
  extractors:
    - name: subfile-1-tap-1
      variant: custom

# my_project/subfile_2.yaml
plugins:
  extractors:
    - name: subfile-2-tap-1
      variant: custom

from dataclasses import dataclass, field
from typing import List

# all plain dataclasses
from .base import ConfigBase, ExtractorConfig, LoaderConfig, ScheduleConfig

@dataclass
class Plugins:
    extractors: List[ExtractorConfig] = field(default_factory=list)
    loaders: List[LoaderConfig] = field(default_factory=list)

@dataclass
class MeltanoFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)
    include_paths: List[str] = field(default_factory=list)
    version: int = 1

@dataclass
class SubFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)

I want to be able to take over responsibility for discovering the 'root' project.yaml file and then, using the glob patterns in include_paths, discovering any matching file paths and passing them to SubFile to be mapped. Does that make sense?

If this is possible, we can then build a Project class to index plugins (in this case extractors) and provide a CRUD interface to modify plugin config wherever the actual files are in the project hierarchy 😅

kgpayne commented 2 years ago

The way I am thinking about this is conceptually similar to how SQLAlchemy's Classical Mapper works: objects and persistence are defined separately and then explicitly mapped 🙂 Ideally the schema and converters would be attached to a File class, with instances representing individual files on the filesystem. Then, by mapping one of the file instances to a dataclass with matching attribute names/types, you get a mutable Python object whose changes are reflected on disk.

It looks like datafiles is doing this under the hood, but I can't figure out the Mapping step.

jacebrowning commented 2 years ago

> but adding a /*/ to my pattern didn't work

I'd be curious to see more sample code of what you tried and the result.

> and then, using the glob patterns in include_paths, discovering any matching file paths

For that, you could possibly use create_model directly:

from datafiles.model import create_model

parent_config = MeltanoFile(name='project')

for pathname in _iterate_globs(parent_config.include_paths):
    model = create_model(SubFile, pattern=pathname)
    child_config = model()  # 'pattern' should only match a single file
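`_iterate_globs` is not defined anywhere in the thread. A minimal sketch of what such a helper might look like, assuming the patterns are relative to the project root and that the root project.yaml should be skipped; the `root` and `exclude` parameters and `demo_include_paths` are hypothetical. Note that the standard glob module only recurses when `**` is an entire path component, so the sketch uses `**/*.yaml` rather than `**.yaml`.

```python
import os
import tempfile
from glob import iglob
from typing import Iterable, Iterator, List


def _iterate_globs(
    patterns: Iterable[str], root: str, exclude: str = "project.yaml"
) -> Iterator[str]:
    """Yield each file matched by any pattern once, skipping the root file."""
    seen = set()
    excluded = os.path.abspath(os.path.join(root, exclude))
    for pattern in patterns:
        for pathname in sorted(iglob(os.path.join(root, pattern), recursive=True)):
            absolute = os.path.abspath(pathname)
            if absolute == excluded or absolute in seen:
                continue
            if not os.path.isfile(absolute):
                continue  # '**' can also match directories
            seen.add(absolute)
            yield pathname


def demo_include_paths() -> List[str]:
    """Recreate the thread's example layout and expand two overlapping globs."""
    with tempfile.TemporaryDirectory() as root:
        layouts = (
            ("project.yaml",),
            ("subfile_2.yaml",),
            ("team_one", "subfile_1.yaml"),
        )
        for parts in layouts:
            path = os.path.join(root, *parts)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as handle:
                handle.write("plugins: {}\n")
        found = _iterate_globs(["**/*.yaml", "team_one/*.yaml"], root)
        return [os.path.relpath(p, root) for p in found]


discovered = demo_include_paths()
```

Each yielded pathname could then be passed as the `pattern` argument in the loop above.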

kgpayne commented 2 years ago

Glad we are on the same lines - I tried create_model first before playing with subclassing. Here is the full PoC codebase, with a notebook I have been working in. For the project file (meltano.yaml), which is only instantiated once, everything works as expected. However, the subfiles are garbled - e.g. the subfile_1.datafile.text of subfile_1 is a strange concatenation of the nested objects from both subfile_1 and subfile_3, even though the path is correct 🤔 This of course means the information written on .save() is incorrect. Hopefully this is just a bug with nested objects and this use case isn't as far out of scope as I imagined 🤞

jacebrowning commented 2 years ago

Since create_model patches the class, I could see how calling it multiple times with the same class could create strange results -- the expectation is that pattern defines all possible instances' files.

> Hopefully this is just a bug with nested objects

To confirm that, perhaps you could try paring SubFile down to only include builtin types?
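A related check, offered as an assumption rather than a confirmed diagnosis: the models posted earlier declare `plugins: Plugins = Plugins()`. That default is evaluated once when the class is defined, so on Python versions before 3.11 every instance created without an explicit `plugins` shares the same nested object (3.11 and later reject such a default with a ValueError). Whether this explains the behaviour reported here is not established in the thread. The sketch below uses `default_factory` instead; `ExtractorConfig` is a hypothetical stand-in for the one imported from `.base`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractorConfig:
    # Hypothetical stand-in; the real class lives in the poster's .base module.
    name: str
    variant: str = "custom"


@dataclass
class Plugins:
    extractors: List[ExtractorConfig] = field(default_factory=list)


@dataclass
class SubFile:
    # default_factory builds a fresh Plugins for every SubFile instance,
    # whereas `= Plugins()` would be created once at class definition time.
    plugins: Plugins = field(default_factory=Plugins)


first = SubFile()
second = SubFile()
first.plugins.extractors.append(ExtractorConfig(name="subfile-1-tap-1"))
```

With this change, appending to `first.plugins.extractors` leaves `second.plugins.extractors` empty, which makes it easier to tell a shared default apart from a problem in how the files are mapped.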