Open kgpayne opened 2 years ago
There are ways to make this work, but it's not well documented. Internally, datafiles replaces /*/ with /**/ in patterns for searching arbitrary depths with iglob():
So, if you include part of the path in pattern with a default value of *, then model.objects.all() should find all matching config files and set the partial path attribute on loaded instances. Here's an example of me doing that in another library that uses datafiles:
Let me know if that works for you! I think the feature needs to be made more explicit and documented.
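For readers unfamiliar with the difference between /*/ and /**/, here is a minimal standard-library sketch. It does not use datafiles and is not its internal code; the directory and file names are made up for illustration. It only shows how glob.iglob() treats the two patterns when recursive=True is passed.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Two hypothetical config files at different depths below the root
    shallow = os.path.join(root, "team_one", "config.yaml")
    deep = os.path.join(root, "team_one", "nested", "config.yaml")
    for path in (shallow, deep):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # "/*/" matches exactly one directory level
    single = sorted(glob.iglob(os.path.join(root, "*", "config.yaml")))

    # "/**/" with recursive=True matches zero or more directory levels
    recursive = sorted(
        glob.iglob(os.path.join(root, "**", "config.yaml"), recursive=True)
    )

    single_rel = [os.path.relpath(p, root) for p in single]
    recursive_rel = [os.path.relpath(p, root) for p in recursive]
```

Here single_rel holds only the file one level down, while recursive_rel also picks up the nested one.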
Thanks for getting back to me! I have a really basic implementation working (using a simple my_project/{self.name}.yaml pattern) but adding a /*/ to my pattern didn't work 🤔 Still, to complicate matters there are multiple kinds of YAML config in the folder hierarchy. We are trying to allow our users to break up one large project.yaml file into a parent project.yaml and an arbitrary number of child configs referenced as glob patterns under a key in the parent project.yaml. Here is a pared-down example:
# my_project/project.yaml
include_paths:
  - '**.yaml'  # list of discovered files will always exclude the statically-located project.yaml at the root of the project to avoid duplication
plugins:
  extractors:
    - name: project-tap-1
      variant: meltano

# my_project/team_one/subfile_1.yaml
plugins:
  extractors:
    - name: subfile-1-tap-1
      variant: custom

# my_project/subfile_2.yaml
plugins:
  extractors:
    - name: subfile-2-tap-1
      variant: custom
from dataclasses import dataclass, field
from typing import List

# all plain dataclasses
from .base import ConfigBase, ExtractorConfig, LoaderConfig, ScheduleConfig

@dataclass
class Plugins:
    extractors: List[ExtractorConfig] = field(default_factory=list)
    loaders: List[LoaderConfig] = field(default_factory=list)

@dataclass
class MeltanoFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)
    include_paths: List[str] = field(default_factory=list)
    version: int = 1

@dataclass
class SubFile:
    plugins: Plugins = Plugins()
    schedules: List[ScheduleConfig] = field(default_factory=list)
I want to be able to take over responsibility for discovering the 'root' project.yaml file and then, using the glob patterns in include_paths, discovering any matching file paths and passing them to SubFile to be mapped. Does that make sense?
If this is possible, we can then build a Project class to index plugins (in this case extractors) and provide a CRUD interface to modify plugin config wherever the actual files are in the project hierarchy 😅
The way I am thinking about this is conceptually similar to how SQLAlchemy's Classical Mapper works: object and persistence are defined separately and then explicitly mapped 🙂 Ideally the schema and converters would be attached to a File class, with instances representing individual files on the filesystem. Then, by mapping one of the file instances to a dataclass with matching attribute names/types, you get a mutable Python object whose changes are reflected on disk.
It looks like datafiles is doing this under the hood, but I can't figure out the mapping step.
but adding a /*/ to my pattern didn't work
I'd be curious to see more sample code of what you tried and the result.
and then, using the glob patterns in include_paths, discovering any matching file paths
For that, you could possibly use create_model directly:
from datafiles.model import create_model

parent_config = MeltanoFile(name='project')
# _iterate_globs stands in for a helper that expands the glob patterns into file paths
for pathname in _iterate_globs(parent_config.include_paths):
    model = create_model(SubFile, pattern=pathname)
    child_config = model()  # 'pattern' should only match a single file
Glad we are on the same lines - I tried create_model first before playing with subclassing. Here is the full PoC codebase, with a notebook I have been working in. For the project file (meltano.yaml), which is only instantiated once, everything works as expected. However, the subfiles are garbled - e.g. the subfile_1.datafile.text of subfile_1 is a strange concatenation of the nested objects from both subfile_1 and subfile_3, even though the path is correct 🤔 This of course means the written information on .save() is incorrect. Hopefully this is just a bug with nested objects and this use case isn't as far out of scope as I imagined 🤞
Since create_model patches the class, I could see how calling it multiple times with the same class could create strange results -- the expectation is that pattern defines all possible instances' files.
Hopefully this is just a bug with nested objects
To confirm that, perhaps you could try paring down SubFile to only include builtin types?
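One thing that may be worth ruling out first, as an assumption not confirmed anywhere in this thread: a dataclass field declared as plugins: Plugins = Plugins() uses a single Plugins object as the default for every instance (and on Python 3.11 and later dataclasses rejects such an unhashable default with a ValueError). The sketch below uses simplified stand-in classes, not the real configs, to show how a shared nested object leaks changes between instances and how field(default_factory=...) avoids it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plugins:
    # Simplified stand-in: extractor names only
    extractors: List[str] = field(default_factory=list)


@dataclass
class SubFile:
    # default_factory builds a fresh Plugins for every instance
    plugins: Plugins = field(default_factory=Plugins)


# What a single shared default amounts to: both instances
# hold a reference to the same Plugins object.
shared = Plugins()
a = SubFile(plugins=shared)
b = SubFile(plugins=shared)
b.plugins.extractors.append("subfile-2-tap-1")
shared_leak = a.plugins.extractors  # a sees the change made through b

# With default_factory, each instance owns its nested object.
c = SubFile()
d = SubFile()
d.plugins.extractors.append("subfile-2-tap-1")
independent = c.plugins.extractors  # unaffected by the change to d
```

If the nested objects in the real models turn out to be shared like a and b here, that would be consistent with instances appearing to point at the most recently loaded file's data, but it would need checking against the actual code.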
Is there a way to map existing files with the same schema that do not match a repeatable pattern on disk to a datafiles Model instance manually? The use case is config files spread across arbitrary-depth subfolders below a top-level project directory. Using glob I can find the files I am interested in mapping, but I am not having much success creating mapped instances of those discovered files.
I have tried:
- Model.Meta.dataclass_pattern with each discovered file's path and calling Model.objects.get()
- pattern defined and then overriding both the instance.Meta.datafiles_pattern and instance.datafile.path attributes on the instance, with the correct path for the discovered file, before calling instance.datafile.load()
However, in both cases this results in odd behaviour where all instances with nested attributes contain pointers to the most recently loaded file's nested object rather than their own 🤦‍♂️
Is this a completely unsupported use-case, or is there another way to use datafiles to map files discovered outside of the supported 'pattern' construct to instances of a datafiles Model? Thank you!