Update models to match new project structure

martonvago commented 2 months ago

Update models to match the new setup where metadata and data files are stored on the disk as part of core.

Establish the correspondence between frictionless metadata schemas (data package, data resource, table) and our current Django models.
Create custom properties in frictionless metadata schemas matching current model properties.
Have some logic that loads properties from frictionless metadata schemas into Django models (so we can access them as if they were coming from the database).

martonvago commented 1 month ago

Notes:

Frictionless allows us to load the metadata for each project very easily. For example, we can load the datapackage.json file in a project folder by passing the path to that folder.
datapackage.json is parsed into a Package object containing a list of Resources. Each Resource has a schema with a list of Fields. Each Field defines its (data) type and any constraints (required, unique, etc.).
This suggests the following correspondence between our Django models and classes in frictionless:
- Tables = Resource + custom properties is_draft, created_at, created_by, modified_at, modified_by, last_data_upload, data_rows, files
- Files = the objects listed under the files property above
- Columns = Field + custom property machine_readable_name
- DataType = type property on Field
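As a rough illustration of this correspondence, the sketch below reads a minimal descriptor with the standard `json` module only; it does not use the frictionless API. The custom property names are the ones listed above, while the project name, resource name, file names and values are invented for the example.

```python
import json

# Hypothetical descriptor: all names and values here are made up for illustration.
descriptor = json.loads("""
{
  "name": "my-project",
  "resources": [
    {
      "name": "my-data1",
      "path": "my-data1.csv",
      "is_draft": true,
      "data_rows": 120,
      "files": [{"original_file_name": "upload.csv"}],
      "schema": {
        "fields": [
          {
            "name": "Patient ID",
            "type": "integer",
            "machine_readable_name": "patient_id",
            "constraints": {"required": true, "unique": true}
          }
        ]
      }
    }
  ]
}
""")

resource = descriptor["resources"][0]         # roughly our Tables model
files = resource["files"]                     # roughly our Files model
fields = resource["schema"]["fields"]         # roughly our Columns model
data_types = [f["type"] for f in fields]      # roughly our DataType model
```

Standard properties (`name`, `type`, `constraints`) and custom ones (`is_draft`, `machine_readable_name`) sit side by side in the same JSON objects, which is why the validation question below matters.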
We could have a folder structure for data and metadata where, within a secure-storage folder (or something equivalent), we have separate folders for each data project. Inside each data project folder we would have a single datapackage.json file and all the data files (e.g. my-data1.csv).
Then, in the Django app, we could get away with storing only the data projects in a database table with a path to the project folder. All other metadata would live in datapackage.json. If the folder containing all data projects is known / configurable in advance, then maybe we don't even have to have a list of projects in the database.
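A minimal sketch of the "no project list in the database" idea, assuming every project is a direct subfolder of one configurable storage folder and is recognised by the presence of a datapackage.json. The function name `find_project_folders` is hypothetical.

```python
from pathlib import Path


def find_project_folders(storage_root: Path) -> list[Path]:
    """Return each direct subfolder of storage_root that contains a datapackage.json."""
    return sorted(
        folder
        for folder in storage_root.iterdir()
        if folder.is_dir() and (folder / "datapackage.json").is_file()
    )
```

With this approach the folder listing itself is the list of projects, so nothing has to be kept in sync with a database table.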

Questions:

Should we use json or yaml metadata files?
Can a Table be composed of multiple Files? Most of the code suggests yes, but sometimes only one File is expected for a Table.
Why does Table have an original_file_name property when File also has this property and, pending the question above, one Table can be composed of multiple Files?
To what extent do we want to handle Resources and data Files separately? Can a Package have a Resource without a corresponding data File? And vice versa, can a data File exist in a project folder without it being listed as a Resource in the Package?
Is putting data file information into a files array on Resource acceptable?
Can we use name on Resource as a unique identifier for the Resource? We currently require it to be unique anyway.
If we include custom properties (i.e. properties that are not part of the frictionless standard) in our metadata, we need to decide how this custom data should be parsed and validated. By default, custom properties are accessible under a custom property and are not validated when the metadata is loaded from datapackage.json. We could try combining the frictionless metadata classes with e.g. pydantic dataclasses to get both validation for custom properties and frictionless functionality. The downside is added complexity, especially because we have custom properties at every level of the nested schema structure. The same question arises when we construct these objects programmatically: do we just add custom properties under custom, or do we want a more involved setup?
Should we move over to frictionless classes completely? If so, we will be dropping all of our models.
Should we move over to frictionless terminology completely? E.g. use "package" instead of "project" and "resource" instead of "metadata" everywhere, including URLs, template names etc.?
Are we really okay with not storing metadata in a database? Are there tools enabling us to get nice change log and audit information using this type of storage? Are there ways of rolling back changes or backing up data to be able to recover it if needed?
How do we make our "secure storage" secure and how do we handle access control?
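To make the custom-property validation question above more concrete, here is one lightweight option sketched with standard-library dataclasses rather than pydantic. The class `FieldCustom`, the function `parse_field_custom` and the snake_case rule for machine_readable_name are all assumptions for illustration, not existing code or an agreed rule.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldCustom:
    """Custom properties we attach to a Field (hypothetical class)."""

    machine_readable_name: str

    def __post_init__(self) -> None:
        # Assumed rule: a non-empty, lowercase snake_case identifier.
        name = self.machine_readable_name
        if not isinstance(name, str) or not re.fullmatch(r"[a-z_][a-z0-9_]*", name):
            raise ValueError(f"Invalid machine_readable_name: {name!r}")


def parse_field_custom(field_descriptor: dict) -> FieldCustom:
    """Extract and validate our custom keys from a field descriptor dict."""
    return FieldCustom(
        machine_readable_name=field_descriptor.get("machine_readable_name")
    )
```

This keeps validation of our own properties separate from whatever frictionless does with the standard ones, at the cost of one small class per level of the nested structure.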

List of subtasks:

Add an enum for frictionless data types. These are not defined in a single place in frictionless, so we'd just copy them from one of the places they're defined.
In core: create a (non-Django) File (data)class to store the file information currently stored in models.Files. A Resource will have a list of Files.
In app: replace TablesForm with ResourceForm and ColumnsForm with FieldForm. The new forms will still be Django forms but they will not be linked to a Django model. Instead, all form fields will need to be listed individually.
In core and app: update the way metadata is accessed. Whenever the data is fetched from the database through a Django model, fetch the data by loading the appropriate datapackage.json instead. E.g. move from Columns.objects.select_related("data_type").filter(tables=tables) to something along the lines of package.resources.find(name=name).schema.fields. Put the general-purpose bit of the logic into core, the rest into app.
In core and app: update the way metadata is saved. Instead of using Django models to save to the database, find, load, update and write to file the appropriate datapackage.json. Put the general-purpose bit of the logic into core, the rest into app.
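A rough sketch of how a few of these subtasks could look in core, using only the standard library. The enum values are the field types listed in the Table Schema specification (v1); the set supported by the frictionless library may differ. The `File` attributes beyond original_file_name and all function names are assumptions, and in practice the frictionless Package class would probably replace the plain-dict load and save helpers.

```python
import json
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FieldType(Enum):
    """Field types from the Table Schema specification (v1)."""

    STRING = "string"
    NUMBER = "number"
    INTEGER = "integer"
    BOOLEAN = "boolean"
    OBJECT = "object"
    ARRAY = "array"
    DATE = "date"
    TIME = "time"
    DATETIME = "datetime"
    YEAR = "year"
    YEARMONTH = "yearmonth"
    DURATION = "duration"
    GEOPOINT = "geopoint"
    GEOJSON = "geojson"
    ANY = "any"


@dataclass
class File:
    """Non-Django stand-in for models.Files; `path` is an assumed attribute."""

    original_file_name: str
    path: str


def load_descriptor(project_folder: Path) -> dict:
    """Read datapackage.json from a project folder into a plain dict."""
    text = (project_folder / "datapackage.json").read_text(encoding="utf-8")
    return json.loads(text)


def find_resource(descriptor: dict, name: str) -> dict:
    """Return the resource with the given (unique) name."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == name:
            return resource
    raise KeyError(f"No resource named {name!r}")


def save_descriptor(project_folder: Path, descriptor: dict) -> None:
    """Write the descriptor back to datapackage.json in the project folder."""
    text = json.dumps(descriptor, indent=2)
    (project_folder / "datapackage.json").write_text(text, encoding="utf-8")
```

A view in app would then load the descriptor, call `find_resource`, change what it needs, and call `save_descriptor`, instead of going through a Django model.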

martonvago commented 1 month ago

I put some foundational work on https://github.com/seedcase-project/seedcase-sprout/tree/refactor/update-models

martonvago commented 1 month ago

The branch above now contains a more detailed exploration of how I understand things should work in the new setup. It's in no way complete (I haven't worked through the entire stepper form yet) and I ignored tests and docstrings (because it'll all change anyway). It also has some bugs, so at any given commit it might not actually build, but I think it's still useful as a point of departure / conversation starter.

signekb commented 1 month ago

Very nice work, @martonvago 🔥 🔥 ! There are a lot of questions here. I'm not sure whether you and @lwjohnst86 have already discussed these, but maybe it would make sense to go through them at tomorrow's status meeting (answering/closing them or creating discussion issues for those we don't have answers for right now)? I think this will help all of us in the upcoming weeks!

lwjohnst86 commented 1 month ago

Very nice! Some of the comments are covered already by the naming scheme for project files (see https://github.com/seedcase-project/seedcase-sprout/blob/main/docs/design/naming.qmd)

lwjohnst86 commented 1 month ago

@martonvago not sure if you are still working on this, but given the focus on core functions rather than Django, I will move this out of this iteration since it isn't relevant right now. It will be relevant later though!! So keeping it open.

martonvago commented 1 month ago

Sure, good idea, I haven't done anything with it since I came back. And the outcome of the new reviews (🔥) will influence how exactly this will be done anyway!

seedcase-project / seedcase-sprout

Update models to match new project structure #512