Open martonvago opened 2 months ago
Notes:

- The `datapackage.json` file in the project folder can be found using the path to the folder, for example.
- `datapackage.json` is parsed into a `Package` object containing a list of `Resource`s. Each `Resource` has a schema with a list of `Field`s. Each `Field` defines the (data)`type` of the `Field` and any `constraints` (`required`, `unique` etc).
- `Tables` = `Resource` + custom properties `is_draft`, `created_at`, `created_by`, `modified_at`, `modified_by`, `last_data_upload`, `data_rows`, `files`
- `Files` = the objects listed under the `files` property above
- `Columns` = `Field` + custom property `machine_readable_name`
- `DataType` = `type` property on `Field`
- In the `secure-storage` folder (or something equivalent), we have separate folders for each data project. Inside each data project folder we would have a single `datapackage.json` file and all the data files (e.g. `my-data1.csv`).
- In `app`, we could get away with storing only the data projects in a database table with a path to the project folder. All other metadata would live in `datapackage.json`. If the folder containing all data projects is known / configurable in advance, then maybe we don't even have to have a list of projects in the database.

Questions:

- Do we want `json` or `yaml` metadata files?
- Can a `Table` be composed of multiple `File`s? Most of the code suggests yes, but sometimes only one `File` is expected for a `Table`.
- Why does `Table` have an `original_file_name` property when `File` also has this property and, pending the question above, one `Table` can be composed of multiple `File`s?
- Should we handle `Resource`s and data `File`s separately? Can a `Package` have a `Resource` without a corresponding data `File`? And vice versa, can a data `File` exist in a project folder without it being listed as a `Resource` in the `Package`?
- Is a `files` array on `Resource` acceptable?
- Can we use `name` on `Resource` as a unique identifier for the `Resource`? We currently require it to be unique anyway.
- Custom properties are stored under the `custom` property and are not validated when the metadata file is loaded from `datapackage.json`. We could try mixing the frictionless metadata classes with e.g. pydantic dataclasses to get both validation for custom properties and frictionless functionality. The downside is that this would be a bit complex, especially because we have custom properties all the way down the nested schema structure. The same question arises when we construct these objects programmatically: do we just add custom properties under `custom` or do we want a more involved setup?

List of subtasks:

- Create an `enum` for frictionless data types. These are not defined in a single place in frictionless, so we'd just copy them from one of the places they're defined.
- `core`: create a (non-Django) `File` (data)class to store the file information currently stored in `models.Files`. A `Resource` will have a list of `File`s.
- `app`: replace `TablesForm` with `ResourceForm` and `ColumnsForm` with `FieldForm`. The new forms will still be Django forms but they will not be linked to a Django model. Instead, all form fields will need to be listed individually.
- `core` and `app`: update the way metadata is accessed. Wherever the data is currently fetched from the database through a Django model, fetch it by loading the appropriate `datapackage.json` instead. E.g. move from `Columns.objects.select_related("data_type").filter(tables=tables)` to something along the lines of `package.resources.find(name=name).schema.fields`. Put the general-purpose bit of the logic into `core`, the rest into `app`.
- `core` and `app`: update the way metadata is saved. Instead of using Django models to save to the database, find, load, update and write to file the appropriate `datapackage.json`. Put the general-purpose bit of the logic into `core`, the rest into `app`.

I put some foundational work on https://github.com/seedcase-project/seedcase-sprout/tree/refactor/update-models
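
To make the object model from the notes and the first two subtasks a bit more concrete, here is a minimal sketch using only the Python standard library. It is not the frictionless API and not the code on the branch: the class shapes, the `parse_package` helper, and the choice to read `files` and `machine_readable_name` straight from the descriptor are all assumptions for illustration. The `DataType` values are the field types listed in the Table Schema specification (v1), as far as I know them; they would need checking against whichever place in frictionless we copy from.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DataType(Enum):
    """Field types as listed in the Table Schema spec (v1); verify before use."""
    STRING = "string"
    NUMBER = "number"
    INTEGER = "integer"
    BOOLEAN = "boolean"
    OBJECT = "object"
    ARRAY = "array"
    DATE = "date"
    TIME = "time"
    DATETIME = "datetime"
    YEAR = "year"
    YEARMONTH = "yearmonth"
    DURATION = "duration"
    GEOPOINT = "geopoint"
    GEOJSON = "geojson"
    ANY = "any"


@dataclass
class File:
    # Hypothetical shape: only original_file_name is mentioned in this issue.
    original_file_name: str


@dataclass
class Field:
    name: str
    type: DataType
    constraints: dict = field(default_factory=dict)
    # Custom property from the notes ("Columns = Field + machine_readable_name").
    machine_readable_name: Optional[str] = None


@dataclass
class Resource:
    name: str
    fields: list = field(default_factory=list)  # list of Field
    files: list = field(default_factory=list)   # list of File (custom property)


@dataclass
class Package:
    name: str
    resources: list = field(default_factory=list)  # list of Resource


def parse_package(descriptor: dict) -> Package:
    """Build the nested Package -> Resource -> Field structure from a descriptor."""
    resources = []
    for res in descriptor.get("resources", []):
        fields = [
            Field(
                name=f["name"],
                type=DataType(f["type"]),  # raises ValueError on an unknown type
                constraints=f.get("constraints", {}),
                machine_readable_name=f.get("machine_readable_name"),
            )
            for f in res.get("schema", {}).get("fields", [])
        ]
        files = [File(**f) for f in res.get("files", [])]
        resources.append(Resource(name=res["name"], fields=fields, files=files))
    return Package(name=descriptor.get("name", ""), resources=resources)


# Made-up descriptor, standing in for the contents of a datapackage.json.
EXAMPLE = json.loads("""
{
  "name": "demo-project",
  "resources": [{
    "name": "my-data1",
    "files": [{"original_file_name": "my-data1.csv"}],
    "schema": {"fields": [
      {"name": "id", "type": "integer",
       "constraints": {"required": true, "unique": true}},
      {"name": "label", "type": "string", "machine_readable_name": "label"}
    ]}
  }]
}
""")

package = parse_package(EXAMPLE)
```

Using an `Enum` here means an unsupported `type` fails loudly at load time, which is one way to get some validation without pulling in pydantic; whether that is enough for the custom properties is still the open question above.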
The branch above now contains a more detailed exploration of how I understood things should work in the new setup. It's in no way complete (I haven't worked through the entire stepper form yet) and I ignored tests and docstrings (because it'll all change anyway). It also has some bugs etc., so at any given commit it might not actually build, but it's still useful as a point of departure / conversation starter, I think.
Very nice work, @martonvago 🔥 🔥 ! There are a lot of questions here. I'm not sure whether you and @lwjohnst86 have already discussed these, but maybe it would make sense to go through them at tomorrow's status meeting (answering/closing them or creating discussion issues for those we don't have answers for right now)? I think this will help all of us in the upcoming weeks!
Very nice! Some of the comments are covered already by the naming scheme for project files (see https://github.com/seedcase-project/seedcase-sprout/blob/main/docs/design/naming.qmd)
@martonvago not sure if you are still working on this, but given the focus on core functions rather than Django, I will move this out of this iteration since it isn't relevant right now. It will be relevant later though!! So keeping it open.
Sure, good idea, I haven't done anything with it since I came back. And the outcome of the new reviews (🔥) will influence how exactly this will be done anyway!
Update models to match the new setup where metadata and data files are stored on the disk as part of `core`.
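
As a rough illustration of what "metadata stored on the disk" could mean for the find, load, update and write cycle, here is a small standard-library sketch. The helper names (`load_descriptor`, `find_resource`, `save_descriptor`) are hypothetical and not part of frictionless or the existing code; it also assumes custom properties such as `is_draft` are written directly onto the resource descriptor, which is one of the open questions rather than a decision.

```python
import json
import tempfile
from pathlib import Path

DESCRIPTOR_NAME = "datapackage.json"


def load_descriptor(project_dir: Path) -> dict:
    """Read the project's datapackage.json into a plain dict."""
    return json.loads((project_dir / DESCRIPTOR_NAME).read_text(encoding="utf-8"))


def find_resource(descriptor: dict, name: str) -> dict:
    """Return the resource with the given name (assumed unique in the package)."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == name:
            return resource
    raise KeyError(f"No resource named {name!r}")


def save_descriptor(project_dir: Path, descriptor: dict) -> None:
    """Write to a temporary file first, then swap it into place."""
    path = project_dir / DESCRIPTOR_NAME
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    tmp_path.replace(path)


# Demo in a throwaway folder standing in for one data project folder.
with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp)
    save_descriptor(project_dir, {
        "name": "demo-project",
        "resources": [{
            "name": "my-data1",
            "path": "my-data1.csv",
            "schema": {"fields": [{"name": "id", "type": "integer"}]},
        }],
    })

    # Access: roughly what would replace the Django query on Columns.
    descriptor = load_descriptor(project_dir)
    resource = find_resource(descriptor, "my-data1")
    field_names = [f["name"] for f in resource["schema"]["fields"]]

    # Save: update a custom property and write the file back.
    resource["is_draft"] = False
    save_descriptor(project_dir, descriptor)
    reloaded = load_descriptor(project_dir)
```

The general-purpose parts (`load_descriptor`, `find_resource`, `save_descriptor`) are the kind of thing that would sit in `core`, with `app` only deciding which project folder and resource name to pass in.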