octue / twined

A library to help data services talk to one another
https://twined.readthedocs.io
MIT License

Use tag templates to validate tag metadata on taggables #79

Open cortadocodes opened 3 years ago

cortadocodes commented 3 years ago

We currently have tags that can have any number of subtags. We are going to move to:

The tags should conform to a tag schema/template that includes:

We may need a custom JSON parser for this.
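As a rough illustration of what "conforming to a tag template" could mean in practice, here is a minimal stdlib-only sketch. The template format (tag name mapped to a JSON type name), the `TAG_TEMPLATE` contents and the `validate_tags` helper are all assumptions made up for this example, not twined's API; a real implementation would more likely express the template as a JSON schema.

```python
# Hypothetical tag template: maps each allowed tag name to the JSON type it must have.
TAG_TEMPLATE = {
    "manufacturer": "string",
    "height": "number",
    "is_recycled": "boolean",
    "number_of_blades": "number",
}

# Python types accepted for each JSON type name used in the template.
JSON_TYPES = {
    "string": (str,),
    "number": (int, float),
    "boolean": (bool,),
}


def validate_tags(tags, template):
    """Return a list of error messages; an empty list means the tags conform."""
    errors = []
    for name, value in tags.items():
        if name not in template:
            errors.append(f"unexpected tag {name!r}")
            continue
        expected = template[name]
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        if expected == "number" and isinstance(value, bool):
            errors.append(f"tag {name!r} should be a {expected}")
        elif not isinstance(value, JSON_TYPES[expected]):
            errors.append(f"tag {name!r} should be a {expected}")
    return errors


good = {"manufacturer": "vestas", "height": 500, "is_recycled": True, "number_of_blades": 3}
bad = {"manufacturer": "vestas", "height": "tall", "colour": "white"}

print(validate_tags(good, TAG_TEMPLATE))  # no errors expected
print(validate_tags(bad, TAG_TEMPLATE))   # one type error, one unexpected tag
```

Returning a list of messages rather than raising on the first problem is just one design choice; it makes it easier to report every offending tag on a datafile at once.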

To do:

cortadocodes commented 3 years ago

@thclark do we want to require that tag templates are provided in twine.json or make it optional? It won't be backwards compatible if we require it, but it will ensure this feature is used.

thclark commented 3 years ago

I think best not to require it. Some files may well not require metadata; in that event we don't really want people to have to add empty sections to twine.json.
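A sketch of how the optional behaviour might look: validation only runs when a template is present, so twines without one keep working unchanged. The `tag_template` key and the `check_tags` helper are hypothetical names chosen for this example only.

```python
def check_tags(twine, tags):
    """Validate tag names only when the twine declares a tag template."""
    template = twine.get("tag_template")  # hypothetical key; absent means "no template"
    if template is None:
        return True  # nothing to validate against, so any tags are accepted
    return set(tags) <= set(template)


# A twine with no template accepts anything, so nobody has to add an empty section.
print(check_tags({}, {"manufacturer": "vestas"}))
# A twine with a template only accepts the tag names it declares.
print(check_tags({"tag_template": {"manufacturer": "string"}}, {"height": 500}))
```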

cortadocodes commented 3 years ago

What do you think about this naming convention?

An alternative to "tags" could be "custom metadata", which maybe reflects better that some of the tags become labels/keywords while others become attributes of the datafile

thclark commented 3 years ago

Agree re labels rather than keywords, good thought about verb usage (and no longer ambiguous, now that GCS has moved to using "custom metadata" instead of "labels")

Is your suggestion we then retain "tags" as being a superset of labels and custom attributes? Like this:

metadata
    |-> fixed metadata (GCS stuff like content_type)
    |-> custom metadata
          |-> tags
                 |-> labels
                 |-> custom attributes

I'm slightly worried that "attribute" is a word that is meaningful for us, and for Python, but is likely not for an amateur programmer or someone who's come from e.g. MATLAB or C++.

What about using tags as an alternative to custom attributes?

metadata
    |-> fixed metadata (GCS stuff like content_type)
    |-> custom metadata
          |-> labels
          |-> tags

Note: The above are taxonomies, not object hierarchies (because of course tags/attributes would be expanded to live directly in custom GCS metadata)
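To make the note about expansion concrete, here is a sketch of how labels and tags might be flattened into a single custom metadata mapping and recovered again. It assumes that custom metadata values have to be strings (which, as far as I know, is the case for GCS object metadata), so each value is JSON-encoded. The function names and the reserved `labels` key are assumptions for illustration.

```python
import json


def to_custom_metadata(labels, tags):
    """Flatten labels and tags into one flat dict of string values."""
    if "labels" in tags:
        raise ValueError("'labels' is reserved and cannot be used as a tag name")
    # Labels are stored under a single reserved key; each tag gets its own key.
    metadata = {"labels": json.dumps(sorted(labels))}
    for name, value in tags.items():
        metadata[name] = json.dumps(value)
    return metadata


def from_custom_metadata(metadata):
    """Recover (labels, tags) from the flat mapping."""
    decoded = {name: json.loads(value) for name, value in metadata.items()}
    labels = decoded.pop("labels", [])
    return labels, decoded


labels = ["mykeyword1", "mykeyword2"]
tags = {"manufacturer": "vestas", "height": 500, "is_recycled": True}

flat = to_custom_metadata(labels, tags)
print(flat)
print(from_custom_metadata(flat))
```

JSON-encoding every value keeps the types (number, boolean, string) recoverable on the way back, at the cost of string tags appearing with quotes in the stored metadata.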

thclark commented 3 years ago

Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.

cortadocodes commented 3 years ago

Some thoughts in reply:

cortadocodes commented 3 years ago

> Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.

Namespace it in GCS?

thclark commented 3 years ago

> > Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like octue__id rather than id.
>
> Namespace it in GCS?

yeah, not on the Datafile objects our side
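A sketch of that split, with the prefix applied only on the way into storage and stripped on the way out, so Datafile-side code keeps using plain names. `FIXED_FIELDS` and both helper names are hypothetical, and the sketch assumes user tag names never start with the `octue__` prefix.

```python
NAMESPACE = "octue__"
FIXED_FIELDS = ("id", "cluster", "sequence")  # hypothetical list of fixed octue fields


def to_gcs_keys(datafile_metadata):
    """Prefix fixed fields for storage; leave user tags untouched."""
    return {
        (NAMESPACE + key if key in FIXED_FIELDS else key): value
        for key, value in datafile_metadata.items()
    }


def from_gcs_keys(gcs_metadata):
    """Strip the prefix so Datafile-side code sees plain names."""
    return {
        (key[len(NAMESPACE):] if key.startswith(NAMESPACE) else key): value
        for key, value in gcs_metadata.items()
    }


local = {"id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86", "sequence": 0, "manufacturer": "vestas"}
stored = to_gcs_keys(local)
print(stored)
print(from_gcs_keys(stored) == local)
```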

cortadocodes commented 3 years ago

New requirements

Equivalents in django

    Datasets.objects.filter(files__id__equals="dfg")
    Datasets.objects.filter(tags__manufacturer__equals="vestas")
    Datasets.objects.filter(id__in=['one', 'two'], tags__manufacturer__equals="vestas")
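For comparison, here is a minimal pure-Python sketch of how double-underscore lookups like these could be evaluated against plain dicts. The `equals` and `in` operator names follow the filters written in this thread (Django's own spelling for equality is `exact`); `resolve`, `matches` and `filter_items` are hypothetical helpers, not part of any existing API.

```python
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "in": lambda actual, expected: actual in expected,
}


def resolve(value, keys):
    """Yield every value reached by following keys through nested dicts and lists."""
    if isinstance(value, list):
        # Lists are searched element-wise, so "files__id" visits every file.
        for element in value:
            yield from resolve(element, keys)
    elif not keys:
        yield value
    elif isinstance(value, dict) and keys[0] in value:
        yield from resolve(value[keys[0]], keys[1:])


def matches(item, **filters):
    """True if every filter is satisfied by at least one resolved value."""
    for expression, expected in filters.items():
        *keys, operator = expression.split("__")
        compare = OPERATORS[operator]
        if not any(compare(actual, expected) for actual in resolve(item, keys)):
            return False
    return True


def filter_items(items, **filters):
    return [item for item in items if matches(item, **filters)]


datasets = [
    {"id": "one", "tags": {"manufacturer": "vestas"}, "files": [{"id": "dfg"}]},
    {"id": "two", "tags": {"manufacturer": "zestas"}, "files": [{"id": "abc"}]},
    {"id": "three", "tags": {"manufacturer": "vestas"}, "files": []},
]

by_file = filter_items(datasets, files__id__equals="dfg")
by_tag = filter_items(datasets, tags__manufacturer__equals="vestas")
combined = filter_items(datasets, id__in=["one", "two"], tags__manufacturer__equals="vestas")
print([d["id"] for d in by_file], [d["id"] for d in by_tag], [d["id"] for d in combined])
```

Multiple keyword filters are combined with AND, mirroring how the third filter above reads.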


### Example manifest contents:

    {
      "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
      "datasets": [
        {
          "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
          "name": "my meteorological dataset",
          "tags": ["met", "mast", "wind"],
          "files": [
            {
              "path": "input/datasets/7ead7669/file_1.csv",
              "cluster": 0,
              "sequence": 0,
              "extension": "csv",
              "labels": ["mykeyword1", "mykeyword2"],
              "tags": {
                "manufacturer": "vestas",
                "height": 500,
                "is_recycled": true,
                "number_of_blades": 3
              },
              "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
              "name": "file_1.csv"
            },
            {
              "path": "input/datasets/7ead7669/file_1.csv",
              "cluster": 0,
              "sequence": 1,
              "extension": "csv",
              "tags": ["manufacturer:Zestas", "height:350", "is_recycled:true", "number_of_blades:3"],
              "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
              "name": "file_1.csv"
            }
          ]
        }
      ]
    }