Open cortadocodes opened 3 years ago
@thclark do we want to require that tag templates are provided in `twine.json` or make it optional? It won't be backwards compatible if we require it, but it will ensure this feature is used.
I think best not to require it. Some files may well not require metadata; in that event we don't really want people to have to add empty sections to twine.json.
What do you think about this naming convention?
- Tags - any key-value pair or keyword added to a file, of which:
  - Labels - keyword tags become labels (I prefer "label" over "keyword" because it's a verb as well as a noun - you can label a file but you can't keyword a file)
  - Custom attributes - key-value tags become custom attributes

An alternative to "tags" could be "custom metadata", which maybe reflects better that some of the tags become labels/keywords while others become attributes of the datafile.
Agree re labels rather than keywords, good thought about verb usage (and no longer ambiguous, now that GCS has moved to using "custom metadata" instead of "labels")
Is your suggestion we then retain "tags" as being a superset of labels and custom attributes? Like this:
metadata
|-> fixed metadata (GCS stuff like content_type)
|-> custom metadata
    |-> tags
        |-> labels
        |-> custom attributes
I'm slightly worried that "attribute" is a word that is meaningful for us, and for python, but is likely not for an amateur programmer or someone who's come from e.g. MATLAB or C++.
What about using tags as an alternative to custom attributes?
metadata
|-> fixed metadata (GCS stuff like content_type)
|-> custom metadata
    |-> labels
    |-> tags
Note: The above are taxonomies, not object hierarchies (because of course tags/attributes would be expanded to live directly in custom GCS metadata)
Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like `octue__id` rather than `id`.
Some thoughts in reply:
- ... `custom metadata` because it would separately refer to GCS custom metadata and the superset {labels, tags}
- ... `tags` for key-value pairs, then I think we should rename the `Tag` class to `Label` or we'd have two different meanings for `tag` (and also `taggable` to `labellable` or something)
- ... `tag` really captures the key-value nature of the required tags? I suppose it's because my main language is python, but `attribute` implies to me a key-value nature i.e. a name and a value

> Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like `octue__id` rather than `id`.

Namespace it in GCS?
> > Also, I'm wondering whether it's sensible to namespace fixed octue metadata. Like `octue__id` rather than `id`.
>
> Namespace it in GCS?

Yeah, not on the Datafile objects our side.
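To make the "namespace it in GCS, not on the Datafile" idea concrete, here is a minimal sketch. It assumes fixed octue metadata gets an `octue__` prefix only when written to GCS custom metadata, that labels are stored under a namespaced key, and that values are JSON-encoded because GCS custom metadata values are strings. The function names and the storage layout are hypothetical, not the octue SDK's implementation.

```python
import json

OCTUE_PREFIX = "octue__"  # hypothetical namespace for fixed octue metadata


def to_gcs_metadata(fixed, labels, tags):
    """Flatten fixed metadata, labels and tags into one flat dict of strings."""
    metadata = {f"{OCTUE_PREFIX}{key}": json.dumps(value) for key, value in fixed.items()}
    # Assumption: labels are kept as one JSON-encoded list under a namespaced key.
    metadata[f"{OCTUE_PREFIX}labels"] = json.dumps(sorted(labels))
    for key, value in tags.items():
        if key.startswith(OCTUE_PREFIX):
            raise ValueError(f"Tag name {key!r} clashes with the reserved namespace.")
        # Tags are expanded to live directly in the custom metadata.
        metadata[key] = json.dumps(value)
    return metadata


def from_gcs_metadata(metadata):
    """Split flat GCS custom metadata back into (fixed, labels, tags)."""
    fixed, tags = {}, {}
    for key, value in metadata.items():
        if key.startswith(OCTUE_PREFIX):
            # Strip the namespace so the Datafile side still sees plain `id` etc.
            fixed[key[len(OCTUE_PREFIX):]] = json.loads(value)
        else:
            tags[key] = json.loads(value)
    labels = fixed.pop("labels", [])
    return fixed, labels, tags
```

With this layout a user-defined tag called `id` cannot collide with the fixed `id`, while code working with Datafile objects never sees the prefix.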
- `$defs.tags_template.properties.properties` are not of type array or object so that they are flat
- `Datafile(path="here", id="123", tags={**stuff_from_somewhere}, labels=['one','two','three'])`
- NB can reuse the instantiations from json src like for input_values etc (reduced number of ways to instantiate)
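The flatness requirement on the tag template could be checked along these lines. This is a sketch only: the function name and the example template are made up for illustration, and it relies on the JSON Schema rule that `type` may be a single string or a list of strings.

```python
NESTED_TYPES = {"array", "object"}


def find_nested_properties(tags_template):
    """Return the names of template properties that allow array or object values."""
    nested = []
    for name, subschema in tags_template.get("properties", {}).items():
        declared = subschema.get("type")
        # JSON Schema allows `type` to be one string or a list of strings.
        types = set(declared) if isinstance(declared, list) else {declared}
        if types & NESTED_TYPES:
            nested.append(name)
    return nested


# Hypothetical template: `blades` breaks the flatness rule.
template = {
    "type": "object",
    "properties": {
        "manufacturer": {"type": "string"},
        "height": {"type": "number"},
        "blades": {"type": "array"},
    },
}
nested = find_nested_properties(template)
```

An empty result means every declared tag is a scalar and can be written straight into flat key-value metadata.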
# Example filtering syntax
dataset.files.filter(tags__manufacturer__equals="vestas")
dataset.files.filter(labels__contains="mykeyword1")
Datasets.objects.filter(files__id__equals="dfg")
Datasets.objects.filter(tags__manufacturer__equals="vestas")
Datasets.objects.filter(id__in=['one', 'two'], tags__manufacturer__equals="vestas")
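For illustration, the double-underscore syntax above could be resolved roughly as follows. This is a minimal sketch, not the octue implementation: `filter_items`, `matches` and the `OPERATIONS` table are hypothetical names, files are represented as plain dicts, and only the three operations used in the examples are supported.

```python
OPERATIONS = {
    "equals": lambda value, target: value == target,
    "contains": lambda value, target: target in value,
    "in": lambda value, target: value in target,
}


def _lookup(obj, path):
    """Follow a list of names through dict keys or object attributes."""
    for name in path:
        obj = obj[name] if isinstance(obj, dict) else getattr(obj, name)
    return obj


def matches(item, expression, target):
    """Evaluate one filter such as tags__manufacturer__equals="vestas"."""
    # Everything before the last "__" is the path; the last part is the operation.
    *path, operation = expression.split("__")
    try:
        value = _lookup(item, path)
    except (KeyError, AttributeError):
        return False  # an item without that tag or attribute never matches
    return OPERATIONS[operation](value, target)


def filter_items(items, **filters):
    """Keep the items that satisfy every keyword filter."""
    return [
        item for item in items
        if all(matches(item, expression, target) for expression, target in filters.items())
    ]


files = [
    {"id": "one", "labels": ["mykeyword1"], "tags": {"manufacturer": "vestas"}},
    {"id": "two", "labels": [], "tags": {"manufacturer": "Zestas"}},
]
vestas_files = filter_items(files, tags__manufacturer__equals="vestas")
```

A dataset-level filter such as `files__id__equals` would additionally need to traverse the list of files, which this sketch does not attempt.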
### Example manifest contents:
{
  "id": "8ead7669-8162-4f64-8cd5-4abe92509e17",
  "datasets": [
    {
      "id": "7ead7669-8162-4f64-8cd5-4abe92509e17",
      "name": "my meteorological dataset",
      "tags": ["met", "mast", "wind"],
      "files": [
        {
          "path": "input/datasets/7ead7669/file_1.csv",
          "cluster": 0,
          "sequence": 0,
          "extension": "csv",
          "labels": ["mykeyword1", "mykeyword2"],
          "tags": {
            "manufacturer": "vestas",
            "height": 500,
            "is_recycled": true,
            "number_of_blades": 3
          },
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
          "name": "file_1.csv"
        },
        {
          "path": "input/datasets/7ead7669/file_1.csv",
          "cluster": 0,
          "sequence": 1,
          "extension": "csv",
          "tags": ["manufacturer:Zestas", "height:350", "is_recycled:true", "number_of_blades:3"],
          "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
          "name": "file_1.csv"
        }
      ]
    }
  ]
}
We currently have tags that can have any number of subtags. We are going to move to:
- ... `Taggable` instance

The tags should conform to a tag schema/template that includes:
We may need a custom JSON parser for this.
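One possible piece of that parsing, sketched under assumptions: legacy tags are strings, colon-delimited ones such as `"manufacturer:Zestas"` become key-value tags, and bare ones such as `"met"` become labels. Values are decoded as JSON where possible so that `"350"` becomes a number and `"true"` a boolean. The function name and the splitting rule are hypothetical, not an agreed migration path.

```python
import json


def split_legacy_tags(legacy_tags):
    """Split legacy string tags into (labels, key-value tags)."""
    labels, tags = [], {}
    for entry in legacy_tags:
        key, separator, raw_value = entry.partition(":")
        if not separator:
            # No colon: treat the entry as a plain label.
            labels.append(entry)
            continue
        try:
            tags[key] = json.loads(raw_value)  # numbers, booleans, null
        except json.JSONDecodeError:
            tags[key] = raw_value  # anything else stays a string
    return labels, tags


labels, tags = split_legacy_tags(
    ["met", "manufacturer:Zestas", "height:350", "is_recycled:true"]
)
```

Only the first colon splits the entry, so a deeper subtag like `"a:b:c"` ends up as the key `a` with the string value `"b:c"`, which keeps the result flat.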
To do:
- ... `filter` field from manifest files
- `additionalProperties` needs to be `true` for tags in the schema to allow extra metadata on files
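The effect of `additionalProperties` can be illustrated with a small hand-rolled check. This is a sketch of the idea rather than a real JSON Schema validator: `validate_tags` and the example template are hypothetical, and only scalar types are handled.

```python
JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def validate_tags(tags, template):
    """Return a list of error messages for tags that do not fit the template."""
    errors = []
    properties = template.get("properties", {})
    for name, value in tags.items():
        if name not in properties:
            # In JSON Schema, extra properties are allowed unless this is false.
            if not template.get("additionalProperties", True):
                errors.append(f"unexpected tag {name!r}")
            continue
        type_name = properties[name]["type"]
        # bool is a subclass of int in Python, so exclude it for numeric types.
        is_valid = isinstance(value, JSON_TYPES[type_name]) and (
            type_name == "boolean" or not isinstance(value, bool)
        )
        if not is_valid:
            errors.append(f"tag {name!r} should be of type {type_name}")
    return errors


template = {
    "properties": {"manufacturer": {"type": "string"}, "height": {"type": "number"}},
    "additionalProperties": True,
}
tags = {"manufacturer": "vestas", "height": 500, "is_recycled": True}
errors = validate_tags(tags, template)
```

With `additionalProperties` set to `true`, the undeclared `is_recycled` tag passes; switching it to `false` would report it as unexpected.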