khusmann / frictionless-categorical-examples

A repo for holding example frictionless datapackages to aid in our development of categorical types.
https://datapackage.org/
Apache License 2.0

Remarks on 'Table-based categoricals' #2

Open djvanderlaan opened 3 weeks ago

djvanderlaan commented 3 weeks ago

https://github.com/khusmann/frictionless-categorical-examples/tree/main/table-based-categoricals#some-theoretical-questions

I believe the default for 'valueField' should be 'value' as it is in the current categories specification.

Points 5 and 6

I also ran into this. I already had an implementation based on an earlier proposal, in which the fields used for the categories were "code" and "label", while the current implementation uses "value" and "label". When changing the code lists to use "value" instead of "code", I had to track down every field referencing these code lists and change the "valueLabel" field there. It would have been easier and more self-contained if the data resource for the code lists itself indicated which field contains the values and which the labels. However, this would mean extending not just the 'categories' property of the field descriptor but also the definition of Data Resource. Personally, I am in favour of this.

Additional point: referencing external data packages

It would be nice if we could not just reference a Data Resource in the current Data Package but also a Data Resource in another Data Package. So:

{
    "resource": <resource name>,
    "datapackage": <datapackageurl>,
    "valueField": <field name>,
    "labelField"?: <label field name>
}

This would allow referencing central catalogues of category lists. The disadvantage is that this makes the data package less self-contained and might lead to problems with link rot.
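
A minimal sketch of how a consumer might resolve such a reference, written in Python purely for illustration. The function name `resolve_category_reference`, the `current_package` argument, and the example URL are hypothetical; treating "datapackage" as optional (falling back to the current package) is an assumption based on the description above, not part of any specification.

```python
from typing import Optional, Tuple


def resolve_category_reference(
    ref: dict, current_package: str
) -> Tuple[str, str, str, Optional[str]]:
    """Return (package location, resource name, value field, label field)."""
    # Assumption: when "datapackage" is absent, the reference points into
    # the current package.
    package = ref.get("datapackage", current_package)
    # "resource" and "valueField" are required in the proposed object.
    resource = ref["resource"]
    value_field = ref["valueField"]
    # "labelField" is optional; None means no label field was named.
    label_field = ref.get("labelField")
    return package, resource, value_field, label_field


# Hypothetical reference into a central catalogue (placeholder URL).
external = {
    "resource": "fruittypes",
    "datapackage": "https://example.org/catalogue/datapackage.json",
    "valueField": "code",
    "labelField": "label",
}
# Hypothetical reference to a resource in the current package.
local = {"resource": "fruittypes", "valueField": "code"}
```

Note that the external case needs a network fetch at read time, which is where the link-rot concern would show up in practice.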

khusmann commented 3 weeks ago

Thanks for these excellent points! I've updated the table-based category examples to the new pattern we discussed in the meeting today, that is, a resource with the following additional properties:

{
  type: "category-table",
  categoriesOrdered?: boolean (default = undefined),
  valueField?: string (default = "value"),
  labelField?: string (default = "label")
}

Then in the categorical field being defined,

{
  categories: {
    package?: string,
    resource: string
  }
}

(Where optional property package uses the same rules as external foreign keys: https://datapackage.org/recipes/external-foreign-keys/)
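
As a rough illustration of how the defaults above could be applied when reading a category table, here is a small Python sketch. The function `categories_from_table` and the sample `fruittypes` data are hypothetical, and the rows are assumed to be already parsed into dictionaries keyed by field name.

```python
def categories_from_table(resource: dict, rows: list) -> list:
    """Build {"value", "label"} pairs from the rows of a category-table resource."""
    if resource.get("type") != "category-table":
        raise ValueError("resource is not a category-table")
    # Apply the proposed defaults when the resource does not override them.
    value_field = resource.get("valueField", "value")
    label_field = resource.get("labelField", "label")
    return [
        # A missing label column yields None rather than an error.
        {"value": row[value_field], "label": row.get(label_field)}
        for row in rows
    ]


# Hypothetical resource that overrides only valueField.
fruit_resource = {"name": "fruittypes", "type": "category-table", "valueField": "code"}
fruit_rows = [{"code": 1, "label": "apple"}, {"code": 2, "label": "pear"}]
categories = categories_from_table(fruit_resource, fruit_rows)
```

Because the resource itself names its value and label fields, renaming a column in the table only requires changing the resource descriptor, not every field that references it.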

khusmann commented 3 weeks ago

For references within a package, do we want to simplify the categories field definition to just take a string? Then we could have fields like this:

{
  "name": "fieldname",
  "type": "string",
  "categories": "category-table-resource-name"
}

An object would only be necessary when we wanted to specify a category-table in an external package...
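
If the string shorthand were adopted, an implementation could expand it to the object form before doing anything else. A minimal Python sketch, with the hypothetical helper name `normalize_categories_ref`:

```python
def normalize_categories_ref(categories):
    """Expand the string shorthand to the object form {"resource": ...}."""
    if isinstance(categories, str):
        # Shorthand: a bare resource name in the current package.
        return {"resource": categories}
    if isinstance(categories, dict) and "resource" in categories:
        # Already the object form; copy so the caller's dict is untouched.
        return dict(categories)
    raise TypeError("not a category-table reference")


short_form = normalize_categories_ref("category-table-resource-name")
long_form = normalize_categories_ref(
    {"package": "other-package", "resource": "category-table-resource-name"}
)
```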

djvanderlaan commented 4 days ago

That would be shorter, and most of the time the resource will be in the same package as the dataset; however, the categories property might be getting a bit overloaded. It could then be any of:

  1. An array of strings: "categories": ["apple", "pear","orange"]
  2. An array of objects: "categories": [ {"value": 1, "label":"apple"}, ...]
  3. An object: "categories": {"resource": "fruittypes"}
  4. A string: "categories": "fruittypes"

And as someone working in R: it is tricky to distinguish options 1 and 4. I would personally prefer sticking with just option 3 (besides 1 and 2); option 4 saves us just 14 characters and adds to the complexity.
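
To make the overloading concrete, here is a Python sketch of the dispatch a reader would need for all four forms. The function name `classify_categories` and the returned labels are hypothetical. In Python the parsed JSON types keep options 1 and 4 apart; in R, which has no scalar string type separate from a length-one character vector, a one-element array and a bare string can end up looking the same after parsing, depending on the JSON parser settings.

```python
import json


def classify_categories(categories) -> str:
    """Name which of the four forms a parsed "categories" value takes."""
    if isinstance(categories, str):
        return "option 4: resource name"
    if isinstance(categories, dict):
        return "option 3: resource reference"
    if isinstance(categories, list):
        # A non-empty list made up only of objects is the value/label form.
        if categories and all(isinstance(c, dict) for c in categories):
            return "option 2: value/label objects"
        return "option 1: plain values"
    raise TypeError("unsupported categories value")


one_value = classify_categories(json.loads('["apple"]'))
bare_name = classify_categories(json.loads('"fruittypes"'))
```

Dropping option 4 would remove the first branch and leave only list-versus-object dispatch.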