transport-data / tdc-data-portal

https://tdc-data-portal.vercel.app

Design harmonised Metadata schema for the portal #15

Open Uchechukwu-Onye-Igbo opened 2 weeks ago

Uchechukwu-Onye-Igbo commented 2 weeks ago

When creating datasets, organisations and groups, I want to have a standardized metadata schema, so that I can have data consistency across the portal and make datasets easier to understand and use.

Tasklist

Acceptance Criteria

demenech commented 2 weeks ago

Schema: https://docs.google.com/spreadsheets/d/1G_Eq4T5rdolKq3uaFczbLAQfSwFzwwX9XVhjflll5gk/edit?gid=0#gid=0

Fields in green = seem straightforward
Fields in yellow = have to be discussed

Next steps

Notes

CC: @Uchechukwu-Onye-Igbo @anuveyatsu @osahon-okungbowa

osahon-okungbowa commented 1 week ago

@luccasmmg @Gutts-n please review this Spreadsheet

Schema: https://docs.google.com/spreadsheets/d/1G_Eq4T5rdolKq3uaFczbLAQfSwFzwwX9XVhjflll5gk/edit?gid=0#gid=0

Fields in green = seem straightforward
Fields in yellow = have to be discussed

demenech commented 1 week ago

Questions to the client (raised during sync between me and Anuar):

@Uchechukwu-Onye-Igbo please let's add these as discussion points for our next meeting.

khaeru commented 4 days ago

Hi all—some comments and extra information/requirements here from the TDCI side. I number the points for easier reference:

  1. TDCI has decided to use SDMX for data and metadata exchange as far as possible. If the Datopian team is not yet familiar with SDMX, or only partly familiar, please give some indication of your level of knowledge, and I can point to learning resources.
    • As you may or will know, SDMX standards per se provide a general information model (IM) but not specific instructions on how to use it.
    • For instance, the IM gives the notion of a Metadata Structure Definition containing multiple (hierarchical) Metadata Attributes; and then a Metadata Set containing multiple Metadata Reports, each with Reported Attribute values corresponding to the defined structure.
    • But the standards do not say what the metadata attributes should be; that is application-specific.
    • Likewise they give the notion of a Data Structure Definition with certain identified dimensions associated with concepts and code lists, but do not say what dimensions any particular DSD should have.
  2. This is why we have looked to Eurostat as an example of best practice. I am glad the first two links in the description above have reached you. At https://ec.europa.eu/eurostat/cache/metadata/en/avia_if_esms.htm, in particular:
    • Use the "Download" link to retrieve a ZIP file that contains both SDMX-ML documents for the metadata structure (ESMS_MSD.msd.xml) and one concrete metadata report (avia_if_esms.sdmx.xml), the same one displayed as HTML.
    • The metadata structure includes a scheme defining the concepts like, for example, REL_POL_US_AC, and showing that this should be given as a sub-attribute to REL_POLICY. The metadata report, again, matches this structure.
  3. We anticipate that the detail and sophistication of metadata handled by TDCI will evolve over time:
    • In this early phase, we will use a reduced/simplified but analogous metadata structure, informed by the Eurostat/ESMS example. This is what is being developed at transport-data/tools#21.
    • However, the set of metadata will grow and become more carefully resolved as needs dictate. Over time, TDCI's metadata structure(s) may more closely align with ESMS, have an overlapping set of attributes, and have a similar degree of detail.
  4. Based on this anticipated usage, it is important to have or be prepared for interoperability between the CKAN instance and an evolving TDCI SDMX metadata structure. I do not have experience with CKAN, but for example I guess:
    • Metadata attached to particular CKAN records/describing files should map 1:1, and with minimal transformation, to SDMX-structured metadata. So for instance your attached Google spreadsheet ("TDC Metadata Schema") has in cell B12 "API field name: geographies".
      • If this ID, "geographies", is hard-coded in CKAN, then it is important to identify that this is the same as, or distinct from, urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=ESTAT:SDMX_CDC(3.0).REF_AREA in the XML file mentioned above.
      • If the ID "geographies" is not hard-coded in CKAN, then possibly using REF_AREA or a better choice from existing SDMX applications may be easier.
    • This will ease flowing SDMX (meta)data into CKAN, and also transforming user-generated content/records on the CKAN instance into SDMX for further use.
    • If it is possible to “round-trip” metadata structure information (the CKAN "metadata schema/fields" to an SDMX metadata structure definition and attributes, or vice versa) then we can ensure the two remain closely aligned.
    • If CKAN has particular limitations (e.g. if adding, removing, or altering metadata "fields" is costly or not possible), then it would be good to know these in advance and have a strategy that allows the CKAN schema to keep up with the TDCI SDMX metadata structure.
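
To make the "geographies" versus REF_AREA question in point 4 concrete, here is a minimal sketch of how a CKAN field name could be tied to the SDMX concept URN quoted above and compared by its parts. The `ConceptRef` type, the `parse_concept_urn` helper, and the `CKAN_TO_SDMX` table are hypothetical names introduced only for illustration; they are not part of CKAN or of any SDMX library, and the regular expression covers only the concept URN form shown in this thread.

```python
import re
from typing import NamedTuple


class ConceptRef(NamedTuple):
    """Parts of an SDMX concept URN (hypothetical helper type)."""
    agency: str
    scheme: str
    version: str
    concept: str


# Matches only the form quoted in this thread:
# urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept=AGENCY:SCHEME(VERSION).CONCEPT
_CONCEPT_URN = re.compile(
    r"^urn:sdmx:org\.sdmx\.infomodel\.conceptscheme\.Concept="
    r"(?P<agency>[^:]+):(?P<scheme>[^(]+)\((?P<version>[^)]+)\)\.(?P<concept>.+)$"
)


def parse_concept_urn(urn: str) -> ConceptRef:
    """Split a concept URN into agency, scheme, version, and concept ID."""
    match = _CONCEPT_URN.match(urn)
    if match is None:
        raise ValueError(f"not an SDMX concept URN: {urn!r}")
    return ConceptRef(**match.groupdict())


# Assumption: the CKAN field "geographies" is declared equivalent to REF_AREA.
CKAN_TO_SDMX = {
    "geographies": (
        "urn:sdmx:org.sdmx.infomodel.conceptscheme.Concept="
        "ESTAT:SDMX_CDC(3.0).REF_AREA"
    ),
}

ref = parse_concept_urn(CKAN_TO_SDMX["geographies"])
print(ref.agency, ref.scheme, ref.version, ref.concept)
```

Recording the full URN rather than only the short ID keeps the maintaining agency and the concept scheme version explicit, which matters if the TDCI structure later points at a different scheme.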

That is probably a lot to digest, so I will leave it there for now. Happy to respond to clarifying questions.

khaeru commented 4 days ago
  • Sector, service, and mode - What are the possible values for each of these fields?

@demenech for SERVICE and MODE see https://github.com/transportenergy/database/blob/main/item/structure/base.py. These are partial lists and should be eventually moved to/maintained in the transport-data/tools repo. Others include for example the list used by ITF-OECD:

>>> import sdmx
>>> message = sdmx.Client("OECD").get("codelist", "CL_TRANSPORT_MODE")
>>> codelist = message.codelist["CL_TRANSPORT_MODE"]
>>> print("\n".join(map(repr, sorted(codelist.items.values()))))
<Code AIR: Air>
<Code COASTAL: Coastal shipping>
<Code HIRING: For hire and reward>
<Code IWW: Waterways>
<Code MAR: Maritime>
<Code MOTORWAYS: Motorways>
<Code OWN: On own account>
<Code PIPE: Pipeline>
<Code RAIL: Rail>
<Code ROAD: Road>
<Code TOT_INL: Inland>
<Code _T: Total>
<Code _Z: Not applicable>

demenech commented 4 days ago

Hi @khaeru

Thank you very much for this very detailed response.

Interoperability

In terms of interoperability, we are thinking of doing something similar to what ckanext-dcat does with DCAT-AP, but for SDMX following the definition you provided.

ckanext-dcat maps CKAN fields to DCAT fields following this table: https://github.com/ckan/ckanext-dcat/blob/master/docs/mapping.md. For example, it specifies that the CKAN tags field is associated with the DCAT-AP dcat:keyword property.

Then, ckanext-dcat allows for interoperability by providing a /data.json endpoint with the CKAN dataset metadata converted to DCAT-AP.

For TDC, we would create a ckanext-sdmx extension that behaves in a similar way.

We could use the same metadata schema YAML file that we are using now (https://github.com/transport-data/tdc-data-portal/blob/main/src/ckanext-tdc/ckanext/tdc/schemas/dataset.yaml) to add information on how the conversion to SDMX should behave (e.g. CKAN geographies -> REF_AREA), and ckanext-sdmx would read that and reflect the specification in an endpoint used for SDMX exports.
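
As a rough illustration of the idea above, the sketch below annotates schema fields with a target SDMX concept and uses that annotation to convert a dataset record. It assumes the schema follows the ckanext-scheming layout (`dataset_fields` entries with a `field_name`); the `sdmx_concept` key and the `to_sdmx_attributes` function are hypothetical, and a plain dict stands in for the parsed YAML file. Only the geographies -> REF_AREA pairing comes from this thread.

```python
from typing import Any, Dict

# Stand-in for the parsed dataset.yaml. "sdmx_concept" is a hypothetical
# annotation key, not an existing ckanext-scheming option.
SCHEMA: Dict[str, Any] = {
    "dataset_fields": [
        {"field_name": "title"},  # no SDMX mapping declared yet
        {"field_name": "geographies", "sdmx_concept": "REF_AREA"},
    ]
}


def to_sdmx_attributes(dataset: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return the dataset values keyed by SDMX concept ID.

    Fields without an "sdmx_concept" annotation, or absent from the
    dataset, are skipped.
    """
    attributes: Dict[str, Any] = {}
    for field in schema["dataset_fields"]:
        concept = field.get("sdmx_concept")
        name = field["field_name"]
        if concept is not None and name in dataset:
            attributes[concept] = dataset[name]
    return attributes


# Example record; the values are made up for illustration.
dataset = {"title": "Example dataset", "geographies": ["KE", "NG"]}
exported = to_sdmx_attributes(dataset, SCHEMA)
print(exported)
```

Keeping the mapping next to each field definition means a change to the SDMX side is a schema edit rather than a code change, which is the same pattern the ckanext-dcat mapping table documents for DCAT.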

We believe this approach is very flexible for future extension and aligned with your point:

If it is possible to “round-trip” metadata structure information (the CKAN "metadata schema/fields" to an SDMX metadata structure definition and attributes, or vice versa) then we can ensure the two remain closely aligned.

khaeru commented 4 days ago

@demenech I think that sounds like a great approach and a good way to build on your prior work.

The key distinction, as far as I can see:

So this means one can't map CKAN fields to "SDMX" in general, but only to one particular SDMX metadata structure. For now this can be a fixed TDCI metadata structure, but I would encourage future-proofing by allowing it to someday be a different/evolved TDCI-defined structure, or one defined by someone else.
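
One way to picture this is to key the field mapping by the target metadata structure, so that a later or third-party structure is simply another entry. The following is only a sketch under that assumption: the structure IDs `TDCI_MSD_V1` and `TDCI_MSD_V2`, the attribute ID `GEO`, and all function names are made up for illustration, and only the geographies -> REF_AREA pairing comes from this thread.

```python
from typing import Any, Dict

# Hypothetical registry: one CKAN-field -> attribute mapping per structure.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "TDCI_MSD_V1": {"geographies": "REF_AREA"},
    "TDCI_MSD_V2": {"geographies": "GEO"},  # an imagined later structure
}


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Invert a mapping, refusing mappings that are not 1:1."""
    inverse: Dict[str, str] = {}
    for ckan_field, attribute in mapping.items():
        if attribute in inverse:
            raise ValueError(f"attribute {attribute!r} is mapped twice")
        inverse[attribute] = ckan_field
    return inverse


def export_report(dataset: Dict[str, Any], structure_id: str) -> Dict[str, Any]:
    """Convert CKAN metadata to attributes of the chosen structure."""
    mapping = MAPPINGS[structure_id]
    return {mapping[k]: v for k, v in dataset.items() if k in mapping}


def import_report(report: Dict[str, Any], structure_id: str) -> Dict[str, Any]:
    """Convert attributes of the chosen structure back to CKAN fields."""
    inverse = invert(MAPPINGS[structure_id])
    return {inverse[k]: v for k, v in report.items() if k in inverse}


dataset = {"geographies": ["KE", "NG"]}
report = export_report(dataset, "TDCI_MSD_V1")
round_tripped = import_report(report, "TDCI_MSD_V1")
print(report, round_tripped)
```

The 1:1 check in `invert` is what makes a round trip possible; a mapping where two CKAN fields feed one attribute could still be exported but not imported back without loss.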

demenech commented 4 days ago

@khaeru thanks, it's clear. I was referring to mapping to this application-specific SDMX metadata structure. And definitely, we'll make sure to keep it flexible and future-proof, with supporting technical documentation as well.

Also, one other thing I'm wondering, do you have any sort of automatic validator for the metadata structure of the Eurostat/ESMS example? Perhaps a Python script or something like that?

khaeru commented 4 days ago

Not exactly, but I will try to whip up an example this morning.

EDIT: Here https://gist.github.com/khaeru/1d386e4c35d561e2bf7dfd18249071f3

demenech commented 3 days ago

Great stuff @khaeru. Thank you very much!