Open Uchechukwu-Onye-Igbo opened 2 weeks ago
Schema: https://docs.google.com/spreadsheets/d/1G_Eq4T5rdolKq3uaFczbLAQfSwFzwwX9XVhjflll5gk/edit?gid=0#gid=0
Fields in green = seem straightforward
Fields in yellow = have to be discussed
CC: @Uchechukwu-Onye-Igbo @anuveyatsu @osahon-okungbowa
@luccasmmg @Gutts-n please review this Spreadsheet
Questions to the client (raised during sync between me and Anuar):
@Uchechukwu-Onye-Igbo let's please add these as discussion points for our next meeting.
Hi all, some comments and extra information/requirements here from the TDCI side. I have numbered the points for easier reference:
ESMS_MSD.msd.xml) and one concrete metadata report (avia_if_esms.sdmx.xml), the same one displayed as HTML.
That is probably a lot to digest, so I will leave it there for now. Happy to respond to clarifying questions.
- Sector, service, and mode - What are the possible values for each of these fields?
@demenech for SERVICE and MODE see https://github.com/transportenergy/database/blob/main/item/structure/base.py. These are partial lists and should eventually be moved to and maintained in the transport-data/tools repo. Other lists include, for example, the one used by ITF-OECD:
>>> import sdmx
>>> message = sdmx.Client("OECD").get("codelist", "CL_TRANSPORT_MODE")
>>> codelist = message.codelist["CL_TRANSPORT_MODE"]
>>> print("\n".join(map(repr, sorted(codelist.items.values()))))
<Code AIR: Air>
<Code COASTAL: Coastal shipping>
<Code HIRING: For hire and reward>
<Code IWW: Waterways>
<Code MAR: Maritime>
<Code MOTORWAYS: Motorways>
<Code OWN: On own account>
<Code PIPE: Pipeline>
<Code RAIL: Rail>
<Code ROAD: Road>
<Code TOT_INL: Inland>
<Code _T: Total>
<Code _Z: Not applicable>
Hi @khaeru
Thank you very much for this very detailed response.
In terms of interoperability, we are thinking of doing something similar to what ckanext-dcat does with DCAT-AP, but for SDMX following the definition you provided.
ckanext-dcat maps CKAN fields to DCAT fields following this table https://github.com/ckan/ckanext-dcat/blob/master/docs/mapping.md. For example, it specifies that the CKAN tags field is associated with the DCAT-AP dcat:keyword field.
Then, ckanext-dcat allows for interoperability by providing a /data.json endpoint with the CKAN datasets metadata converted to DCAT-AP.
For TDC, we would create a ckanext-sdmx extension that behaves in a similar way.
We could use the same metadata schema YAML file that we are using now (https://github.com/transport-data/tdc-data-portal/blob/main/src/ckanext-tdc/ckanext/tdc/schemas/dataset.yaml) to add information on how the conversion to SDMX should behave (e.g. CKAN geographies -> REF_AREA), and ckanext-sdmx would read that and reflect the specification in an endpoint used for SDMX exports.
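To make the idea concrete, here is a rough sketch of how such an annotation could be read and applied. It is only an illustration under assumptions: the `sdmx_concept` key, the `modes` field, and both helper functions are hypothetical and not part of ckanext-scheming or the current TDC schema. The schema is written as an already-parsed dict to keep the example self-contained.

```python
# Parsed form of a scheming-style dataset schema. In practice this would be
# loaded from dataset.yaml. "sdmx_concept" is an ASSUMED annotation key and
# "modes" is a hypothetical field, both used only for illustration.
SCHEMA = {
    "dataset_fields": [
        {"field_name": "title"},  # no annotation: not exported
        {"field_name": "geographies", "sdmx_concept": "REF_AREA"},
        {"field_name": "modes", "sdmx_concept": "MODE"},
    ]
}


def field_to_concept(schema: dict) -> dict:
    """Return {CKAN field name: SDMX concept ID} for annotated fields only."""
    return {
        field["field_name"]: field["sdmx_concept"]
        for field in schema["dataset_fields"]
        if "sdmx_concept" in field
    }


def to_sdmx_attributes(dataset: dict, schema: dict) -> dict:
    """Re-key a CKAN dataset dict by SDMX concept ID.

    Fields without an annotation, or absent from the dataset, are skipped.
    """
    mapping = field_to_concept(schema)
    return {
        concept: dataset[field]
        for field, concept in mapping.items()
        if field in dataset
    }


dataset = {"title": "Example", "geographies": ["KEN", "NGA"], "modes": ["RAIL"]}
attributes = to_sdmx_attributes(dataset, SCHEMA)
print(attributes)
```

Keeping the mapping inside the schema file means a new field and its SDMX target are declared in one place, so the export endpoint would not need code changes when the schema grows.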
We believe this approach is very flexible for future extension and aligned with your point:
If it is possible to “round-trip” metadata structure information (the CKAN "metadata schema/fields" to an SDMX metadata structure definition and attributes, or vice versa) then we can ensure the two remain closely aligned.
@demenech I think that sounds like a great approach and a good way to build on your prior work.
The key distinction, as far as I can see:
So this means one can't map CKAN fields to "SDMX" in general, only to one particular SDMX metadata structure. For now this can be a fixed TDCI metadata structure, but I'd suggest future-proofing by allowing it to someday be a different or evolved TDCI-defined structure, or one defined by someone else.
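A minimal sketch of what "mapping to one particular structure" could look like in code, using plain Python stand-ins rather than real SDMX objects. All names here (`MetadataStructure`, `FieldMapping`, `TDCI_MSD_V1`, `OTHER_MSD`, the `modes` field) are hypothetical. The point is only that the mapping carries a reference to its target structure, can be checked against it, and can be applied in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataStructure:
    """Stand-in for an SDMX metadata structure: an ID plus its attribute IDs."""

    id: str
    attribute_ids: frozenset


@dataclass
class FieldMapping:
    """CKAN field -> attribute mapping, valid for exactly one structure."""

    structure: MetadataStructure
    ckan_to_attribute: dict

    def unknown_targets(self) -> set:
        """Attribute IDs the mapping uses but the structure does not define."""
        return set(self.ckan_to_attribute.values()) - set(self.structure.attribute_ids)

    def export(self, dataset: dict) -> dict:
        """CKAN dataset dict -> {attribute ID: value} for mapped fields."""
        return {a: dataset[f] for f, a in self.ckan_to_attribute.items() if f in dataset}

    def restore(self, report: dict) -> dict:
        """Inverse of export: {attribute ID: value} -> CKAN field dict."""
        return {f: report[a] for f, a in self.ckan_to_attribute.items() if a in report}


# Two hypothetical structures; the second lacks a MODE attribute.
tdci_v1 = MetadataStructure("TDCI_MSD_V1", frozenset({"REF_AREA", "MODE"}))
other = MetadataStructure("OTHER_MSD", frozenset({"REF_AREA"}))
fields = {"geographies": "REF_AREA", "modes": "MODE"}

mapping = FieldMapping(tdci_v1, fields)
dataset = {"geographies": ["KEN"], "modes": ["RAIL"]}

# Round trip: exporting and restoring returns the mapped fields unchanged.
round_tripped = mapping.restore(mapping.export(dataset))

# The same field mapping is not automatically valid for another structure.
missing_in_other = FieldMapping(other, fields).unknown_targets()
```

Swapping in a later or third-party structure then amounts to constructing a new `FieldMapping` and checking `unknown_targets()` before using it.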
@khaeru thanks, it's clear. I was referring to mapping to this application-specific SDMX metadata structure. And definitely, we'll make sure to keep it flexible and future-proof, with supporting technical documentation as well.
Also, one other thing I'm wondering, do you have any sort of automatic validator for the metadata structure of the Eurostat/ESMS example? Perhaps a Python script or something like that?
Not exactly, but I will try to whip up an example this morning.
EDIT: Here https://gist.github.com/khaeru/1d386e4c35d561e2bf7dfd18249071f3
great stuff @khaeru Thank you very much!
When creating datasets, organisations and groups, I want to have a standardized metadata schema, so that I can have data consistency across the portal and make datasets easier to understand and use.
Tasklist
Acceptance Criteria