mpilhlt / skohub-vocabs

Apache License 2.0

Improve tenant and vocab parsing #2

Closed awagner-mainz closed 2 years ago

awagner-mainz commented 2 years ago

tenant and vocab serve as filters for distinguishing the various vocabularies in our single Elasticsearch index.
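As a rough illustration of how the two values could act as filters, here is a minimal sketch of an Elasticsearch query body. The field names tenant, vocab and prefLabel.en, as well as the function name buildVocabQuery, are assumptions made for this example and are not taken from the repository; it also assumes that tenant and vocab are indexed as keyword fields.

```javascript
// Sketch only: field names ("tenant", "vocab", "prefLabel.en") are assumptions,
// not the actual mapping used in the skohub-vocabs index.
function buildVocabQuery(tenant, vocab, queryText, labelField = "prefLabel.en") {
  return {
    query: {
      bool: {
        // Full-text match on the (assumed) label field.
        must: [{ match: { [labelField]: queryText } }],
        // Exact-match filters restrict hits to one vocabulary of one tenant.
        filter: [{ term: { tenant } }, { term: { vocab } }],
      },
    },
  };
}

console.log(JSON.stringify(buildVocabQuery("rg-mpg-de", "vocabs-polmat", "Policey"), null, 2));
```

Putting the two term clauses into filter rather than must keeps them out of relevance scoring, so only the label match influences the ranking.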

These concepts mirror the way skohub.io hosts vocabularies. Presumably, skohub.io allows a "customer" to have several "vocabularies", which you can navigate by path components. A URL like https://skohub.io/rg-mpg-de/vocabs-polmat/heads/main/index.en.htm has rg-mpg-de as what is called "tenant" here and vocabs-polmat as "vocab". As another example, consider https://skohub.io/dini-ag-kim/hcrt/heads/master/w3id.org/kim/hcrt/scheme.en.html, with dini-ag-kim as tenant and hcrt as vocab. (Possibly they ultimately derive from GitHub account and repository names?)

The idea is that these components should also form the paths of Reconciliation Service endpoints such as http://localhost:3004/de/rg-mpg-de/vocabs-polmat/ or .../reconcile, and later also .../suggest etc.
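A minimal sketch of how such endpoint URLs could be assembled from the components. The helper name reconciliationEndpoints is hypothetical, and reading the first path segment (de in the example above) as a language code is an assumption.

```javascript
// Sketch only: builds endpoint URLs of the shape {base}/{lang}/{tenant}/{vocab}/...
// Treating the first segment as a language code is an assumption.
function reconciliationEndpoints(baseUrl, lang, tenant, vocab) {
  // Encode each component so that unexpected characters cannot break the path.
  const segments = [lang, tenant, vocab].map(encodeURIComponent).join("/");
  // Strip trailing slashes from the base before joining.
  const root = `${baseUrl.replace(/\/+$/, "")}/${segments}/`;
  return {
    root,
    reconcile: `${root}reconcile`,
    suggest: `${root}suggest`,
  };
}

console.log(reconciliationEndpoints("http://localhost:3004", "de", "rg-mpg-de", "vocabs-polmat"));
```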

Using the values as path components seems to rule out extracting them from skos:ConceptScheme/dct:title in the vocabulary itself. Maybe extract them from skos:ConceptScheme/skos:notation instead? (But then we have to make sure these fields are populated!)

Currently, both values are hardcoded:

tenant: https://github.com/rg-mpg-de/skohub-vocabs/blob/e2f2ea8d758db41affeddb80e2661a458ad812af/src/populateReconciliation.js#L15

vocab: parsed as the basename of the filename; the file path to process is hardcoded, too: https://github.com/rg-mpg-de/skohub-vocabs/blob/e2f2ea8d758db41affeddb80e2661a458ad812af/src/populateReconciliation.js#L14
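For readers unfamiliar with basename parsing, the following sketch shows the general technique with Node's path module. It is not a copy of the linked code; the function name vocabFromFile and the example file path are hypothetical.

```javascript
const path = require("path");

// Sketch only: derive a vocab name from a file path by dropping the
// directory part and the file extension. The path below is hypothetical.
function vocabFromFile(filePath) {
  return path.basename(filePath, path.extname(filePath));
}

console.log(vocabFromFile("data/polmat.ttl"));
```

A name derived this way depends entirely on how the file happens to be called, which is part of why it does not necessarily match the vocab component of the skohub.io path.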

We should

awagner-mainz commented 2 years ago

More context: a skohub.io URL path looks like this:

https://skohub.io/rhonda-org/vocabs-polmat/heads/main/w3id.org/rhonda/polmat/scheme.en.html

This translates to:

https://skohub.io/{github repo owner}/{github repo name}/heads/{github repo branch}/{url of the concepts/concept scheme as identified by entity ids in the ttl file}

where the GitHub parts are specified in the webhook and the last part is the complete concept URL (prefix + local identifier) minus the https?:// part...
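The URL pattern above can be taken apart with a small helper, sketched here under stated assumptions: the function name parseSkohubUrl and the returned property names are hypothetical, and the branch is assumed to be a single path segment (a branch name containing a slash would be split incorrectly).

```javascript
// Sketch only: splits a skohub.io URL into the parts named in the pattern above.
// Assumes the branch name contains no "/" characters.
function parseSkohubUrl(url) {
  const { hostname, pathname } = new URL(url);
  const [owner, repo, refType, branch, ...conceptParts] = pathname
    .split("/")
    .filter(Boolean);
  if (hostname !== "skohub.io" || refType !== "heads" || !branch) {
    throw new Error(`Not a recognised skohub.io vocabulary URL: ${url}`);
  }
  return {
    tenant: owner, // GitHub repo owner
    vocab: repo, // GitHub repo name
    branch, // GitHub repo branch
    conceptPath: conceptParts.join("/"), // concept URL minus the https?:// part
  };
}

console.log(
  parseSkohubUrl(
    "https://skohub.io/rhonda-org/vocabs-polmat/heads/main/w3id.org/rhonda/polmat/scheme.en.html"
  )
);
```

This only works in the direction from a published URL to its parts; it does not help when starting from the ttl file alone.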

The problem is that the GitHub parts are not present in the ttl file.

awagner-mainz commented 2 years ago

Is this a possible approach?:

awagner-mainz commented 2 years ago

Fixed by 7a60623.