silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)
Apache License 2.0

The property P2_has_type should generally point to controlled vocabulary values #35

Open rtroncy opened 4 years ago

rtroncy commented 4 years ago

The following query:

select distinct ?o 
where {
  ?s ecrm:P2_has_type ?o 
}

at http://data.silknow.org/sparql yields very worrying results. This needs to be analyzed for each dataset.

pasqLisena commented 4 years ago
select ?g (count(?s) as ?count) ?o 
where {
  GRAPH ?g { ?s ecrm:P2_has_type ?o }
}
GROUP BY ?g ?o
ORDER BY desc(?count)

Some of them are easy to fix (like "en"@categories => "categories"@en)
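A minimal sketch of that fix (a hypothetical helper, not the converter's actual code), assuming the literal arrives as a string in N-Triples-like notation with the language tag and lexical value swapped:

```python
import re

# Hypothetical helper: repair literals whose language tag and lexical value
# were swapped during conversion, e.g. "en"@categories -> "categories"@en.
SWAPPED = re.compile(r'^"(?P<tag>[a-z]{2,3})"@(?P<value>\S+)$')

def fix_swapped_literal(literal: str) -> str:
    """Return the literal with value and language tag in the right order."""
    m = SWAPPED.match(literal)
    if m:
        return f'"{m.group("value")}"@{m.group("tag")}'
    return literal
```

Correctly formed literals do not match the pattern and pass through unchanged.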

tschleider commented 4 years ago

OK, in general maybe we need another discussion on this. The suggested change from @pasqLisena has been applied, and that was indeed a mistake.

Then there are some plain numbers among the values; these should also be fixed after re-conversion, since there was a typo in the code.

For the rest we need to distinguish. First of all, it seems to be a problem that some of these values contain free text. This comes from the fact that I copied @pasqLisena's implementation of ecrm:E8_Acquisition from IMATEX, which uses ecrm:P2_has_type for the acquisition type; that is how we end up with values like "purchase - Diputació de Barcelona" in there. I think this is also consistent with the mapping files - or at least it is not 100% clear whether we pull the value from the fields or whether it is fixed. So I just extended it to the other datasets. I can work on some other solution.

Lastly, we already try to check terms against the Thesaurus, but there just do not seem to be any matches. Also, this is only implemented for the method "addClassification" and not for "addObservation", "addComplexidentifier" or plain ".setType". I can change that.

rtroncy commented 4 years ago

> For the rest we need to distinguish. First of all, it seems to be a problem that some of these values contain free text. This comes from the fact that I copied @pasqLisena's implementation of ecrm:E8_Acquisition from IMATEX, which uses ecrm:P2_has_type for the acquisition type; that is how we end up with values like "purchase - Diputació de Barcelona" in there. I think this is also consistent with the mapping files - or at least it is not 100% clear whether we pull the value from the fields or whether it is fixed. So I just extended it to the other datasets. I can work on some other solution.

I think we need to look at this on a case-by-case basis. For each field that triggers an instance of ecrm:P2_has_type (and for each museum), what is the value you will encounter, in order to know whether you have to take the value as-is or whether you have to interpret it.

> Lastly, we already try to check terms against the Thesaurus, but there just do not seem to be any matches.

Which thesaurus? SILKNOW? AAT? Of course, we are talking here about creating new, typically small, controlled vocabularies, so I'm not sure I understand your comment here.

tschleider commented 4 years ago

> I think we need to look at this on a case-by-case basis. For each field that triggers an instance of ecrm:P2_has_type (and for each museum), what is the value you will encounter, in order to know whether you have to take the value as-is or whether you have to interpret it.

Now that I understand the problem better, I will go through the mappings again one by one and check exactly whether more interpretation is necessary; maybe a regex is enough to split up the big text fields.

Other than that, I am working on the few other errors/bugs, like the issue with the language tags.

> Which thesaurus? SILKNOW? AAT? Of course, we are talking here about creating new, typically small, controlled vocabularies, so I'm not sure I understand your comment here.

Ok, now it makes sense to me. I just discovered in the code that there is an implementation with String2Vocabulary and the method "addClassification" which, as you say, makes no sense right now, but we can change it into something more useful once we have created the vocabulary.

tschleider commented 4 years ago

This query gives an overview of where we stand.

All these huge strings like "Maria P. James , Norwalk, Connecticut (until d. 1910; bequeathed to MMA)" come from MFA, as the mapping suggests splitting up the "Provenance" field. I tried unsuccessfully to use a regex to extract the type ("Gift") and the direction of the transfer ("P23_transferred_title_from") from this one field, so for the next conversion I can leave that out and the P2_has_type values will look less messy.
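For illustration, here is a sketch of the kind of regex-based extraction that was attempted (the pattern and helper name are hypothetical); the provenance strings are too irregular for this to work reliably, which is why the field was eventually left out:

```python
import re

# Hypothetical sketch of parsing MFA "Provenance" strings: try to pull out an
# acquisition type and the previous owner (a candidate value for
# ecrm:P23_transferred_title_from). Real provenance text rarely fits one pattern.
ACQUISITION = re.compile(
    r'(?P<type>Gift|Bequest|Purchase)\b.*?(?:from|of)\s+(?P<owner>[^;(]+)',
    re.IGNORECASE,
)

def parse_provenance(text: str):
    """Return (acquisition_type, previous_owner) or None when nothing matches."""
    m = ACQUISITION.search(text)
    if m:
        return m.group('type').lower(), m.group('owner').strip()
    return None
```

On free-text values like the MMA example above, the function simply gives up and returns None, which is exactly the problem.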

@rtroncy and @pasqLisena : In general I do not know which types should be in this controlled vocabulary and which should not. One way or the other they are all coming from the mappings right now.

pasqLisena commented 4 years ago

So, first of all, everything should go lowercase.

Then, for sure, we can first attack everything that has >1000 occurrences, also trying to merge multilanguage variants (e.g. "Description" and "Descripción") under a unique identifier.
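A sketch of what this normalization could look like (the variant table and concept URIs are illustrative, not the actual SILKNOW vocabulary):

```python
# Hypothetical sketch: normalize P2_has_type strings by lowercasing and
# mapping multilingual variants onto one vocabulary identifier.
# The concept URIs below are made up for illustration.
VARIANTS = {
    "description": "http://data.silknow.org/vocabulary/description",
    "descripción": "http://data.silknow.org/vocabulary/description",
    "descrizione": "http://data.silknow.org/vocabulary/description",
    "categories": "http://data.silknow.org/vocabulary/category",
    "categorías": "http://data.silknow.org/vocabulary/category",
}

def normalize_type(value: str):
    """Lowercase the raw string and look it up in the variant table;
    return a concept URI, or None if the term is still unmapped."""
    return VARIANTS.get(value.strip().lower())
```

Unmapped terms return None so they can be collected and reviewed instead of being silently converted.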

tschleider commented 4 years ago

Thanks, yes, I agree with this, but I was thinking more about the question of which concepts should be in it and which ones look completely wrong for this property (meaning they need to be left out of this vocabulary and we need to look at the mapping rules).

tschleider commented 4 years ago

I started with a controlled vocabulary solely for the P2_has_type of E17 Type Assignment, to catch occurrences of strings like "Domain", and have already implemented it. See the vocabulary file for this here.

In general I updated the Google sheet for ALL P2_has_type values in all datasets, but except for the E17 cases it is not implemented / the vocabulary is not written yet. As an explanation: everything I marked in orange is mostly terms that occur across datasets, some of them in different languages, so I started with those. I created another tab called "Concepts" as a plan for the complete TTL vocabulary.

The next steps are to finish the TTL and then to expand the implementation.

rtroncy commented 3 years ago

I'm updating this issue, which will need a lot of attention in the coming weeks. The overall goal is to replace ALL strings that are currently values of ecrm:P2_has_type by a URI, either an existing concept from an ontology we will re-use or a new term we will create in small controlled vocabularies.

The following SPARQL query shows what strings will need to be updated (results).

SELECT ?g (COUNT(?s) AS ?count) ?o 
WHERE {
  GRAPH ?g { ?s ecrm:P2_has_type ?o }
} 
GROUP BY ?g ?o
ORDER BY ?o

A first easy change is to replace all 'Dataset' by https://schema.org/Dataset.
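That replacement can be sketched as a trivial rewrite of the object value (a hypothetical helper, not the converter's code):

```python
# Hypothetical sketch: replace the plain string "Dataset" used as a
# P2_has_type value by the schema.org URI; leave everything else untouched.
def fix_dataset_type(obj_value: str) -> str:
    if obj_value.strip().lower() == "dataset":
        return "https://schema.org/Dataset"
    return obj_value
```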

rtroncy commented 3 years ago

Another low-hanging fruit is to normalize the type of the crmsci:S4_Observation made on each object.

The following SPARQL query provides a good account of the current situation (results):

select ?g ?obs_type (count(?obs) as ?nbObs)
where {
  graph ?g { ?o a ecrm:E22_Man-Made_Object }
  ?obs crmsci:O8_observed ?o .
  ?obs ecrm:P2_has_type ?obs_type .
}
group by ?g ?obs_type
order by ?g

First, we should determine with the domain experts the true set of different types of observations we have or could make. Second, we should either build a small vocabulary for those types OR create new specific, meaningful sub-properties. Third, we should ask the domain experts how those observations should be queried and visualized in ADASilk, since we do not distinguish them at the moment, which is a pity.

tschleider commented 3 years ago

Thanks to @mpuren and Pierre for the following taxonomy regarding types of observations (S4):

| Classification | Actual type |
| --- | --- |
| General observation | Clasificación Razonada |
| | Descripción (CERES) |
| | Descripción (GARIN) |
| | Description (IMATEX) |
| | Description (JOCONDE) |
| | Description (MET) |
| | Description (MFA) |
| | Description (Mobilier) |
| | Description (MTMAD) |
| | Descriptive Line |
| | Labels and date |
| | Summary |
| | Inscription (Joconde) |
| | Historical Critical Information |
| Technical observation | Descripción técnica |
| | Technical description |
| | Weft |
| | Warp |
| | Construction |
| | Physical description |
| | Indicazioni sull'oggetto |
| | Production Type (VAM) |
| | Medium (Artic) |
| | Width |
| | Pattern unit |
| Iconographical observation | note |
| | Description of the pattern |
| | Description iconographique (Paris Musees) |
| Historical observation | Historique |
| | Historical Context Note |
| | Contexto Cultural/Estilo |
| Inscription | Inscription (Joconde) |
| | Marcas (Museo de Arte Sacro El Tesoro de la Concepción) |
| | Type of inscription (Museo de Saint Etienne) |
EDIT: I updated the table with a few missing types as discussed. EDIT2: Applied the changes suggested by @mpuren
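The taxonomy above boils down to a lookup from a dataset-specific field label to one of the five classifications. A sketch with an excerpt of the labels (the fallback choice and function name are assumptions; the URI pattern follows the http://data.silknow.org/observation/ scheme mentioned later in this thread):

```python
# Hypothetical lookup derived from the taxonomy table (excerpt only).
OBSERVATION_CLASS = {
    "Descriptive Line": "general-observation",
    "Summary": "general-observation",
    "Descripción técnica": "technical-observation",
    "Technical description": "technical-observation",
    "Warp": "technical-observation",
    "Weft": "technical-observation",
    "Description of the pattern": "iconographical-observation",
    "Historique": "historical-observation",
    "Historical Context Note": "historical-observation",
}

def observation_uri(field_label: str) -> str:
    """Map a source field label to an observation concept URI,
    falling back to the general classification for unmapped labels."""
    slug = OBSERVATION_CLASS.get(field_label, "general-observation")
    return "http://data.silknow.org/observation/" + slug
```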

rtroncy commented 3 years ago

In order to close this issue, we have decided to create a small controlled vocabulary composed of 5 skos:Concept instances. The concepts will have the following labels and URI pattern:

I suggest creating an 'observation' folder at https://github.com/silknow/knowledge-base/tree/master/vocabularies with the concept scheme.

What remains to be done:

tschleider commented 3 years ago

Example for the new General Observation classification in the vocabulary: Link

rtroncy commented 3 years ago

Thanks, this has resolved the issue for properly typing the crmsci:S4_Observation. I would keep this issue open which is more general and concerns all ecrm:P2_has_type values. See again the results of the following SPARQL query:

select ?g ?obs_type (count(?obs) as ?nbObs)
where {
  graph ?g { ?o a ecrm:E22_Man-Made_Object }
  ?obs crmsci:O8_observed ?o .
  ?obs ecrm:P2_has_type ?obs_type .
}
group by ?g ?obs_type
order by ?g

tschleider commented 3 years ago

Yes, right, I will take care of the other P2 values. I pushed everything to production, so your query now gives the correct results, @rtroncy: results.

@pasqLisena : I think it's a side effect of your (otherwise very nice) VAM dimension parsing that we now again have a lot of P2_has_type values that look wrong (bottom of the list): results. I can try to change it myself, I just wanted to let you know. I think for dimensions the only types should be something like "width", "length", "height", etc.
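A sketch of the kind of normalization this would need (the type list and helper are hypothetical):

```python
# Hypothetical sketch: reduce free-text dimension labels coming out of the
# VAM parsing to a small closed set of expected dimension types.
DIMENSION_TYPES = ("width", "length", "height", "depth", "diameter")

def dimension_type(raw: str):
    """Return a canonical dimension type if the raw label contains one,
    otherwise None (such values should be reviewed, not converted)."""
    lowered = raw.lower()
    for t in DIMENSION_TYPES:
        if t in lowered:
            return t
    return None
```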

mpuren commented 3 years ago

@tschleider There are some changes to be made to the table.

tschleider commented 3 years ago

I applied the changes suggested by @mpuren (first in the table above, then in the code; they will take effect with the next conversion), but I cannot find "Marcas" and "Type of inscription" in the query results. Either we have a translation mismatch here or I need to update some mappings.

tschleider commented 3 years ago

I created a new vocabulary for dimensions as discussed: https://data.silknow.org/dimension/

Now this took care of most P2 values: Link

rtroncy commented 3 years ago

There seem to be 2 more quick wins:

How to address these 2?

tschleider commented 3 years ago
rtroncy commented 3 years ago

I'm not sure I understand your proposal. We are talking about normalizing P2_has_type values and thus creating controlled vocabularies. Can you write down a full example of your proposal?

tschleider commented 2 years ago

Forget the last sentence, I think I mixed up issues in my head.

I can indeed create another controlled vocabulary that includes all alternative spellings for both concepts respectively and run string2vocabulary on it.
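A sketch of the altLabel-style lookup that such a vocabulary enables, with illustrative spellings and URIs (the real vocabulary entries and URI patterns may differ):

```python
# Hypothetical sketch: each concept URI lists its alternative spellings
# (in SKOS terms, its altLabels); any spelling resolves to the same URI.
ALT_LABELS = {
    "http://data.silknow.org/activity/designer": [
        "designer", "Designer", "dessinateur", "diseñador",
    ],
    "http://data.silknow.org/information-object/bibliography": [
        "bibliography", "Bibliography", "bibliographie", "bibliografía",
    ],
}

# Invert the table into a case-insensitive lookup.
LOOKUP = {
    label.lower(): uri
    for uri, labels in ALT_LABELS.items()
    for label in labels
}

def resolve(term: str):
    """Map an alternative spelling to its concept URI, or None."""
    return LOOKUP.get(term.strip().lower())
```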

tschleider commented 2 years ago

I created two new vocabularies for the classes "activity" and "information object", with an entry for "designer" and "bibliography" respectively.

Based on this the P2 list looks cleaner now: Link

There is still one case of "author" from Garin that needs to be fixed, but Garin is not supposed to be online anyway (both will be fixed by tomorrow).

rtroncy commented 2 years ago

I'm seeing a "Description analytique" from the graph http://data.silknow.org/graph/musee-st-etienne that probably needs to be converted to http://data.silknow.org/observation/general-observation or http://data.silknow.org/observation/technical-observation.

Pay attention: http://data.silknow.org/observation/xxx URIs are not dereferenceable. You need to add the corresponding rules.

Why do we not have further information when looking up http://data.silknow.org/assignment/object_domain_assignment and http://data.silknow.org/assignment/object_type_assignment?

Some more low hanging fruits:

tschleider commented 2 years ago

I have not had the time to address your comments and questions, but I'm just writing here to highlight a bug that still exists: in some cases the string value of P2 is not properly replaced, or exists together with the string:

https://data.silknow.org/production/fde5d77f-2198-3d43-a39c-0b53cb87baa1/activity/2

It's pretty weird, because that is not what the TTL looks like. This needs further investigation.