silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)
Apache License 2.0

Handling fuzzy production time spans #66

Closed rtroncy closed 3 years ago

rtroncy commented 3 years ago

We now systematically create time spans for production, and we try to interpret and define those time spans in terms of beginning and end, attaching them to centuries when possible. This slide gives a good account of the situation per museum.

We should better represent when the production time is uncertain. A typical word used in the record is circa. Using CIDOC-CRM, we can use the property ecrm:P79_beginning_is_qualified_by attached to the time span like in doremus.

As a quality check, we should count the number of objects that do NOT have a timespan attached to the production, and the number of objects that have a timespan but no time:hasBeginning, which means we do not yet know how to interpret those time spans.

We should get rid of the timespan https://data.silknow.org/timespan/null_null (present, e.g. in http://data.silknow.org/graph/paris-musees)

The following query is useful because it lists all the timespans we have created for each museum (results):

PREFIX ecrm: <http://erlangen-crm.org/current/>

SELECT DISTINCT ?g ?ts
WHERE {
  GRAPH ?g {
    ?o a ecrm:E22_Man-Made_Object .
  }
  ?prod ecrm:P108_has_produced ?o .
  OPTIONAL { ?prod ecrm:P4_has_time-span ?ts }
}
ORDER BY ?ts

Sub tasks:

(The last point does not include MAD, which is to be handled in #5.)

pasqLisena commented 3 years ago

We should better represent when the production time is uncertain. A typical word used in the record is circa.

Regex used for uncertain dates:

https://github.com/silknow/converter/blob/863c03611a527048608720363b18973308849d82/src/main/java/org/silknow/converter/entities/Production.java#L15

To those cases, we may want to add "or later", "or earlier", "before" and "after".
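
As a rough illustration of what detecting these qualifiers could look like, here is a minimal Python sketch. It is not the regex from Production.java (which is not reproduced in this thread); the `QUALIFIERS` table, its labels, and the `find_qualifiers` helper are all hypothetical names chosen for the example.

```python
import re

# Hypothetical mapping from a qualifier found in a raw date string to an
# uncertainty label. "circa" comes from the issue text; the other four are
# the additions proposed above. The label names are assumptions.
QUALIFIERS = {
    "circa": "approximate",
    "or later": "or_later",
    "or earlier": "or_earlier",
    "before": "before",
    "after": "after",
}

# One case-insensitive alternation, anchored on word boundaries so that
# e.g. "afterwards" does not count as "after".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(q) for q in QUALIFIERS) + r")\b",
    re.IGNORECASE,
)


def find_qualifiers(text):
    """Return the set of uncertainty labels found in a raw date string."""
    return {QUALIFIERS[m.group(1).lower()] for m in _PATTERN.finditer(text)}
```

For instance, `find_qualifiers("circa 1750")` yields `{"approximate"}`, while a plain `"1787"` yields an empty set, so the date can be treated as certain.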

Using CIDOC-CRM, we can use the property ecrm:P79_beginning_is_qualified_by attached to the time span like in doremus.

A big difference between DOREMUS and SILKNOW is that in SILKNOW the same timespan is shared among different objects.

Possible solutions:

  1. Separate the time spans (at least when there is uncertainty) => Back to the DOREMUS case
  2. Mark the uncertainty directly in E12 Production, using ecrm:P79_beginning_is_qualified_by (improperly), P2 has type or any other annotating property
  3. Get rid of TS in favour of a combination of hutime:UncertainTimeInterval and time:ProperInterval. This also gives more flexibility, with properties such as hasPossibleBeginning, hutime:hasReliableBeginning, etc. See also this paper. This solution also implies solution 1.
  4. Use different properties for linking E12 Production to Time Spans, keeping P4 has time-span for certain dates. Unfortunately, I am not able to find appropriate properties. We could extend CIDOC with has uncertain time-span and (if needed) before time-span, after time-span, in or after time-span, in or before time-span (all sub-properties of P4)

pasqLisena commented 3 years ago

Timespans to be deleted:

New cases to be parsed (among all unparsed timespans):

Other

How do we deal with productions that have several different dates? E.g. "Designed 1786; Woven 1787–91"

rtroncy commented 3 years ago

Thanks a lot @pasqLisena for this comprehensive investigation as well as the sketches of possible solutions. I also investigated further. A relevant pointer, first, is https://www.loc.gov/standards/datetime/ (the so-called new EDTF format).

I searched a lot in the LinkedArts community, since they inherit from CIDOC-CRM practices. The LinkedArts suggested model for timespans is described at https://linked.art/model/base/#time-span-details. Read also:

At the moment, I'm hesitating between:

  1. adopting the HuTime Ontology, https://arxiv.org/abs/1905.04611 (your solution 3)
  2. staying in the pure CIDOC-CRM world, creating different URIs when the timespan is uncertain (your solution 1) and making use of the properties P82a_begin_of_the_begin, P82b_end_of_the_end, P81_ongoing_throughout, P82_at_some_time_within, etc.

Thoughts?

pasqLisena commented 3 years ago

My thoughts:

rtroncy commented 3 years ago

OK, I also agree to keep Solution 1 and to create "more" URIs identifying timespans: basically, a new URI each time we encounter a fuzzy timespan. I also agree to keep using the simpler start-end time of the timespan and to NOT use the newer P82a and P82b properties. I would use the EDTF notation in the label of the timespan.

pasqLisena commented 3 years ago

basically a new URI each time we encounter a fuzzy timespan

How do we create this? I think that we can use EDTF or, better, a transposition of it, in order to avoid special URI symbols (like ? or /). Examples:

Otherwise, we can return to UUIDs, using EDTF as the seed (but this would again invalidate the timespan URIs).
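
To make the two options concrete, here is a minimal Python sketch of both. The character mapping (`_` for the interval separator, `q` for `?`, `c` for `~`, `qc` for `%`) and the function names are assumptions made for illustration only; they are not the scheme adopted in the converter. The base URI is the one used by the existing timespans.

```python
import uuid

BASE = "https://data.silknow.org/timespan/"

# Hypothetical transposition of EDTF symbols that are awkward in a URI path.
# "/" (interval separator) -> "_" keeps existing URIs such as 1970_1979.
# "?" (uncertain) -> "q", "~" (approximate) -> "c", "%" (both) -> "qc".
_TRANSPOSE = {"/": "_", "?": "q", "~": "c", "%": "qc"}


def edtf_to_uri(edtf: str) -> str:
    """Option A: human-readable URI derived directly from the EDTF string."""
    return BASE + "".join(_TRANSPOSE.get(ch, ch) for ch in edtf)


def edtf_to_uuid_uri(edtf: str) -> str:
    """Option B: opaque but deterministic URI, using the EDTF string as seed."""
    return BASE + str(uuid.uuid5(uuid.NAMESPACE_URL, edtf))
```

With this mapping a certain interval such as `1970/1979` keeps its current URI (`.../1970_1979`), whereas `1750~` becomes `.../1750c`, so the uncertainty is visible in the URI itself. The UUID variant is stable across runs for the same EDTF string but is not readable.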

I would use the EDTF notation in the label of the timespan.

At the moment, we use skos:prefLabel ("clean" unique label) and ecrm:P78_is_identified_by (all encountered labels). https://data.silknow.org/timespan/1970_1979

I propose to keep these two and add an rdfs:label with EDTF.

rtroncy commented 3 years ago

I'm wondering if we could not have a mixed approach for generating URIs for timespans, namely:

The advantage of this solution is that a human would also immediately see whether it is certain or not by looking at the URI, but maybe this is over complicated to implement?

+1 for your labeling proposal for the timespan.

pasqLisena commented 3 years ago

The advantage of this solution is that a human would also immediately see whether it is certain or not by looking at the URI, but maybe this is over complicated to implement?

It's not complicated. I will proceed to implement this.