silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)
Apache License 2.0

VAM: Dimension field sometimes contains whole sentences, several heights and widths, or pattern unit information #72

Closed tschleider closed 3 years ago

tschleider commented 3 years ago

The dimensions field is currently not fully parsed when it contains more than one height and one width.

Examples:

O10267.json: dimensions: "Length: 34 cm, Width: 66.6 cm open, Length: 13.75 in, Width: 26.5 in open"

O10418.json: dimensions: "Length: 110 cm, Width: 5.5 cm, Length: 44.5 cm repeat"
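To make the two examples concrete, here is a minimal sketch of how such strings could be split into separate dimensions. It is not the converter's actual regex: the `Dimension` type, the `SEGMENT` pattern and `parse_dimensions` are hypothetical names, and the sketch assumes every segment has the shape "Label: number unit [note]" and that commas only separate segments.

```python
import re
from typing import NamedTuple, Optional


class Dimension(NamedTuple):
    kind: str            # e.g. "length", "width"
    value: float         # numeric value as written in the record
    unit: str            # unit token as written, e.g. "cm", "in"
    note: Optional[str]  # trailing qualifier such as "open" or "repeat"


# Assumed segment shape: "<Label>: <number> <unit> [free-text note]"
SEGMENT = re.compile(
    r"^\s*(?P<kind>[A-Za-z]+)\s*:\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z]+)"
    r"(?:\s+(?P<note>.+?))?\s*$"
)


def parse_dimensions(text: str) -> list[Dimension]:
    """Split a dimensions string on commas and parse each segment."""
    result = []
    for segment in text.split(","):
        match = SEGMENT.match(segment)
        if match is None:
            continue  # segment does not fit the assumed shape; skip it
        result.append(Dimension(
            kind=match["kind"].lower(),
            value=float(match["value"]),
            unit=match["unit"],
            note=match["note"],
        ))
    return result


for dim in parse_dimensions("Length: 110 cm, Width: 5.5 cm, Length: 44.5 cm repeat"):
    print(dim)
```

Each segment yields its own value, unit and note, so the second "Length" in O10418.json is kept apart from the first. Whole sentences or decimal commas would break the comma split and need extra handling.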

pasqLisena commented 3 years ago

I think this is a follow-up to #8, which was solved in e40223b.

Both your examples seem to be correctly captured by the dimension regex. What is the output of the current code?

rtroncy commented 3 years ago

In the first example (O10267.json), the values in inches are the same as those in cm, just converted to a different unit. By the way, how do we manage units for dimensions in the KG?

In the second example (O10418.json), the interesting keyword is "repeat". A string2vocabulary call should be made and should match https://data.silknow.org/vocabulary/444! This SKOS concept should trigger the creation of an additional path in the KG, since it implies a new instance of the T24_Pattern_Unit class (see also the comment on page 25 in this doc).
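The trigger described above could look roughly like the sketch below. It does not use the real string2vocabulary API: `VOCABULARY`, `match_concept` and `describes_pattern_unit` are hypothetical stand-ins, and only the "repeat" entry and its concept URI come from this discussion.

```python
from typing import Optional

# Hypothetical stand-in for a vocabulary lookup: label -> SKOS concept URI.
# Only this single entry is taken from the discussion.
VOCABULARY = {
    "repeat": "https://data.silknow.org/vocabulary/444",
}

# Concept whose presence signals a pattern unit (T24_Pattern_Unit).
PATTERN_UNIT_CONCEPT = "https://data.silknow.org/vocabulary/444"


def match_concept(note: Optional[str]) -> Optional[str]:
    """Return the concept URI matching a dimension note, if any."""
    if note is None:
        return None
    return VOCABULARY.get(note.strip().lower())


def describes_pattern_unit(note: Optional[str]) -> bool:
    """True when the note maps to the pattern-unit concept."""
    return match_concept(note) == PATTERN_UNIT_CONCEPT


print(describes_pattern_unit("repeat"))  # note from O10418.json
print(describes_pattern_unit("open"))    # note from O10267.json
```

A converter could use such a check to decide whether a parsed dimension belongs to the object itself or to a new pattern unit instance.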

@tschleider Can you post in this GitHub issue the URIs of those two E22 instances in the KG?

pasqLisena commented 3 years ago

> By the way, how do we manage units for dimensions in the KG?

CIDOC-CRM offers a P91 has unit property, used e.g. here: https://data.silknow.org/object/622582e4-39aa-3888-bc68-3568995c0e1c/dimension/w

rtroncy commented 3 years ago

Good, but this only covers the representation. We do not attempt to convert all dimension values into a common unit? Consequently, we would not be able to filter a query by dimension size in a straightforward manner, right?

pasqLisena commented 3 years ago

Indeed, this is a weak point. Since the number of units in use is limited, we can aim to convert everything to cm.

tschleider commented 3 years ago

@pasqLisena: You're right, I should at least have referred to #8, which fixed a lot.

@rtroncy: Thanks for finishing the explanation, here are the two URIs:

rtroncy commented 3 years ago

> Indeed, this is a weak point. Since the number of units in use is limited, we can aim to convert everything to cm.

It clearly is. And worse, we are collapsing values and units: if I look at http://data.silknow.org/object/7367a77e-72fc-3b0a-a85a-2aad7289bf95/dimension/l, I see two values (13.75 and 34) and two units (cm and in), so we cannot tell which value is expressed in which unit! This should be fixed.

I'm also following up on the mailing list (message) to better understand what usage of the dimensions we foresee in ADASilk. I can imagine that this could become a sorting criterion (from the biggest to the smallest object).

pasqLisena commented 3 years ago

So these are the units extracted so far:

| unit | count | what |
|------|-------|------|
| cl | 1 | volume |
| repe | 2 | ERROR: should be "repeated" |
| ft | 18 | length |
| in | 3324 | length |
| mini | 1 | ERROR: should be "minimum" |
| troy | 1 | weight |
| mm | 1199 | length |
| kg | 310 | weight |
| a | 1 | ERROR: has to be ignored |
| cm | 58136 | length |
| maxi | 2 | ERROR: should be "maximum" |
| m | 16 | length |
| lb | 1 | weight |
| g | 2 | weight |
| bott | 1 | ERROR: should be "bottom" |

I would use: cm for length, kg for weight and cl (unique) for volume
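A minimal sketch of that normalization, assuming standard conversion factors (1 in = 2.54 cm, 1 ft = 30.48 cm, 1 lb = 0.45359237 kg). The names `TO_CM`, `TO_KG` and `normalize` are hypothetical, not the converter's code. "troy" is deliberately left out because the token alone does not say whether a troy ounce or a troy pound is meant, and the erroneous tokens (repe, mini, maxi, bott, a) are treated as unknown.

```python
from decimal import Decimal
from typing import Optional, Tuple

# Multiplication factors towards the target unit of each family.
TO_CM = {
    "cm": Decimal("1"),
    "mm": Decimal("0.1"),
    "m": Decimal("100"),
    "in": Decimal("2.54"),
    "ft": Decimal("30.48"),
}
TO_KG = {
    "kg": Decimal("1"),
    "g": Decimal("0.001"),
    "lb": Decimal("0.45359237"),
}


def normalize(value: str, unit: str) -> Optional[Tuple[Decimal, str]]:
    """Convert to cm (length) or kg (weight); keep cl; None if unknown."""
    amount = Decimal(value)  # Decimal avoids binary floating-point noise
    if unit in TO_CM:
        return amount * TO_CM[unit], "cm"
    if unit in TO_KG:
        return amount * TO_KG[unit], "kg"
    if unit == "cl":
        return amount, "cl"
    return None  # e.g. "repe", "mini", "troy": needs separate handling


print(normalize("13.75", "in"))
print(normalize("2", "repe"))
```

With every length in cm, a filter or sort on the dimension value no longer has to inspect the unit. The first call returns a value equal to `Decimal("34.925")` in cm.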

There are also cases like this: https://ada-preprod.silknow.org/describe/?url=http%3A%2F%2Fdata.silknow.org%2Fobject%2Fa00e8e99-b858-3cc1-9d2d-c40cf91180e5%2Fdimension%2Fw Here, three widths have to be represented (with different URIs): min, max and border.

The same goes for "width" and "weight", which cannot share the same letter "w" in the URI: https://ada-preprod.silknow.org/describe/?url=http://data.silknow.org/object/39658ae7-c3e1-3cf1-a115-ff5528f9a369/dimension/w

I am planning a 2nd round of development in the coming days:

pasqLisena commented 3 years ago

I did some modifications.

Current output for O10267:

<http://data.silknow.org/object/7367a77e-72fc-3b0a-a85a-2aad7289bf95/dimension/1>
        a                   ecrm:E54_Dimension ;
        rdfs:label          "Length: 34 cm" ;
        ecrm:P2_has_type    "length" ;
        ecrm:P90_has_value  "34"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

<http://data.silknow.org/object/7367a77e-72fc-3b0a-a85a-2aad7289bf95/dimension/2>
        a                   ecrm:E54_Dimension ;
        rdfs:comment        "open" ;
        rdfs:label          "Width: 66.6 cm open" ;
        ecrm:P2_has_type    "width" ;
        ecrm:P3_has_note    "open" ;
        ecrm:P90_has_value  "66.6"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

<http://data.silknow.org/object/7367a77e-72fc-3b0a-a85a-2aad7289bf95/dimension/3>
        a                   ecrm:E54_Dimension ;
        rdfs:label          "Length: 13.75 in" ;
        ecrm:P2_has_type    "length" ;
        ecrm:P90_has_value  "34.925"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

Output for O10418:

<http://data.silknow.org/object/df126873-b14d-325d-93ad-2e79a14c1730/dimension/1>
        a                   ecrm:E54_Dimension ;
        rdfs:label          "Length: 110 cm" ;
        ecrm:P2_has_type    "length" ;
        ecrm:P90_has_value  "110"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

<http://data.silknow.org/object/df126873-b14d-325d-93ad-2e79a14c1730/dimension/2>
        a                   ecrm:E54_Dimension ;
        rdfs:label          "Width: 5.5 cm" ;
        ecrm:P2_has_type    "width" ;
        ecrm:P90_has_value  "5.5"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

<http://data.silknow.org/object/df126873-b14d-325d-93ad-2e79a14c1730/dimension/3>
        a                   ecrm:E54_Dimension ;
        rdfs:comment        "repeat" ;
        rdfs:label          "Length: 44.5 cm repeat" ;
        ecrm:P2_has_type    "length" ;
        ecrm:P3_has_note    "repeat" ;
        ecrm:P90_has_value  "44.5"^^xsd:float ;
        ecrm:P91_has_unit   "cm" .

(Note that I now keep the original string in rdfs:label.)

The URI scheme has only been changed for VAM. The other sources still use w and h, since they have no overwriting problem.

What do you think?

rtroncy commented 3 years ago

This is good. I would apply the same URI pattern to all dimensions, i.e. http://data.silknow.org/object/[UUID]/dimension/[count]
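As an illustration of that pattern, the sketch below builds counter-based dimension URIs and contrasts them with letter-based keys. `dimension_uris` and `letter_keys` are hypothetical helpers; the counter starts at 1, as in the output shown above.

```python
BASE = "http://data.silknow.org/object"


def dimension_uris(object_uuid: str, count: int) -> list[str]:
    """One URI per parsed dimension, numbered from 1."""
    return [
        f"{BASE}/{object_uuid}/dimension/{index}"
        for index in range(1, count + 1)
    ]


def letter_keys(kinds: list[str]) -> set[str]:
    """Old-style keys: first letter of the dimension type."""
    return {kind[0] for kind in kinds}


kinds = ["length", "width", "length"]  # dimension types of O10418
uris = dimension_uris("df126873-b14d-325d-93ad-2e79a14c1730", len(kinds))
for uri in uris:
    print(uri)

# Three dimensions get three distinct URIs, but only two letter keys,
# so the two lengths would share /dimension/l.
print(len(uris), len(letter_keys(kinds)))
```

The counter keeps repeated types (two lengths) and clashing initials (width and weight) apart without any per-type rule.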

pasqLisena commented 3 years ago

The parsing part can be considered complete.

What is still missing is the connection with the Patterns:

> In the second example (O10418.json): the interesting keyword is repeat. A string2vocabulary call should be made and match https://data.silknow.org/vocabulary/444! This SKOS concept should be the trigger to an additional path creation in the KG as it means a new instance of the T24_Pattern_Unit class (see also the comment on page 25 in this doc)

@tschleider, can you take over from here?

tschleider commented 3 years ago

I'm almost done with the patterns, see #74

rtroncy commented 3 years ago

One potential issue I'm seeing is that now, instances of the E54_Dimension class can be found:

Correct? In any case, the URI patterns really need to be updated!

tschleider commented 3 years ago

Correct. If this is a problem, what could be a solution?

I'll update the URI policy

rtroncy commented 3 years ago

It is not necessarily an issue, as long as it is clear that dimensions of objects are distinct from dimensions of pattern units and that dimensions are never primary entities.

tschleider commented 3 years ago

Yes, that's the case. I'll update the pattern policy file and close this issue.

tschleider commented 3 years ago

Copied the section about the two possible E54_Dimension patterns into the URI policy. This issue can therefore be closed.