silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)
Apache License 2.0

Integrate the predictions of the Image and Text analysis modules in the KG using PROV #68

Closed rtroncy closed 2 years ago

rtroncy commented 3 years ago

In this Google Doc, we discussed 3 proposals for modeling the results of the Image and Text analysis modules, which predict values for production place, production time, material, technique and depiction.

In terms of modeling, we have decided to adopt the 3rd variant using PROV.

What we need from @lrei and @DominicClermont is the following information:

Prior to this, @tschleider needs to generate the test files, one for each of the 5 dimensions, that will be used to generate predictions. Below, I create the test file for the Production Time dimension with the following SPARQL query (results):

SELECT DISTINCT ?g ?o ?l
WHERE {
  GRAPH ?g {
     ?o a ecrm:E22_Man-Made_Object .
  }
  ?o rdfs:label ?l .
  ?p ecrm:P108_has_produced ?o .
  OPTIONAL {?p ecrm:P4_has_time-span ?ts}
  FILTER (!bound(?ts))
}
ORDER BY ?g
rtroncy commented 3 years ago

The values of rdfs:comment (for the text) as well as of ecrm:P138i_has_representation/schema:contentUrl should also be added to the SELECT.

tschleider commented 3 years ago

Taking your last comment into account, I changed the above query to the following and adapted it for all 5 dimensions accordingly:

SELECT DISTINCT ?g ?o ?l ?c ?image
WHERE {
  GRAPH ?g {
     ?o a ecrm:E22_Man-Made_Object .
  }
  ?o rdfs:label ?l .
  ?o rdfs:comment ?c . 
  ?o ecrm:P138i_has_representation ?i .
  ?i schema:contentUrl ?image .
  ?p ecrm:P108_has_produced ?o .
  OPTIONAL {?p ecrm:P32_used_general_technique ?ts}
  FILTER (!bound(?ts))
}

Here is the link to all the CSV files on Google Drive: https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf

rtroncy commented 3 years ago

Thanks @tschleider ! Can you provide some stats on how many objects there are in each test file (i.e. per dimension), and maybe per museum as well? A table in a GitHub comment would do the job, with 1 museum per row and each of the 5 dimensions as a column.

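tschleider commented 3 years ago

| Museum       | Material | Depictions | Place | Time | Technique |
|--------------|----------|------------|-------|------|-----------|
| VAM          | 1        | 7488       | 231   | 0    | 3384      |
| Garin        | 589      | 2951       | 0     | 31   | 132       |
| MFA          | 211      | 2478       | 0     | 241  | 2322      |
| Venezia      | 3        | 1156       | 827   | 0    | 1149      |
| Mobilier     | 0        | 964        | 964   | 173  | 960       |
| MET          | 1        | 835        | 0     | 4    | 432       |
| MTMAD        | 650      | 650        | 650   | 650  | 650       |
| IMATEX       | 0        | 509        | 932   | 61   | 5         |
| UNIPA        | 438      | 438        | 410   | 1    | 1         |
| CERES        | 0        | 191        | 527   | 0    | 48        |
| Smithsonian  | 0        | 146        | 0     | 0    | 146       |
| Paris Musees | 0        | 64         | 0     | 0    | 76        |
| Versailles   | 0        | 56         | 69    | 0    | 69        |
| Joconde      | 0        | 12         | 4     | 0    | 374       |
| TOTAL        | 1893     | 17938      | 4715  | 1161 | 9748      |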

EDIT: I just added the Python script that creates these stats from the CSV files that are now on Google Drive.
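
The script itself is not reproduced in this thread. As a rough illustration only, per-museum counts like those in the table could be derived from the named-graph column of each test CSV along the following lines. The column name `g`, the layout of the graph URIs, and the `museum_from_graph` helper are assumptions, not taken from the actual script.

```python
import csv
import io
from collections import Counter


def museum_from_graph(graph_uri: str) -> str:
    """Hypothetical helper: use the last path segment of the graph URI as the museum key."""
    return graph_uri.rstrip("/").rsplit("/", 1)[-1]


def count_per_museum(csv_file, graph_column: str = "g") -> Counter:
    """Count the rows (objects) per museum in one test file."""
    reader = csv.DictReader(csv_file)
    return Counter(museum_from_graph(row[graph_column]) for row in reader)


def stats_table(files_by_dimension: dict) -> dict:
    """Build {museum: {dimension: count}} from one open CSV file per dimension."""
    table = {}
    for dimension, csv_file in files_by_dimension.items():
        for museum, n in count_per_museum(csv_file).items():
            table.setdefault(museum, {})[dimension] = n
    return table


# Tiny in-memory stand-ins for two test files (made-up rows, not real data).
sample_time = io.StringIO(
    "g,o,l\nhttp://example.org/graph/vam,o1,a\nhttp://example.org/graph/vam,o2,b\n"
)
sample_place = io.StringIO("g,o,l\nhttp://example.org/graph/mfa,o3,c\n")
table = stats_table({"Time": sample_time, "Place": sample_place})
print(table)
```

With real files, the in-memory samples would be replaced by `open(path, newline="", encoding="utf-8")` handles, one per dimension.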

lrei commented 3 years ago

I'm fixing this on my side, so it's no big problem, but these should probably be filed as "non-urgent" issues, if for nothing else than documentation's sake:

  1. The file text encoding is messed up. Decoding as UTF-8 crashes, while ISO-8859-1 yields "bad" characters (running ftfy fixes it on my end).
  2. Field names are different from total_post.csv (train data) which means I need to interpret them and convert them.
  3. "l" and "c" both seem like text fields (I'm merging them)
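
The symptom in point 1 (UTF-8 text that was decoded as ISO-8859-1) can often be reversed with the standard library alone. The following is a minimal sketch of that idea, not the ftfy-based fix mentioned above; the helper names `repair_mojibake` and `read_text_lenient` are made up for illustration, and the repair only covers this one kind of damage.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as ISO-8859-1.

    Returns the input unchanged when the round trip is not possible,
    e.g. because the text was already decoded correctly.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def read_text_lenient(raw: bytes) -> str:
    """Decode as UTF-8 if possible, otherwise fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


broken = "Caf\u00c3\u00a9"  # an e-acute that went through the wrong decoder
print(repair_mojibake(broken))
```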
tschleider commented 3 years ago

@lrei :

  1. It's indeed not in UTF-8, but it's easy to fix that (just open them with any editor as UTF-8 and save again)
  2. I can change the field names to align with total_post.csv
  3. That's correct

I will have another look and try to provide data that's easier to use.

lrei commented 3 years ago

@tschleider I'll use this, so it's fine. If you do look into it, I suggest you prioritise the encoding issue :)

tschleider commented 3 years ago

The SPARQL queries that generate the CSV files are at https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf and on GitHub.

I adjusted the variable names, but you can easily replace them now if something is still off. I checked the UTF-8 problem: the files are encoded correctly at the source, so maybe Google Drive messes with the encoding. In that case, just run the queries yourself (or convert the files).

Dorozynski commented 3 years ago

I prepared the predictions of the image classification module for the test samples. All predictions can be found in the folder https://drive.google.com/drive/u/0/folders/19vjS282Icr49Xd1YpnpA8CgY3ja0HQ-g, where the predictions for the five variables are contained in the five "sys_integrationpred[variable].csv" files.

tschleider commented 3 years ago

A first version (of the integration of the text analysis for predicting missing values) has been deployed; see for example https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c

For the integration of the predictions from the image analysis, I just have a small issue with the depiction property.

  1. Instantiation of the rdf:Statement class:
     http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int]
     http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int]
     http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]
  2. Instantiation of the prov:SoftwareAgent class (comment: I think we don't need [int] here, because this is ultimately just for the explanation of the algorithm / prediction method, which still needs to be added):
     http://data.silknow.org/actor/luh-image-analysis/[int]
     http://data.silknow.org/actor/jsi-text-analysis/[int]
     http://data.silknow.org/actor/silknow-joint-analysis/[int]
  3. Instantiation of prov:Activity:
     http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int]/generation
     http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int]/generation
     http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]/generation
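
To make the URI patterns listed above concrete, here is a small sketch that fills them in. The function names and the example values are illustrative assumptions; the actual post-processing is done with construct queries that are not shown in this thread.

```python
BASE = "http://data.silknow.org"
SOURCES = {"text", "image", "combined"}


def statement_uri(object_uuid: str, source: str, property_type: str, row: int) -> str:
    """URI of the rdf:Statement that reifies one prediction."""
    if source not in SOURCES:
        raise ValueError(f"unknown prediction source: {source}")
    return f"{BASE}/prediction/{object_uuid}/{source}/{property_type}/{row}"


def activity_uri(object_uuid: str, source: str, property_type: str, row: int) -> str:
    """URI of the prov:Activity that generated the statement."""
    return statement_uri(object_uuid, source, property_type, row) + "/generation"


def agent_uri(agent_name: str, version: int) -> str:
    """URI of a prov:SoftwareAgent, e.g. agent_name='jsi-text-analysis'."""
    return f"{BASE}/actor/{agent_name}/{version}"


# Placeholder values only: the UUID is borrowed from the production URL quoted
# in this thread, and "time" / 85 stand in for a property type and a row number.
example_uuid = "fdda3160-1cc6-30cc-8456-b54d04a5bd7c"
print(statement_uri(example_uuid, "text", "time", 85))
print(activity_uri(example_uuid, "text", "time", 85))
print(agent_uri("jsi-text-analysis", 1))
```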

rtroncy commented 3 years ago

Review:

tschleider commented 3 years ago

I think you have made the URI pattern for rdf:Statement far too complicated. I would opt for something much simpler, as there are no semantics in this statement (this is a reification). I would go for http://data.silknow.org/statement/[NEW_UUID] instead. Of course, we need to define the seed used to generate this NEW_UUID.

This makes sense, but the reason for the complicated pattern is that I'm just re-using an existing one, because otherwise the UUID has to be generated in the Python script that uses the construct query. It would be easiest (from an implementation point of view) to generate the seed from, e.g., the prediction CSV. I could, however, replace /prediction/ with /statement/ and re-use the existing UUID?
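
One way to get a reproducible NEW_UUID without keeping any state is a name-based UUID computed from a seed string. The sketch below assumes the seed is built from the object UUID, the prediction source, the property and the CSV row; that composition and the choice of namespace are assumptions for illustration, not something agreed in this thread.

```python
import uuid


def statement_uuid(object_uuid: str, source: str, property_type: str, row: int) -> uuid.UUID:
    """Derive a deterministic (version 5) UUID from a seed string."""
    seed = f"{object_uuid}|{source}|{property_type}|{row}"  # assumed seed layout
    return uuid.uuid5(uuid.NAMESPACE_URL, seed)


def simple_statement_uri(object_uuid: str, source: str, property_type: str, row: int) -> str:
    """Build the simplified statement URI from the derived UUID."""
    new_uuid = statement_uuid(object_uuid, source, property_type, row)
    return f"http://data.silknow.org/statement/{new_uuid}"


first = simple_statement_uri("fdda3160-1cc6-30cc-8456-b54d04a5bd7c", "text", "time", 85)
second = simple_statement_uri("fdda3160-1cc6-30cc-8456-b54d04a5bd7c", "text", "time", 85)
print(first == second)  # the same seed always yields the same URI
```

Because the UUID depends only on the seed, re-running the post-processing produces the same statement URIs as long as the seed inputs do not change.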

I think you have also overcomplicated the URI pattern for the prov:Activity. In short, I don't think you need the [property_type], since the [int] already plays this role and we don't necessarily want to separate, in the URI, the predictions from textual analysis for material from those for technique. The same goes for the /generation suffix you append. Of course, this requires a robust counter for the [int]. How is this implemented in practice? How is the value of this [int] assigned?

Yes, OK, an explanation is missing. The reason I did this is that I have one script per property, and the counter always starts from 0 with the first row of each CSV, so it's not per object. I tried to keep it simple, as it's just a post-processing script. If you want it to collapse correctly, maybe it should not have a counter either? Otherwise we need some more sophisticated code to keep track of everything across scripts (merge all CSVs, sort the rows, make one super script out of it, etc.).
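
If a counter is kept, one way to make it robust across the per-property scripts is to number predictions per object rather than per CSV row. A minimal sketch, assuming each prediction row can be reduced to an (object, property) pair; the data layout and names are hypothetical:

```python
from collections import defaultdict
from itertools import count


def number_predictions(rows):
    """Assign each (object_id, property_type) row a counter that restarts per object.

    Rows are sorted first so the numbering does not depend on the order in
    which the per-property CSV files happen to be read.
    """
    counters = defaultdict(lambda: count(1))
    numbered = []
    for object_id, property_type in sorted(rows):
        numbered.append((object_id, property_type, next(counters[object_id])))
    return numbered


rows = [("obj-b", "technique"), ("obj-a", "time"), ("obj-a", "material")]
for item in number_predictions(rows):
    print(item)
```

The trade-off is the one described above: all rows have to be merged and sorted in one place before any URI is minted.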

You do need an [int] for the instance of prov:SoftwareAgent, simply because there will be different instances of http://data.silknow.org/actor/luh-image-analysis/, e.g. one with hyperparameter x=i and another one with x=k, etc.

Here, I didn't know we would have different instances for different parameters, but that's no problem, except for what I just wrote above.

Indeed, you need to add a dereferencing rule for https://data.silknow.org/ontology/ ! Take care with the URI: in the GitHub issue you're writing http://data.silknow.org/ontology/property/L18 while in the code I'm seeing http://data.silknow.org/ontology/L18 ! The former is actually right, although it will be changed in the next version of the ontology.

Done for dereferencing. For the other part I will take more care indeed, but any change here is relatively simple.

The definition of the L18 property is not right. You're writing that the rdfs:range is an rdf:Statement while it is obviously an xsd:float. Has this property been added on OntoMe?

I was imitating the style of the ontology, where every property has a class as rdfs:range, and I also checked the definition here: https://www.w3.org/TR/rdf-schema/#ch_range . "rdfs:range is an instance of rdf:Property that is used to state that the values of a property are instances of one or more classes." Maybe I got something wrong, but I don't understand how it would be xsd:float?

Do you still have issues with the image analysis prediction integration (with the depiction property)?

It's not a real issue, it just takes the longest time and the scripts are not very optimised.

Following my nose from https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c, it is the subject of 4 statements, which should simply be of the form http://data.silknow.org/statement/UUID (see above).

I understand the problem. I wrote everything I can say about it above. Let me know what you think.

For time, where does the 85 (as int) come from in the URI?

It's just row 85 in the time prediction csv.

The prov:wasGeneratedBy links to the prov:Activity. I can see the text which is prov:used. What value do you consider for the prov:atTime?

Just the creation time of the prediction CSVs.

It is prov:wasAssociatedWith the Software Agent. However, look at the rdf:type of this resource, the 't' is missing.

Right! Will be fixed.

What should be the value of the P70_documents property?

I think we agreed on an explanation from LUH and JSI? I didn't ask them for it yet, as I thought this can be done by them if the rest stands.

tschleider commented 3 years ago

Let me know if you have any questions. Unless there are bigger changes again in the KG you will be able to re-generate all these files as much as you want for a while now.

tschleider commented 3 years ago

@lrei @Dorozynski Luis made me aware of an issue: I kept the first version of the predictions in the KG when I generated the new file(s). Sorry for the inconvenience, I will fix that today and send the new file(s).

tschleider commented 3 years ago

@lrei @Dorozynski I updated the CSV and created new test files for the predictions.

See here: https://github.com/silknow/converter/tree/master/jointtextimagemodule

The test-file predictions are in the folder (the CSV files) and the usual total_post.csv is inside the results.7z file (together with lots of statistics CSVs).

Dorozynski commented 3 years ago

@tschleider I generated the predictions for the test files that can be found here: https://drive.google.com/drive/u/0/folders/1ytuUWiyaW5v1jkDqVUQdDhMa2i51MKpi

Details about the hyperparameter tuning and the dataset can be found in chapter 3 of D6.7 (soon available).

lrei commented 3 years ago

@tschleider As you requested I'm commenting on this issue - I've also sent the e-mail and notified you on slack:

https://drive.google.com/drive/folders/1lCu_CtOs9Fg63T6qXB2Dj3Xeg_U5-HRn

Dorozynski commented 3 years ago

Hi @lrei and @tschleider, please find the image predictions (for XGBoost) in the following folder: https://drive.google.com/drive/u/0/folders/1x1mX6FpymsHf9z8-pXNHaDp3-kAcg-jV

As depiction was not part of the training dataset to be used, no predictions were made for that variable.

I think the image predictions for the direct integration in ADASilk should remain the same (see above; 8th June 2021).

tschleider commented 3 years ago

So now we have predictions based on 1) text descriptions, 2) images and 3) categorical values (XGBoost) and they are all uploaded.

Differences:

rtroncy commented 3 years ago

Thanks for the updates!

@tschleider Can you add this triple for the XGBoostClassifier?

tschleider commented 3 years ago

I changed the counter "/1" and the triple for XGBoostClassifier; the changes will be visible after the next conversion / prediction construct queries.

tschleider commented 2 years ago

https://data.silknow.org/describe/?url=http%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23SoftwareAgent&sid=63 shows the 3 SoftwareAgents we now have. These URIs are good for me. Maybe we can remove the counter '/1', as there will be no other versions.

An activity made by the XGBoost-classifier software agent like http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa SHOULD prov:used the crmdig:D1_Digital_Object object, in this case http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24

All points are now addressed! The counter is removed and the XGBoost-classifier has prov:used set accordingly.
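
Putting the pieces of this thread together, the PROV-based shape of one prediction can be sketched as a handful of triples. This is an illustration using plain string formatting rather than the project's construct queries; the statement and agent URIs below are placeholders, and the exact set of properties in the deployed KG may differ.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PROV = "http://www.w3.org/ns/prov#"


def triple(s: str, p: str, o: str) -> str:
    """Serialise one triple with IRI subject, predicate and object as N-Triples."""
    return f"<{s}> <{p}> <{o}> ."


def prediction_provenance(statement: str, activity: str, agent: str, used: str) -> list:
    """Link a reified prediction to the activity, agent and input that produced it."""
    return [
        triple(statement, RDF + "type", RDF + "Statement"),
        triple(statement, PROV + "wasGeneratedBy", activity),
        triple(activity, RDF + "type", PROV + "Activity"),
        triple(activity, PROV + "wasAssociatedWith", agent),
        triple(activity, PROV + "used", used),
        triple(agent, RDF + "type", PROV + "SoftwareAgent"),
    ]


lines = prediction_provenance(
    statement="http://data.silknow.org/statement/EXAMPLE",  # placeholder URI
    activity="http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa",
    agent="http://data.silknow.org/actor/EXAMPLE-AGENT",  # placeholder URI
    used="http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24",
)
print("\n".join(lines))
```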