Integrate the predictions of the Image and Text analysis modules in the KG using PROV

rtroncy commented 3 years ago

In this Google Doc, we have discussed 3 proposals for modeling the results of the Image and Text analysis modules, which predict values for production place, production time, material, technique and depiction.

In terms of modeling, we have decided to adopt the 3rd variant using PROV.

What we need from @lrei and @DominicClermont is the following information:

The value being predicted: for technique, material and depiction, this will be one of the groups we have defined; for production time, this will probably be a century, which should be converted into a Getty link; for production place, this will probably be a country name, which should be converted into a GeoNames link.
A confidence score, as a real value between 0 and 1.
A string explaining the algorithm being used, including the hyperparameter values, a link to the model being used, etc. This string will be the value of the ecrm:P3_has_note property attached to the prov:Activity.
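The three pieces of information requested above could be checked before integration with something like the following. This is only a minimal sketch: the class name `Prediction`, its field names and the dimension labels are assumptions made for illustration, not the format actually exchanged between the modules.

```python
from dataclasses import dataclass

# The five dimensions named in this issue (labels are illustrative).
DIMENSIONS = {"production place", "production time", "material", "technique", "depiction"}


@dataclass(frozen=True)
class Prediction:
    """One predicted value for one object (hypothetical record layout)."""
    object_uri: str
    dimension: str
    value: str          # a group label, or a Getty / GeoNames link
    confidence: float   # real value between 0 and 1
    explanation: str    # algorithm, hyperparameter values, link to the model

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
        if not self.explanation.strip():
            raise ValueError("the explanation string must not be empty")
```

Rejecting a record early keeps out-of-range scores and empty explanations from reaching the KG.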

Prior to this, @tschleider needs to generate the test files, one for each of the 5 dimensions, that will be used to generate predictions. Below, I create the test file for the Production Time dimension with the following SPARQL query (results):

SELECT DISTINCT ?g ?o ?l
WHERE {
  GRAPH ?g {
     ?o a ecrm:E22_Man-Made_Object .
  }
  ?o rdfs:label ?l .
  ?p ecrm:P108_has_produced ?o .
  OPTIONAL {?p ecrm:P4_has_time-span ?ts}
  FILTER (!bound(?ts))
}
ORDER BY ?g

rtroncy commented 3 years ago

The values of rdfs:comment (for the text) as well as of ecrm:P138i_has_representation/schema:contentUrl should also be added in the SELECT.

tschleider commented 3 years ago

Taking your last comment into account, I changed the above query to the following and adapted it for all 5 dimensions accordingly:

SELECT DISTINCT ?g ?o ?l ?c ?image
WHERE {
  GRAPH ?g {
     ?o a ecrm:E22_Man-Made_Object .
  }
  ?o rdfs:label ?l .
  ?o rdfs:comment ?c . 
  ?o ecrm:P138i_has_representation ?i .
  ?i schema:contentUrl ?image .
  ?p ecrm:P108_has_produced ?o .
  OPTIONAL {?p ecrm:P32_used_general_technique ?ts}
  FILTER (!bound(?ts))
}
ORDER BY ?g
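Adapting the query per dimension amounts to swapping the property that must be missing. A possible sketch of that templating is shown here; only the two properties that appear in this thread are filled in, and the names `QUERY_TEMPLATE`, `MISSING_PROPERTY` and `build_query` are illustrative, not taken from the actual scripts. Prefix declarations are assumed to be provided by the endpoint.

```python
# Doubled braces are literal braces in str.format templates.
QUERY_TEMPLATE = """SELECT DISTINCT ?g ?o ?l ?c ?image
WHERE {{
  GRAPH ?g {{
    ?o a ecrm:E22_Man-Made_Object .
  }}
  ?o rdfs:label ?l .
  ?o rdfs:comment ?c .
  ?o ecrm:P138i_has_representation ?i .
  ?i schema:contentUrl ?image .
  ?p ecrm:P108_has_produced ?o .
  OPTIONAL {{ ?p {missing_property} ?v }}
  FILTER (!bound(?v))
}}
ORDER BY ?g
"""

# Only the properties shown in this thread; the other three dimensions
# would need their own entries.
MISSING_PROPERTY = {
    "time": "ecrm:P4_has_time-span",
    "technique": "ecrm:P32_used_general_technique",
}


def build_query(dimension: str) -> str:
    """Return the test-file query for objects lacking a value for `dimension`."""
    return QUERY_TEMPLATE.format(missing_property=MISSING_PROPERTY[dimension])
```

For example, `build_query("time")` yields the query whose OPTIONAL clause tests `ecrm:P4_has_time-span`.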

Here is the link to all the CSV files on Google Drive: https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf

rtroncy commented 3 years ago

Thanks @tschleider! Can you provide some stats on how many objects there are in each test file (so per dimension), and maybe per museum as well? A table in a GitHub comment would do the job, with 1 museum per row and each of the 5 dimensions in columns.

tschleider commented 3 years ago

Museum	Material	Depictions	Place	Time	Technique
VAM	1	7488	231	0	3384
Garin	589	2951	0	31	132
MFA	211	2478	0	241	2322
Venezia	3	1156	827	0	1149
Mobilier	0	964	964	173	960
MET	1	835	0	4	432
MTMAD	650	650	650	650	650
IMATEX	0	509	932	61	5
UNIPA	438	438	410	1	1
CERES	0	191	527	0	48
Smithsonian	0	146	0	0	146
Paris Musees	0	64	0	0	76
Versailles	0	56	69	0	69
Joconde	0	12	4	0	374
TOTAL	1893	17938	4715	1161	9748

EDIT: I just added the Python script that creates these stats out of the CSV files that are now on Google Drive.
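For readers without access to that script, a per-museum count like the one in the table could be computed roughly as follows. This is a sketch under assumptions: it supposes each CSV has the query's `g` (graph) and `o` (object) columns and that the museum name is the last path segment of the graph URI; the function names are made up for illustration and this is not the script mentioned above.

```python
import csv
import io
from collections import Counter


def count_objects_per_museum(csv_text, graph_column="g", object_column="o"):
    """Count distinct objects per museum in one test file."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the graph URI ends with the museum name.
        museum = row[graph_column].rstrip("/").rsplit("/", 1)[-1]
        key = (museum, row[object_column])
        if key not in seen:
            seen.add(key)
            counts[museum] += 1
    return counts


def stats_table(files):
    """Build {museum: {dimension: count}} from {dimension: csv_text}."""
    per_dimension = {dim: count_objects_per_museum(text) for dim, text in files.items()}
    museums = sorted({m for counts in per_dimension.values() for m in counts})
    # A Counter returns 0 for museums absent from a given dimension.
    return {m: {dim: per_dimension[dim][m] for dim in files} for m in museums}
```

Counting distinct objects matters because an object with several labels or images appears on several rows of the query result.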

lrei commented 3 years ago

I'm fixing this on my side, so it's no big problem, but these should probably be filed as "non-urgent" issues, if only for documentation's sake:

File text encoding is messed up. Reading as UTF-8 crashes, while ISO-8859-1 results in "bad" characters (running ftfy fixes it on my end).
Field names are different from total_post.csv (train data) which means I need to interpret them and convert them.
"l" and "c" both seem like text fields (I'm merging them)

tschleider commented 3 years ago

@lrei :

It's indeed not in UTF-8, but it's easy to fix that (just open them with any editor as UTF-8 and save again)
I can change the field names to align with total_post.csv
That's correct

I will have another look and try to provide data that's easier to use.

lrei commented 3 years ago

@tschleider I'll use this, so it's fine. If you do look into it, I suggest you prioritise the encoding issue :)
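The encoding symptoms reported above (a crash with UTF-8, "bad" characters with ISO-8859-1) can be handled on the reading side roughly like this. The sketch uses only the standard library and covers just the simplest case of UTF-8 text that was wrongly decoded as ISO-8859-1; it is a stand-in for illustration, not a replacement for ftfy, and the function names are hypothetical.

```python
def decode_csv_bytes(raw: bytes) -> str:
    """Decode as UTF-8, falling back to ISO-8859-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never raises.
        return raw.decode("iso-8859-1")


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was decoded as ISO-8859-1 (e.g. 'Ã©' back to 'é')."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not this kind of mojibake; leave it untouched.
        return text
```

Text that is already correct round-trips unchanged, so the repair can be applied to every field without first detecting which ones are damaged.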

tschleider commented 3 years ago

The SPARQL queries that can generate the CSV files are at https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf and on GitHub.

I adjusted the variable names, but you can now easily replace them if something is still off. I checked the UTF-8 problem: the files are originally encoded correctly, so maybe Google Drive messes with the encoding. In that case, just run the queries yourself (or convert the files).

Dorozynski commented 3 years ago

I prepared the predictions of the image classification module for the test samples. All predictions can be found in the folder https://drive.google.com/drive/u/0/folders/19vjS282Icr49Xd1YpnpA8CgY3ja0HQ-g, where the predictions for the five variables are contained in the five "sys_integrationpred[variable].csv" files.

tschleider commented 3 years ago

A first version (of the integration of the text analysis for predicting missing values) has been deployed, see for example https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c

For the integration of the predictions from the image analysis, I just have a small issue with the depiction property.

The predictions for the various properties are directly attached to the production resource.
The URI policy for the prediction is as follows, I will add it to the general file if we can agree on it:

Instantiation of the rdf:Statement class: http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int], http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int], http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]
Instantiation of the prov:SoftwareAgent class: http://data.silknow.org/actor/luh-image-analysis/[int], http://data.silknow.org/actor/jsi-text-analysis/[int], http://data.silknow.org/actor/silknow-joint-analysis/[int] (Comment: I think we don't need [int] here, because this is ultimately just for the explanation of the algorithm / prediction method, which still needs to be added.)
Instantiation of prov:Activity: http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int]/generation, http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int]/generation, http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]/generation
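The URI policy proposed above can be sketched as three small builders. The function names and the example property type "time" are assumptions made for illustration; only the URI patterns themselves come from the proposal.

```python
BASE = "http://data.silknow.org"

# Modality in the prediction URI mapped to the agent slug from the proposal.
AGENT_SLUGS = {
    "text": "jsi-text-analysis",
    "image": "luh-image-analysis",
    "combined": "silknow-joint-analysis",
}


def statement_uri(object_uuid, modality, property_type, index):
    """URI of the rdf:Statement for one prediction."""
    if modality not in AGENT_SLUGS:
        raise ValueError(f"unknown modality: {modality!r}")
    return f"{BASE}/prediction/{object_uuid}/{modality}/{property_type}/{index}"


def activity_uri(object_uuid, modality, property_type, index):
    """URI of the prov:Activity: the statement URI plus '/generation'."""
    return statement_uri(object_uuid, modality, property_type, index) + "/generation"


def agent_uri(modality, index):
    """URI of the prov:SoftwareAgent for a modality."""
    return f"{BASE}/actor/{AGENT_SLUGS[modality]}/{index}"
```

With these patterns, the statement and its activity share everything except the final path segment, so one can be derived from the other.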

I also updated the ontology with regard to L18 (new "has_confidence_score" property), but we still need to update the dereferencing rules (should be there by Monday). It's implemented like this: https://data.silknow.org/describe/?url=http%3A%2F%2Fdata.silknow.org%2Fontology%2Fproperty%2FL18&sid=669

rtroncy commented 3 years ago

Review:

I think you have super mega complicated the URI pattern for rdf:Statement. I would opt for something super simpler as there is no semantics in this statement (this is a reification). I would go for: <http://data.silknow.org/statement/[NEW_UUID]> Of course, we need to define the seed to generate this NEW_UUID.
I think you have also complicated the URI pattern for the prov:Activity. In short, I don't think you need the [property_type] since the int will play this role already and we don't want necessarily to separate in the URI the prediction from textual analysis for material or for technique. Same for this generation you append. Of course, this needs to have a robust counter for the [int]. How is this implemented in practice? How is this [int] being valued?
You do need an [int] for the instance of prov:SoftwareAgent. Simply because there will be different http://data.silknow.org/actor/luh-image-analysis/, e.g. one with hyper parameter x=i and another one with x=k, etc.
Indeed, you need to add a dereferencing rule for https://data.silknow.org/ontology/ ! Take care of the URI, since in the GitHub issue you're writing http://data.silknow.org/ontology/property/L18 while in the code, I'm seeing http://data.silknow.org/ontology/L18 ! The former is actually right while it will be changed in the next version of the ontology.
The definition of the L18 property is not right. You're writing that the rdfs:range is an rdf:Statement while it is obviously an xsd:float. Has this property been added on OntoMe?
Do you still have issues with the image analysis prediction integration (with the depiction property)?
Following my nose from https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c, it is the subject of 4 statements which should simply be under the form <http://data.silknow.org/statement/UUID>, see above
For time, where does the 85 (as int) come from in the URI?
The prov:wasGeneratedBy links to the prov:Activity. I can see the text which is prov:used. What value do you consider for the prov:atTime?
It is prov:wasAssociatedWith the Software Agent. However, look at the rdf:type of this resource, the 't' is missing.
What should be the value of the P70_documents property?

tschleider commented 3 years ago

I think you have super mega complicated the URI pattern for rdf:Statement. I would opt for something super simpler as there is no semantics in this statement (this is a reification). I would go for: http://data.silknow.org/statement/[NEW_UUID] Of course, we need to define the seed to generate this NEW_UUID.

This makes sense, but the reason for the complicated pattern is that I'm just re-using an existing one, because otherwise the UUID has to be generated in the Python script that uses the construct query. It would be easiest (from an implementation point of view) to use, for example, the prediction CSV to generate the seed. I could, however, replace /prediction/ with /statement/ and re-use the existing UUID?
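One way to get a reproducible NEW_UUID from a seed is a name-based UUID (version 5), which Python's standard `uuid` module provides. The sketch below is one possible scheme, not a decision: the choice of seed fields, the namespace and the function name are all assumptions.

```python
import uuid

# Hypothetical namespace for statement identifiers, itself derived from a URL.
STATEMENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://data.silknow.org/statement/")


def statement_uri_from_seed(object_uri, predicate, value, agent):
    """Derive a stable statement URI from the fields that identify a prediction."""
    # Assumed seed: the same object, property, value and agent always give the same UUID.
    seed = "|".join((object_uri, predicate, value, agent))
    return f"http://data.silknow.org/statement/{uuid.uuid5(STATEMENT_NAMESPACE, seed)}"
```

Because the UUID depends only on the seed, re-running the post-processing scripts would regenerate the same URIs without any counter shared across scripts.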

I think you have also complicated the URI pattern for the prov:Activity. In short, I don't think you need the [property_type] since the int will play this role already and we don't want necessarily to separate in the URI the prediction from textual analysis for material or for technique. Same for this generation you append. Of course, this needs to have a robust counter for the [int]. How is this implemented in practice? How is this [int] being valued?

Yes, OK, an explanation is missing. The reason I did this is that I have one script per property, and the count always starts from 0 with the first row of each CSV, so it's not per object. I tried to keep it simple as it's just a post-processing script. If you want it to collapse correctly, maybe it should not have a count either? Otherwise we need some more sophisticated coding to keep track of everything across scripts / merge all CSVs, sort the rows and make a super script out of it etc.

You do need an [int] for the instance of prov:SoftwareAgent. Simply because there will be different http://data.silknow.org/actor/luh-image-analysis/, e.g. one with hyper parameter x=i and another one with x=k, etc.

Here, I didn't know we would have different instances for different parameters, but that's no problem, apart from what I just wrote above.

Indeed, you need to add a dereferencing rule for https://data.silknow.org/ontology/ ! Take care of the URI, since in the GitHub issue you're writing http://data.silknow.org/ontology/property/L18 while in the code, I'm seeing http://data.silknow.org/ontology/L18 ! The former is actually right while it will be changed in the next version of the ontology.

Done for dereferencing. For the other part I will take more care indeed, but any change here is relatively simple.

The definition of the L18 property is not right. You're writing that the rdfs:range is an rdf:Statement while it is obviously an xsd:float. Has this property been added on OntoMe?

I was imitating the style of the ontology, where every property has a class as rdfs:range, and I also checked the definition here: https://www.w3.org/TR/rdf-schema/#ch_range . "rdfs:range is an instance of rdf:Property that is used to state that the values of a property are instances of one or more classes." Maybe I got something wrong, but I don't understand how it would be xsd:float?

Do you still have issues with the image analysis prediction integration (with the depiction property)?

It's not a real issue, it just takes the longest time and the scripts are not very optimised.

Following my nose from https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c, it is the subject of 4 statements which should simply be under the form http://data.silknow.org/statement/UUID, see above

I understand the problem. I wrote everything I can say about it above. Let me know what you think.

For time, where does the 85 (as int) come from in the URI?

It's just row 85 in the time prediction csv.

The prov:wasGeneratedBy links to the prov:Activity. I can see the text which is prov:used. What value do you consider for the prov:atTime?

Just the creation time of the prediction CSVs.

It is prov:wasAssociatedWith the Software Agent. However, look at the rdf:type of this resource, the 't' is missing.

Right! Will be fixed.

What should be the value of the P70_documents property?

I think we agreed on an explanation from LUH and JSI? I haven't asked them for it yet, as I thought they could do this once the rest stands.

tschleider commented 3 years ago

I updated the test files for the prediction. Just as last time they are on Google Drive: Link
The SPARQL queries for these files are in this repo: Link
Statistics can be generated with a Python script (CSV files need to be in the same folder): Link

Let me know if you have any questions. Unless there are bigger changes again in the KG, you will be able to re-generate all these files as often as you want for a while now.

tschleider commented 3 years ago

@lrei @Dorozynski Luis made me aware of an issue: I kept the first version of the predictions in the KG when I generated the new file(s). Sorry for the inconvenience, I will fix that today and send the new file(s).

tschleider commented 3 years ago

@lrei @Dorozynski I updated the CSV and created new test files for the predictions.

See here: https://github.com/silknow/converter/tree/master/jointtextimagemodule

The test files for the predictions are in the folder (the CSV files), and the usual total_post.csv is inside the results.7z file (together with lots of statistics CSVs).

Dorozynski commented 3 years ago

@tschleider I generated the predictions for the test files that can be found here: https://drive.google.com/drive/u/0/folders/1ytuUWiyaW5v1jkDqVUQdDhMa2i51MKpi

Details about the hyperparameter tuning and the dataset can be found in D6.7 in chapter 3 (soon available).

lrei commented 3 years ago

@tschleider As you requested, I'm commenting on this issue. I've also sent the e-mail and notified you on Slack:

https://drive.google.com/drive/folders/1lCu_CtOs9Fg63T6qXB2Dj3Xeg_U5-HRn

Dorozynski commented 3 years ago

Hi @lrei and @tschleider, please find the image predictions (for XGBoost) in the following folder: https://drive.google.com/drive/u/0/folders/1x1mX6FpymsHf9z8-pXNHaDp3-kAcg-jV

As depiction was not part of the training dataset to be used, no predictions were made for that variable.

I think the image predictions for the direct integration in ADASilk should remain the same (see above; 8th June 2021).

tschleider commented 3 years ago

So now we have predictions based on 1) text descriptions, 2) images and 3) categorical values (XGBoost) and they are all uploaded.

Differences:

Different values for ecrm:P70_documents of prov:SoftwareAgent, according to the descriptions @Dorozynski and @lrei sent us. (Example)
Different URIs for prov:SoftwareAgent. I didn't know what to pick for the categorical value / XGBoost one (for now I have "http://data.silknow.org/actor/XGBoost-classifier/1"), so have a look and let me know what I should choose: Example
For the categorical-value predictions, I didn't know what to pick as the value for prov:used. I know all (other) categorical values are used, but I wasn't sure how to refer to them here. Therefore I removed it for now (for the XGBoost / categorical predictions).

rtroncy commented 3 years ago

Thanks for the updates!

https://data.silknow.org/describe/?url=http%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23SoftwareAgent&sid=63 shows the 3 SoftwareAgents we now have. These URIs are good for me. Maybe we can remove the counter '/1', as there will be no other versions.
An activity carried out by the luh-image-analysis software agent, like http://data.silknow.org/activity/527c9c94-95de-5458-8423-af0dddc4c087, prov:used an image, like https://silknow.org/silknow/media/garin/GP00491_REV.jpg
An activity carried out by the jsi-text-analysis software agent, like http://data.silknow.org/activity/db89216d-a62a-5fa2-8ba0-d14469f8afad, prov:used a text, like: "Deep scrolling border on red satin fabric with patterned white stripes.@en"
An activity carried out by the XGBoost-classifier software agent, like http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa, SHOULD have prov:used pointing to the crmdig:D1_Digital_Object, in this case http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24

@tschleider Can you add this triple for the XGBoostClassifier?
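The three prov:used cases listed above differ only in the kind of object: an image URL, a language-tagged text, or the object URI. As an illustration, they could be serialised as N-Triples lines like this; the function names are hypothetical and this is not the construct query used in the converter.

```python
PROV_USED = "http://www.w3.org/ns/prov#used"


def ntriples_literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def used_triple(activity_uri, agent, source, lang="en"):
    """Return the prov:used triple for an activity of the given software agent."""
    if agent in ("luh-image-analysis", "XGBoost-classifier"):
        # The image URL or the crmdig:D1_Digital_Object URI is a resource.
        obj = f"<{source}>"
    elif agent == "jsi-text-analysis":
        # The description text is a language-tagged literal.
        obj = f"{ntriples_literal(source)}@{lang}"
    else:
        raise ValueError(f"unknown software agent: {agent!r}")
    return f"<{activity_uri}> <{PROV_USED}> {obj} ."
```

For the XGBoost case, `source` would be the object URI, for example http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24.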

tschleider commented 3 years ago

I removed the counter "/1" and added the triple for the XGBoostClassifier; both will be visible after the next conversion / prediction construct queries.

tschleider commented 2 years ago

https://data.silknow.org/describe/?url=http%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23SoftwareAgent&sid=63 shows the 3 SoftwareAgents we now have. These URIs are good for me. Maybe we can remove the counter '/1', as there will be no other versions.

An activity carried out by the XGBoost-classifier software agent, like http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa, SHOULD have prov:used pointing to the crmdig:D1_Digital_Object, in this case http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24

All points are now addressed! The counter is removed and the XGBoost-classifier has prov:used set accordingly.

silknow / converter

Integrate the predictions of the Image and Text analysis modules in the KG using PROV #68