The values of rdfs:comment (for the text) as well as of ecrm:P138i_has_representation/schema:contentUrl should also be added to the SELECT.
Taking your last comment into account, I changed the above query to the following and adapted it accordingly for all 5 dimensions:
```sparql
SELECT DISTINCT ?g ?o ?l ?c ?image
WHERE {
  GRAPH ?g {
    ?o a ecrm:E22_Man-Made_Object .
  }
  ?o rdfs:label ?l .
  ?o rdfs:comment ?c .
  ?o ecrm:P138i_has_representation ?i .
  ?i schema:contentUrl ?image .
  ?p ecrm:P108_has_produced ?o .
  # Keep only objects whose production has no technique recorded:
  OPTIONAL { ?p ecrm:P32_used_general_technique ?ts }
  FILTER (!bound(?ts))
}
```
Here is the link to all the CSV files on Google Drive: https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf
Thanks @tschleider! Can you provide some stats on how many objects there are in each test file (so per dimension), and maybe per museum as well? A table in a GitHub comment would do the job, with one museum per row and each of the 5 dimensions as columns.
Museum | Material | Depictions | Place | Time | Technique |
---|---|---|---|---|---|
VAM | 1 | 7488 | 231 | 0 | 3384 |
Garin | 589 | 2951 | 0 | 31 | 132 |
MFA | 211 | 2478 | 0 | 241 | 2322 |
Venezia | 3 | 1156 | 827 | 0 | 1149 |
Mobilier | 0 | 964 | 964 | 173 | 960 |
MET | 1 | 835 | 0 | 4 | 432 |
MTMAD | 650 | 650 | 650 | 650 | 650 |
IMATEX | 0 | 509 | 932 | 61 | 5 |
UNIPA | 438 | 438 | 410 | 1 | 1 |
CERES | 0 | 191 | 527 | 0 | 48 |
Smithsonian | 0 | 146 | 0 | 0 | 146 |
Paris Musees | 0 | 64 | 0 | 0 | 76 |
Versailles | 0 | 56 | 69 | 0 | 69 |
Joconde | 0 | 12 | 4 | 0 | 374 |
TOTAL | 1893 | 17938 | 4715 | 1161 | 9748 |
EDIT: I just added the Python script that generates these stats from the CSV files that are now on Google Drive.
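For a single dimension, the same per-museum counts can also be read directly off the endpoint with an aggregate query. A sketch for the technique dimension, assuming one named graph per museum (the published numbers above come from the Python script, not from this query):

```sparql
PREFIX ecrm: <http://erlangen-crm.org/current/>   # assumed namespace binding
SELECT ?g (COUNT(DISTINCT ?o) AS ?objects)
WHERE {
  GRAPH ?g { ?o a ecrm:E22_Man-Made_Object . }
  ?p ecrm:P108_has_produced ?o .
  # Objects still missing a technique, i.e. the technique test set:
  FILTER NOT EXISTS { ?p ecrm:P32_used_general_technique ?ts }
}
GROUP BY ?g
ORDER BY DESC(?objects)
```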
I'm fixing this on my side, so it's no big problem, but these should probably be filed as "non-urgent" issues, if for nothing else than documentation's sake:
@lrei:
I will have another look and try to provide data that's easier to use.
@tschleider I'll use this, so it's fine. If you do look into it, I suggest you prioritise the encoding issue :)
The SPARQL queries that can generate the CSV files are at https://drive.google.com/drive/folders/1hXcmAMSsTmq22nUT5HIyXfnfeEewrklf and on GitHub.
I adjusted the variable names, but you can easily replace them now if something is still off. I checked the UTF-8 problem: the files are originally encoded correctly, so maybe Google Drive messes with them; in that case just run the queries yourself (or convert the files).
I prepared the predictions of the image classification module for the test samples. All predictions can be found in the folder https://drive.google.com/drive/u/0/folders/19vjS282Icr49Xd1YpnpA8CgY3ja0HQ-g, where the predictions for the five variables are contained in the five sys_integration_pred_[variable].csv files.
A first version of the integration of the text analysis (for predicting missing values) has been deployed; see for example https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c
For the integration of the predictions from the image analysis, I just have a small issue with the depiction property.
Instantiation of the rdf:Statement class:

- http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int]
- http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int]
- http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]

Instantiation of the prov:SoftwareAgent class:

- http://data.silknow.org/actor/luh-image-analysis/[int]
- http://data.silknow.org/actor/jsi-text-analysis/[int]
- http://data.silknow.org/actor/silknow-joint-analysis/[int]

Instantiation of prov:Activity:

- http://data.silknow.org/prediction/[UUID_object]/text/[property_type]/[int]/generation
- http://data.silknow.org/prediction/[UUID_object]/image/[property_type]/[int]/generation
- http://data.silknow.org/prediction/[UUID_object]/combined/[property_type]/[int]/generation
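To make the pattern concrete, a single reified prediction under this scheme might look as follows (a sketch: the UUIDs, the technique concept URI and the score are illustrative, and L18 is assumed to carry the prediction score, as discussed further down):

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX ecrm: <http://erlangen-crm.org/current/>   # assumed namespace binding
INSERT DATA {
  # Hypothetical prediction no. 3 from the text module for an illustrative
  # object UUID; the technique concept URI and the score are made up.
  <http://data.silknow.org/prediction/0f0f0f0f-0000-0000-0000-000000000000/text/technique/3>
      a rdf:Statement ;
      rdf:subject   <http://data.silknow.org/object/0f0f0f0f-0000-0000-0000-000000000000> ;
      rdf:predicate ecrm:P32_used_general_technique ;
      rdf:object    <http://data.silknow.org/vocabulary/EXAMPLE-TECHNIQUE> ;
      <http://data.silknow.org/ontology/property/L18> "0.87"^^xsd:float .
}
```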
Review:

- I think you have super mega complicated the URI pattern for rdf:Statement. I would opt for something super simple, as there is no semantics in this statement (it is a reification). I would go for <http://data.silknow.org/statement/[NEW_UUID]>. Of course, we need to define the seed to generate this NEW_UUID.
- I think you have also complicated the URI pattern for the prov:Activity. In short, I don't think you need the [property_type], since the [int] will play this role already, and we don't necessarily want the URI to separate the prediction from textual analysis for material from the one for technique. The same goes for this /generation you append. Of course, this needs a robust counter for the [int]. How is this implemented in practice? How is this [int] being valued?
- You do need an [int] for the instances of prov:SoftwareAgent, simply because there will be different http://data.silknow.org/actor/luh-image-analysis/ agents, e.g. one with hyper parameter x=i and another one with x=k, etc.
- Indeed, you need to add a dereferencing rule for https://data.silknow.org/ontology/ ! Take care of the URI: in the GitHub issue you're writing http://data.silknow.org/ontology/property/L18, while in the code I'm seeing http://data.silknow.org/ontology/L18 ! The former is actually right, while it will be changed in the next version of the ontology.
- The definition of the L18 property is not right. You're writing that the rdfs:range is rdf:Statement, while it is obviously xsd:float. Has this property been added on OntoMe?
- Do you still have issues with the image analysis prediction integration (with the depiction property)?
- Following my nose from https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c, it is the subject of 4 statements, which should simply be of the form <http://data.silknow.org/statement/UUID>, see above.
- For time, where does the 85 (as int) come from in the URI?
- The prov:wasGeneratedBy links to the prov:Activity. I can see the text which is prov:used. What value do you consider for prov:atTime?
- It is prov:wasAssociatedWith the Software Agent. However, look at the rdf:type of this resource: the 't' is missing.
- What should be the value of the P70_documents property?
> I think you have super mega complicated the URI pattern for rdf:Statement. I would opt for something super simple, as there is no semantics in this statement (it is a reification). I would go for http://data.silknow.org/statement/[NEW_UUID]. Of course, we need to define the seed to generate this NEW_UUID.

This makes sense, but the reason for the complicated pattern is that I'm just re-using an existing one, because otherwise the UUID has to be generated in the Python script that runs the CONSTRUCT query. It would be easiest (from an implementation point of view) to use the prediction CSV, e.g., to generate the seed. I could, however, replace /prediction/ with /statement/ and re-use the existing UUID?
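If the statement URI were minted deterministically from the statement's content, the CONSTRUCT query itself could generate it, with no UUID bookkeeping in the Python script. A sketch using SHA1 over (object, property, value); ex:predictedTechnique is a hypothetical staging property standing in for wherever the CSV predictions are loaded:

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/current/>   # assumed namespace binding
PREFIX ex:   <http://example.org/>                 # hypothetical staging vocabulary
CONSTRUCT {
  ?stmt a rdf:Statement ;
        rdf:subject   ?o ;
        rdf:predicate ecrm:P32_used_general_technique ;
        rdf:object    ?predicted .
}
WHERE {
  ?o ex:predictedTechnique ?predicted .
  # Same (object, property, value) always hashes to the same URI, so
  # re-running the script cannot mint duplicate statement URIs.
  BIND(IRI(CONCAT("http://data.silknow.org/statement/",
                  SHA1(CONCAT(STR(?o), "P32", STR(?predicted))))) AS ?stmt)
}
```

A content-hashed URI would also make a cross-script [int] counter unnecessary, since the minting is order-independent.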
> I think you have also complicated the URI pattern for the prov:Activity. In short, I don't think you need the [property_type], since the [int] will play this role already, and we don't necessarily want the URI to separate the prediction from textual analysis for material from the one for technique. The same goes for this /generation you append. Of course, this needs a robust counter for the [int]. How is this implemented in practice? How is this [int] being valued?

Yes, OK, this lacked an explanation. The reason I did it this way is that I have one script per property, and the count is always initialised from 0, starting with the first row of each CSV, so it's not per object. I tried to keep it simple, as it's just a post-processing script. If you want it to collapse correctly, maybe it should not have a count at all? Otherwise we need some more sophisticated code to keep track of everything across scripts, or to merge all the CSVs, sort the rows, and make one super-script out of it, etc.
> You do need an [int] for the instances of prov:SoftwareAgent, simply because there will be different http://data.silknow.org/actor/luh-image-analysis/ agents, e.g. one with hyper parameter x=i and another one with x=k, etc.

Here I didn't know we would have different instances for different parameters, but that's no problem, except for what I just wrote above.
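For illustration, two such agent instances differing only in a hyper parameter might be declared like this (a sketch: the note values are made up, using the x=i / x=k example from the review, and ecrm:P3_has_note is used for the configuration description as discussed later in this thread):

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ecrm: <http://erlangen-crm.org/current/>   # assumed namespace binding
INSERT DATA {
  # Same software, two configurations: only the trailing counter differs.
  <http://data.silknow.org/actor/luh-image-analysis/1>
      a prov:SoftwareAgent ;
      ecrm:P3_has_note "image classification, hyper parameter x=i" .
  <http://data.silknow.org/actor/luh-image-analysis/2>
      a prov:SoftwareAgent ;
      ecrm:P3_has_note "image classification, hyper parameter x=k" .
}
```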
> Indeed, you need to add a dereferencing rule for https://data.silknow.org/ontology/ ! Take care of the URI: in the GitHub issue you're writing http://data.silknow.org/ontology/property/L18, while in the code I'm seeing http://data.silknow.org/ontology/L18 ! The former is actually right, while it will be changed in the next version of the ontology.

Done for the dereferencing. For the other part I will indeed take more care, but any change here is relatively simple.
> The definition of the L18 property is not right. You're writing that the rdfs:range is rdf:Statement, while it is obviously xsd:float. Has this property been added on OntoMe?

I was imitating the style of the ontology, where every property has a class as rdfs:range, and I also checked the definition here: https://www.w3.org/TR/rdf-schema/#ch_range : "rdfs:range is an instance of rdf:Property that is used to state that the values of a property are instances of one or more classes." Maybe I got something wrong, but I don't understand how it would be xsd:float?
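The quoted definition is actually compatible with the review's point: XSD datatypes such as xsd:float are themselves classes (instances of rdfs:Datatype, which is a subclass of rdfs:Class), so a datatype is a legal rdfs:range value. Since L18 carries the numeric score of a prediction (the reified statement is the subject of L18, not its value), its range should be that datatype. A minimal sketch of the corrected definition:

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
  # L18's values are the float scores attached to reified predictions.
  <http://data.silknow.org/ontology/property/L18>
      a rdf:Property ;
      rdfs:range xsd:float .
}
```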
> Do you still have issues with the image analysis prediction integration (with the depiction property)?

It's not a real issue; it just takes the longest time, and the scripts are not very optimised.
> Following my nose from https://data.silknow.org/production/fdda3160-1cc6-30cc-8456-b54d04a5bd7c, it is the subject of 4 statements, which should simply be of the form http://data.silknow.org/statement/UUID, see above.

I understand the problem. I wrote everything I can say about it above. Let me know what you think.
> For time, where does the 85 (as int) come from in the URI?

It's just row 85 in the time prediction CSV.
> The prov:wasGeneratedBy links to the prov:Activity. I can see the text which is prov:used. What value do you consider for prov:atTime?

Just the creation time of the prediction CSVs.
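Assembled, the provenance chain discussed in these points might look like this (a sketch: the statement and activity URIs and the timestamp are placeholders; only the object URI is the real example from above). Note that strict PROV-O attaches prov:atTime to instantaneous events and would use prov:startedAtTime / prov:endedAtTime on an activity; the thread's pattern is kept here:

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
INSERT DATA {
  <http://data.silknow.org/statement/00000000-0000-0000-0000-000000000000>
      prov:wasGeneratedBy
      <http://data.silknow.org/activity/00000000-0000-0000-0000-000000000001> .
  <http://data.silknow.org/activity/00000000-0000-0000-0000-000000000001>
      a prov:Activity ;
      prov:used <http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24> ;
      prov:atTime "2021-06-08T12:00:00Z"^^xsd:dateTime ;   # illustrative CSV creation time
      prov:wasAssociatedWith <http://data.silknow.org/actor/luh-image-analysis/1> .
}
```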
> It is prov:wasAssociatedWith the Software Agent. However, look at the rdf:type of this resource: the 't' is missing.
Right! Will be fixed.
> What should be the value of the P70_documents property?
I think we agreed on an explanation from LUH and JSI? I didn't ask them for it yet, as I thought this can be done by them if the rest stands.
I updated the test files for the prediction. Just as last time, they are on Google Drive: Link
The SPARQL queries for these files are in this repo: Link
Statistics can be generated with a Python script (the CSV files need to be in the same folder): Link
Let me know if you have any questions. Unless there are bigger changes in the KG again, you will be able to re-generate all these files as often as you want for a while now.
@lrei @Dorozynski Luis made me aware of an issue: I kept the first version of the predictions in the KG when I generated the new file(s). Sorry for the inconvenience; I will fix that today and send the new file(s).
@lrei @Dorozynski I updated the CSV and created new test files for the predictions.
See here: https://github.com/silknow/converter/tree/master/jointtextimagemodule
The test-file predictions are in that folder (the CSV files), and the usual total_post.csv is inside the results.7z file (together with lots of statistics CSVs).
@tschleider I generated the predictions for the test files that can be found here: https://drive.google.com/drive/u/0/folders/1ytuUWiyaW5v1jkDqVUQdDhMa2i51MKpi
Details about the hyper parameter tuning and the dataset can be found in D6.7, chapter 3 (available soon).
@tschleider As you requested, I'm commenting on this issue. I've also sent the e-mail and notified you on Slack:
https://drive.google.com/drive/folders/1lCu_CtOs9Fg63T6qXB2Dj3Xeg_U5-HRn
Hi @lrei and @tschleider, please find the image predictions (for XGBoost) in the following folder: https://drive.google.com/drive/u/0/folders/1x1mX6FpymsHf9z8-pXNHaDp3-kAcg-jV
As depiction was not part of the training dataset to be used, no predictions were made for that variable.
I think the image predictions for the direct integration in ADASilk should remain the same (see above; 8th June 2021).
So now we have predictions based on 1) text descriptions, 2) images and 3) categorical values (XGBoost) and they are all uploaded.
Differences:
Thanks for the updates!
- prov:used an image like https://silknow.org/silknow/media/garin/GP00491_REV.jpg
- prov:used a text like "Deep scrolling border on red satin fabric with patterned white stripes."@en
- prov:used the crmdig:D1_Digital_Object object, in this case http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24

@tschleider Can you add this triple for the XGBoostClassifier?
I changed the counter "/1" and the triple for the XGBoostClassifier; this will be visible after the next run of the conversion / prediction construct queries.
https://data.silknow.org/describe/?url=http%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23SoftwareAgent&sid=63 shows the 3 SoftwareAgents we now have. These URIs are good for me. Maybe we can remove the counter '/1', as there will be no other versions.
An activity made by the XGBoost-classifier software agent, like http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa, SHOULD prov:used the crmdig:D1_Digital_Object object, in this case http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24
All points are now addressed! The counter is removed and the XGBoost-classifier has prov:used set accordingly.
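For the record, the added triple in SPARQL Update form (both URIs are the examples quoted in the previous comment):

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
INSERT DATA {
  <http://data.silknow.org/activity/18a2e08d-0aba-56a3-baee-c9b7b2f33caa>
      prov:used
      <http://data.silknow.org/object/7d537eb7-7573-3646-b57a-d5c301584d24> .
}
```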
In this Google doc, we have discussed 3 proposals for modeling the results of the Image and Text analysis modules in their effort to predict values for production place, production time, material, technique and depiction.
In terms of modeling, we have decided to adopt the 3rd variant using PROV.
What we need from @lrei and @DominicClermont is the following information: the value of the ecrm:P3_has_note property attached to the prov:Activity.

Prior to this, @tschleider needs to generate the test files, for each of the 5 dimensions, that will be used to generate predictions. I create below the test file for the Production Time dimension with the following SPARQL query (results):
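The query itself is not reproduced in this thread. By analogy with the technique query above, a plausible shape for the Production Time test file selects objects whose production lacks a time-span (a sketch only: ecrm:P4_has_time-span is the standard CIDOC-CRM property, the prefix bindings are assumed, and the actual query may differ):

```sparql
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ecrm:   <http://erlangen-crm.org/current/>   # assumed namespace binding
PREFIX schema: <http://schema.org/>
SELECT DISTINCT ?g ?o ?l ?c ?image
WHERE {
  GRAPH ?g { ?o a ecrm:E22_Man-Made_Object . }
  ?o rdfs:label ?l ;
     rdfs:comment ?c ;
     ecrm:P138i_has_representation ?i .
  ?i schema:contentUrl ?image .
  ?p ecrm:P108_has_produced ?o .
  # Productions with no recorded time-span are the prediction targets:
  FILTER NOT EXISTS { ?p ecrm:P4_has_time-span ?span }
}
```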