vsimko / gama-gateway

Gama Gateway RDF Repository and GAMA data model

How to query with harmonized RDF data inserted. #40

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Now or soon, harmonization results will be imported into the RDF database in their own RDF graph.

When I query for, say, the artists of a work, I may get original names and/or harmonized names.

How do I query with each of the following strategies?
1
-- I want only an 'original' name to be found.

2
-- I want a single name per artist, preferably the harmonized name if it exists, otherwise a single unharmonized name.

3
-- I want only a harmonized name to be found.

Note that in practice scenario 2 is the most realistic one, so 1 and 3 have low priority.
But scenario 2 is definitely needed in order to transparently make use of and display harmonized data in the frontend.

Note: Andree and Piotr are CC'ed only as FYI.

Original issue reported on code.google.com by toon...@gmail.com on 2 Jun 2009 at 2:42

GoogleCodeExporter commented 9 years ago
@Viliam:
This is not an urgent issue, but it would be nice to have some idea of the work involved
(at RDF level, at frontend level) so we can plan it in.
First of all, is the approach described feasible to implement in the SPARQL layer?

Original comment by toon...@gmail.com on 18 Jun 2009 at 11:37

GoogleCodeExporter commented 9 years ago
Take a look at the person with URI "gama:instants:main:Person:65"

http://research.ciant.cz/gama/devel/GamaRepository/endpoint/explore.php?uri=gama%3Ainstants%3Amain%3APerson%3A65

It has already been harmonised. The harmonisation process merged two persons:
- gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3
- gama:instants:main:Person:65

Original comment by viliam.s...@gmail.com on 25 Jun 2009 at 4:08

GoogleCodeExporter commented 9 years ago

I still have some questions. What is the bottom line for the frontend? Is it correct to assume
-- that no adaptation of any queries is needed,
-- that Person:8b2ee1de5cb741928b3839d0f6391ce3 will never be in a query result together with Person:65,
-- that no contradictions/inconsistencies from (faulty) harmonizations can occur?

Also, what is the consequence for graph names? Could these be used as filters for the scenarios listed above, say, when I only want to see harmonized persons?

Original comment by toon...@gmail.com on 1 Jul 2009 at 7:45

GoogleCodeExporter commented 9 years ago
The idea of harmonisation is to merge objects that are equal (e.g. equal persons) so that the frontend doesn't need to construct additional queries.
Consider just persons for the sake of simplicity.

After merging two persons, all their metadata are mixed into a single entity.
For example, the new person would contain two or more person_names.
Then it doesn't matter which name you search for: the repository will always find the URIs of both persons that formed the mixed entity.

In my previous example (Comment 2) about "gama:instants:main:Person:65", the person was merged with "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"
(this is indicated using the http://www.w3.org/2002/07/owl#sameAs property).

You can see 3 person_names coming from 3 different graphs:
- Marcel Broodthaers (coming from the gama:argos:main: graph)
- Marcel Broodthaers (coming from the gama:instants:main: graph)
- Marcél Broodthaers (coming from the http://gama-gateway.eu/harmonisation/ graph)

Now, if you searched for the name "Marcel Broodthaers" using this query:
PREFIX gama: <http://gama-gateway.eu/schema/>
SELECT distinct ?person_uri WHERE {
  ?person_uri gama:person_name ?name.
  FILTER (?name = "Marcel Broodthaers")
}

You would get 2 URIs.
- gama:instants:main:Person:65
- gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3

Also, the following "simple load" queries are equivalent; both will return 3 names:

SIMPLE LOAD
  gama:instants:main:Person:65
PROPERTIES
  gama:person_name

SIMPLE LOAD
  gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3
PROPERTIES
  gama:person_name

In the frontend you can then decide which name should be presented to the end-user,
preferably the name coming from the http://gama-gateway.eu/harmonisation/ graph.
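As a rough illustration of that frontend-side choice, here is a minimal Python sketch covering the three strategies from the original report. It assumes the middleware receives each name together with the graph it came from; the function name, the strategy labels and the (name, graph) tuple shape are hypothetical, not part of the repository API.

```python
# Graph that holds harmonisation results (taken from this thread).
HARMONISATION_GRAPH = "http://gama-gateway.eu/harmonisation/"


def pick_display_name(rows, strategy="prefer_harmonised"):
    """Pick one name for a merged person.

    rows     -- list of (name, graph) tuples (assumed result shape)
    strategy -- 'original_only', 'prefer_harmonised' or 'harmonised_only'
                (hypothetical labels for scenarios 1, 2 and 3)
    Returns a single name, or None when no name matches the strategy.
    """
    harmonised = [name for name, graph in rows if graph == HARMONISATION_GRAPH]
    original = [name for name, graph in rows if graph != HARMONISATION_GRAPH]

    if strategy == "original_only":
        candidates = original
    elif strategy == "harmonised_only":
        candidates = harmonised
    elif strategy == "prefer_harmonised":
        # Fall back to an unharmonised name only if no harmonised one exists.
        candidates = harmonised or original
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return candidates[0] if candidates else None


rows = [
    ("Marcel Broodthaers", "gama:argos:main:"),
    ("Marcel Broodthaers", "gama:instants:main:"),
    ("Marcél Broodthaers", HARMONISATION_GRAPH),
]
print(pick_display_name(rows))
```

With several unharmonised names the sketch simply takes the first one; which of them should win is a decision this thread leaves open.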

Original comment by viliam.s...@gmail.com on 1 Jul 2009 at 8:25

GoogleCodeExporter commented 9 years ago
Answers to your specific questions:

- All existing queries should work without changes (just some "DISTINCT" keywords will be needed in specific situations).

- You just need to decide what should be rendered after you receive results from the repository, based on the graph name.

- You can search for merged objects (such as harmonised persons) using the owl:sameAs property.

- I'm a little bit surprised by your question "that Person:8b2ee1de5cb741928b3839d0f6391ce3 will never both be in a query result with Person:65", because the main reason for harmonisation is to merge objects so that they act as a single object. Therefore query results will always return both URIs (when applicable). But maybe you had some specific scenario in mind. Note that you can always distinguish between data sources using the graph name.

- Inconsistencies/contradictions can occur if the harmonisation tool produces them. The repository is oblivious to the semantics of the harmonisation data; it only merges URIs.

Original comment by viliam.s...@gmail.com on 1 Jul 2009 at 8:39

GoogleCodeExporter commented 9 years ago
If I understand this correctly, I am not happy with this solution. Again a lot of work is placed on the front-end, and now in the final phase of the project. We have stated several times what we would like, but never got any satisfying answers.

So here is what we need in the front end:
- Case: A user is added to a work as an artist
* For the preview I need *one* name, the correct name. I don't care about the other names in this case. The choice here seems easy because of the graph name, and maybe Toon can implement that relatively easily. This requires that there *always* is a http://gama-gateway.eu/harmonisation/ value when there is more than one choice. Always.
* For the detailed view things are more difficult. Which data do I choose, given that not all of it is harmonised as far as I know? The biography is the most notable example. I can handle multiple biographies, but I need them in one person object, because otherwise it will be too confusing in the case of a work with multiple artists.
Example: work x with artists y and z, both harmonised from 3 sources, would get me 6 objects with zero or more biographies, taking different languages and empty biographies into account. How do I know which are for which person unless I start resolving relations?

- Case: You are searching for an artist
* The list of artists
I need a list of all correct names, possibly with harmonised alternatives. Since the incorrect names are not used anymore, it would be strange to add them here.

* The preview
In this case the biography problem is more important because in the preview I can only choose one to show. It would be too much to give choices here.

* The detailed view
Same as the detailed view for works.

So basically: we need *one* object with the correct name and the other corrected & harmonised values, and all the versions of the non-harmonised values, like the biographies. The unharmonised data is not ideal, but I understand that harmonising biographies might be editorially tricky, to say the least.

Anyway, this is what harmonisation does in my world: create one valid alternative for unharmonised data, not add yet another choice and throw all possible values on a heap, saying sort it all out yourself.

Original comment by charles....@kmt.hku.nl on 1 Jul 2009 at 9:14

GoogleCodeExporter commented 9 years ago
OK, I'll try to simplify the work with harmonised data.
The first approach is a service which removes URIs that point to the same entity after harmonisation.

Try this:
https://research.ciant.cz/gama/devel/GamaRepository/soa/?service=query/Remove_Identical_Uris&help

... more to come

Original comment by viliam.s...@gmail.com on 1 Jul 2009 at 7:41

GoogleCodeExporter commented 9 years ago
So how do you feel about this ... would such a service help? And what about query performance? I have the feeling that this could slow things down even more ... but maybe that is not relevant when the portal runs in the same local network.

Would it generally be an option to delegate development of some web service on top of the repository to AGH/Piotr? Mikolaj told me they would be willing. Unfortunately Piotr is on vacation until mid-July.

Original comment by alu...@gmail.com on 3 Jul 2009 at 9:20

GoogleCodeExporter commented 9 years ago
The general problem with services is that they break the current frontend retrieval method based on SPARQL, where data is received in tuples and assembled into objects.

If we had various services that return data in different formats, the query logic and sequence would become unmanageable, and I am not even taking performance into account. The bottom line is that the frontend middleware is built on a SPARQL-based interface, where a query is a sequence of SPARQL queries, all returning tuples.

So my main point is that any solution outside the RDF database would be almost impossible to integrate into the current query process flow. In earlier discussions in Prague I always understood that harmonization would be transparent for the frontend, with the database returning harmonized yet complete persons. As I see it, this is something that should remain completely in the realm of RDF, because it is exactly what RDF data is intended for.

Adding new SPARQL filter functions or custom constructs such as SIMPLE LOAD has solved similar problems earlier on, so I would expect that this issue can also be solved at that level.

But even more preferable would be for harmonization to result in complete persons that (usually) 'shadow' their original persons and have all their properties, something Charles has also suggested earlier.

Original comment by toon...@gmail.com on 3 Jul 2009 at 9:47

GoogleCodeExporter commented 9 years ago
On hiding harmonized data 
======================

I hope to get reactions on this issue from the people involved in harmonization about which path to take to solve it, since it is such a fundamental issue for the harmonization efforts.

For now the immediate issue is that harmonized data shows up in the frontend as duplicate and empty persons. Therefore I propose, for now, to filter all harmonized data out of the frontend. This could be done in 2 ways:

-- by temporarily deleting the complete graph from the database
-- by filtering it out in the frontend queries by adding a 'not graph HARM' statement in SPARQL.

@Viliam, please advise which solution to apply here, such that it can be rolled back easily.

Original comment by toon...@gmail.com on 7 Jul 2009 at 7:43

GoogleCodeExporter commented 9 years ago
If we need to disable harmonisation data just temporarily, I can turn off the mechanism that merges entities based on the owl:sameAs property.
Or I can even ignore the whole harmonisation graph.

Anyway, I understand that there are multiple results in the database due to the harmonisation. But I don't understand why you see empty persons. Harmonisation only merges entities, so they should contain more data than before. Could you provide an example (URI of the empty person)?

Original comment by viliam.s...@gmail.com on 7 Jul 2009 at 8:10

GoogleCodeExporter commented 9 years ago
You are correct, the issue is indeed the multiple occurrence of properties and how to select one, especially for values that are not normalized. I mixed this issue up with an earlier one causing empty works; empty persons are not the issue.

The actual problem and a possible solution are already clearly described in comment 6 by Charles.

@Viliam
Please let us know what you think of Charles' proposed solution in terms of RDF extensions, and how long this would take you to implement. Depending on this, we can then take action on hiding the harmonized data for the time being.

Original comment by toon...@gmail.com on 7 Jul 2009 at 10:29

GoogleCodeExporter commented 9 years ago
Shall we maybe have a call to discuss this? Otherwise I'll put it on the TelCo 
agenda
anyway ...

Original comment by alu...@gmail.com on 9 Jul 2009 at 7:24

GoogleCodeExporter commented 9 years ago
Yes, it is indeed an important issue, but we need not solve it completely in the TelCo. I propose to discuss it briefly in the TelCo, see if there are any recent developments, and then decide afterwards if a second small TelCo on this issue alone would be helpful. Or maybe just me/Charles and Viliam....

Original comment by toon...@gmail.com on 9 Jul 2009 at 7:35

GoogleCodeExporter commented 9 years ago
Colleagues,

May I propose that we discuss this at the TelCo, but just after 15th June? Piotr Romaniak, our harmonization expert, will be back in the office then, and I think we could benefit from his participation in the TC.

Regards,

Mikołaj

Original comment by mikolaj....@gmail.com on 9 Jul 2009 at 6:11

GoogleCodeExporter commented 9 years ago
I'm working on a SPARQL function gama:noEquivalentDuplicates($var) that should 
remove
duplicate results created by the harmonisation mechanism.

Original comment by viliam.s...@gmail.com on 20 Jul 2009 at 9:25

GoogleCodeExporter commented 9 years ago
Viliam, could you be more explicit about the behaviour of the function? What happens in Case A in comment 6 above?

These are the two solutions I was talking about. The first one isn't really that great, but I added it for completeness.
1)
Every user gets a new field, nameHarmonised for example. That one is filled with either the harmonised name or the original name. It is mandatory for each person in the database.

This is the easiest solution to get singular names, but it is inferior to the other solution because, for example, biographies are not gathered together. I am not in favor of this solution.

2)
A new type or subtype is created, the harmonisedUser for example. The name is not 100% fitting because it will also contain info for persons not harmonised. The type is created for all (unique) persons, and is updated when new harmonisation data is entered.

This person type will contain:
- for a harmonised person
* one harmonised name
* all years (although the front end will choose only one)
* all biographies (the front end will display all, depending on language)
* all the works, collectives and other connections of all the equivalent persons (through harmonisation)

- for a person who is not harmonised yet, everything stays the same, but the following are added:
* the original name
* the original biography
* the original year
* the original works and collectives

It is important that it exists for all unique persons, to avoid the same problem where the middleware has to infer and interpret the data. This process can be batch-processed each time during a harmonisation update.

Original comment by charles....@kmt.hku.nl on 20 Jul 2009 at 9:42

GoogleCodeExporter commented 9 years ago
The problem of harmonisation can actually be divided into at least two parts:
- duplicate objects (URIs) that have been merged
- how to get the right metadata (such as person name)

The first problem should now be solved by the new gama:noEquivalentDuplicates(?var) function. After two URIs are merged, the repository picks one of the URIs and makes it the representative of the new equivalence cluster.

For example, if we merge URIs A, B, C, D and E, the repository can pick C as the representative. Our function will restrict the query to the representatives only and will ignore the other merged URIs (the duplicates).

Check out the difference between the following queries (the first has 6 results while the second has just 5).

PREFIX gama: <http://gama-gateway.eu/schema/>
select * {
  <gama:ars-electronica:main:Event:13017> gama:has_creator ?p.
}

PREFIX gama: <http://gama-gateway.eu/schema/>
select * {
  <gama:ars-electronica:main:Event:13017> gama:has_creator ?p.
  FILTER gama:noEquivalentDuplicates(?p)
}
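The representative mechanism described above can be pictured with a small union-find sketch in Python. This is only an illustration under stated assumptions: the thread does not say how the repository chooses the representative, so the rule used here (the lexicographically smallest URI wins) and both function names are hypothetical.

```python
def build_representatives(same_as_pairs):
    """Map every URI mentioned in a sameAs pair to its cluster representative."""
    parent = {}

    def find(uri):
        # Walk up to the root of the cluster, shortening the path on the way.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the smallest URI becomes the representative.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep

    return {uri: find(uri) for uri in list(parent)}


def no_equivalent_duplicates(uris, representatives):
    """Keep representatives and URIs that were never merged; drop duplicates."""
    return [uri for uri in uris if representatives.get(uri, uri) == uri]


same_as = [
    ("gama:instants:main:Person:65",
     "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"),
]
creators = [
    "gama:instants:main:Person:65",
    "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3",
    "gama:example:main:Person:1",  # hypothetical, never merged
]
unique_creators = no_equivalent_duplicates(creators, build_representatives(same_as))
print(len(creators), len(unique_creators))
```

The filter relies on the behaviour stated earlier in this thread: a query that matches a merged entity returns all of its URIs, so the representative is always among the results.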

Original comment by viliam.s...@gmail.com on 20 Jul 2009 at 2:28

GoogleCodeExporter commented 9 years ago
Note: the aforementioned function gama:noEquivalentDuplicates() should be used 
when
counting results or generating a list of objects. An example could be the list 
of
artists of a work.

Original comment by viliam.s...@gmail.com on 21 Jul 2009 at 9:18

GoogleCodeExporter commented 9 years ago
I cannot judge how well this function fits within the middleware, or if it makes another query necessary; Toon is the one who does all the querying. However, I really wonder why it is a better solution to add functions instead of a clean subtype. What is the problem with the, in my eyes, straightforward solution of using the harmonisation data to create a harmonised and correct person type?

Original comment by charles....@kmt.hku.nl on 21 Jul 2009 at 12:23

GoogleCodeExporter commented 9 years ago
I think I can generate an XML file for harmonisedUser as proposed by Charles in comment 17, option 2).

What I can generate from the current harmonization database is:
1. Preferred name of harmonisedUsers
2. All related persons with names and types of relation
3. Works of all related persons
4. Biographies of all related persons
5. Any other data related to persons which is available in the RDF repo. This requires implementation but can be done soon.

What is more, I can provide information regarding related groups:
1. Preferred name of harmonisedGroup
2. Related groups with names and types of relation
3. harmonisedUsers that are related to (part of) groups

Viliam, could you propose a proper XML template? I believe that we can simply add new attributes (work, biography,...) to the current XML template:

  <rdf:Description rdf:about="gama:argos:main:Person:3b28bfbd5c5a408b9eab5660e6fe2961">
    <rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
    <gama:person_name>Jayce Salloum [Salloum, Jayce]</gama:person_name>
    <owl:sameAs rdf:resource="gama:heure-exquise:main:Person:1518"/>
    <owl:sameAs rdf:resource="gama:instants:main:Person:668"/>
    <owl:sameAs rdf:resource="gama:montevideo:main:Person:62"/>
  </rdf:Description>
  <rdf:Description rdf:about="gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3">
    <rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
    <gama:person_name>Marcel Broodthaers [Broodthears, Marcel]</gama:person_name>
  </rdf:Description>
  <rdf:Description rdf:about="gama:montevideo:main:Collective:3">
    <rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
    <gama:person_name>Abramovic / Ulay [Abramovic / Ulay]</gama:person_name>
    <gama:has_member rdf:resource="gama:instants:main:Person:1496"/>
  </rdf:Description>
  <rdf:Description rdf:about="gama:montevideo:main:Collective:14896">
    <rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
    <gama:person_name>Lous America, David Garcia, Henk Wijnen &amp; Annie Wright [America, Lous, David Garcia, Henk Wijnen &amp; Annie Wright]</gama:person_name>
  </rdf:Description>

Please let me know what you think about this solution.

Original comment by piotrrom...@gmail.com on 22 Jul 2009 at 8:48

GoogleCodeExporter commented 9 years ago
It seems that I finally solved the problem with harmonisation and languages. :)

My plan is to use caching properties for this purpose. This means that the repository will infer the information so that the query results are immediately ready to be used in the frontend.

For example, there will be a property cache:work_title prepared by the repository. It will be ensured that there is exactly one work title in every language for every work.

Another example would be the cache:person_name property. Here, the repository would ensure that there is exactly one person name, based on the harmonisation, for every person.
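A minimal Python sketch of the "exactly one title per language" guarantee for cache:work_title follows. The selection rule is not specified in this thread, so the sketch simply keeps the first title it sees for each language; that rule, the function name and the placeholder titles are all assumptions made for illustration.

```python
def cache_work_titles(titles):
    """Reduce (text, language) pairs to exactly one title per language.

    Assumption: the first title encountered for a language wins. The real
    transformation may rank titles differently, e.g. using harmonisation data.
    """
    cached = {}
    for text, language in titles:
        if language not in cached:
            cached[language] = text
    return cached


# Placeholder titles for one work, as they might arrive from several sources.
titles = [
    ("Title A", "en"),
    ("Title B", "en"),
    ("Titel", "de"),
]
print(cache_work_titles(titles))
```

Whatever rule is finally chosen, the point of the caching property is that the rule runs once in the repository instead of in every frontend query.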

The advantages are obvious: speed and simplicity.
The disadvantages are the additional caching properties and the storage space used.

To summarise:
- some metadata from the GAMA Schema will be transformed to caching properties
- the repository will perform the transformation
- somebody should define the set of properties that need to be transformed and how they should behave (e.g. person_name, work_title, biography, creator...)
- multiple languages can easily be handled
- harmonisation results will be used during the transformation

What is affected:
- Toon and Charles should define the set of properties for transformation
- I need to implement the transformation (caching) in the repository
- the Java middleware needs to use the new properties and slightly simplified queries
- Frontend is not affected (I hope)
- Harmonisation mechanism is not affected
- DB-Adapters are not affected
- Indexing engine is not affected

Original comment by viliam.s...@gmail.com on 23 Jul 2009 at 8:47

GoogleCodeExporter commented 9 years ago
Moreover, if you agree to my proposal, it will also solve Issue 34.

Original comment by viliam.s...@gmail.com on 23 Jul 2009 at 8:51

GoogleCodeExporter commented 9 years ago
Sounds good to me. If you need any help from my side (harmonization), just let me know.

Original comment by piotrrom...@gmail.com on 23 Jul 2009 at 9:52

GoogleCodeExporter commented 9 years ago
I am on holiday so unable to check the details, but it looks like we will solve lots of issues with this approach. Thus, as far as I understand it now, I think we can agree on this solution. However, I have not discussed it with Charles yet.

@Viliam:
- somebody should define the set of properties that need to be transformed and how
they should behave (eg. person_name, work_title, biography, creator...)
I will be back in the office on Aug 10; then we can define this set, test the adapted SPARQL queries and see how it works.

Original comment by toon...@gmail.com on 27 Jul 2009 at 8:36

GoogleCodeExporter commented 9 years ago
Here is the first caching property that solves the harmonisation problem of 
person names:

"cache:person_name"

It satisfies the following conditions:
- for every person in the repository, there is always a single instance of the 
name
- the person name coming from the harmonisation tool overrides all other names
- the name is assigned correctly also to persons marked as equivalent in the
harmonisation tool
- no languages are defined in this property, therefore the name is identical 
for all
languages
- if a name is missing in the source database, an empty string is used instead
- sorting values are also handled properly (similar to the gama:person_name)

Example: Obtain a list of harmonised person names of someone called "Woody"

PREFIX gama: <http://gama-gateway.eu/schema/>
PREFIX cache: <http://gama-gateway.eu/cache/>
select * {
 ?person cache:person_name ?name.
 FILTER gama:match("Woody", ?name)
}
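The conditions listed above can be restated as a small Python sketch. It is not the repository's implementation, only a model of the stated rules: one name per person, the harmonised name overrides the others, equivalent persons share the name, and a missing name becomes an empty string. The input shapes, the function name, the choice of URI that carries the harmonised name and the unnamed example person are assumptions.

```python
HARMONISATION_GRAPH = "http://gama-gateway.eu/harmonisation/"


def cache_person_names(clusters, names):
    """Compute one cached name per person URI.

    clusters -- list of lists of equivalent person URIs (assumed input)
    names    -- dict: URI -> list of (name, graph) tuples (assumed input)
    """
    cached = {}
    for cluster in clusters:
        rows = [row for uri in cluster for row in names.get(uri, [])]
        harmonised = [name for name, graph in rows if graph == HARMONISATION_GRAPH]
        original = [name for name, graph in rows if graph != HARMONISATION_GRAPH]
        # Harmonised name first, then any original name, else an empty string.
        chosen = (harmonised or original or [""])[0]
        for uri in cluster:
            cached[uri] = chosen  # every equivalent person gets the same name
    return cached


ARGOS = "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"
INSTANTS = "gama:instants:main:Person:65"
UNNAMED = "gama:example:main:Person:1"  # hypothetical person without a name

names = {
    ARGOS: [("Marcel Broodthaers", "gama:argos:main:")],
    INSTANTS: [
        ("Marcel Broodthaers", "gama:instants:main:"),
        ("Marcél Broodthaers", HARMONISATION_GRAPH),
    ],
}
cached = cache_person_names([[ARGOS, INSTANTS], [UNNAMED]], names)
print(cached[ARGOS], repr(cached[UNNAMED]))
```

Sorting values, which the real property also handles, are left out of this sketch.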

Original comment by viliam.s...@gmail.com on 10 Aug 2009 at 10:15

GoogleCodeExporter commented 9 years ago

One thing still unclear to me is how to deal with the non-harmonized properties of a person considered equal to another. There are still two persons, so what happens when I query non-harmonized properties of the merged person? I get e.g. 2 bios, correct?

These bios also have languages, so one may be filtered out, but not always: sometimes the frontend wants all of them, sometimes not.
This is good material to discuss at the coming Amsterdam meeting.

Original comment by toon...@gmail.com on 14 Aug 2009 at 2:39

GoogleCodeExporter commented 9 years ago

Original comment by alu...@gmail.com on 19 Aug 2009 at 10:26

GoogleCodeExporter commented 9 years ago

Original comment by alu...@gmail.com on 29 Sep 2009 at 1:22