Closed GoogleCodeExporter closed 9 years ago
@Viliam:
This is not an urgent issue, but it would be nice to have some idea of the work
involved (at RDF level, at frontend level) so we can plan it in.
First, is the approach described feasible to implement in the SPARQL layer?
Original comment by toon...@gmail.com
on 18 Jun 2009 at 11:37
Take a look at the person with URI "gama:instants:main:Person:65"
http://research.ciant.cz/gama/devel/GamaRepository/endpoint/explore.php?uri=gama%3Ainstants%3Amain%3APerson%3A65
It has already been harmonised. The harmonisation process merged two persons:
- gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3
- gama:instants:main:Person:65
Original comment by viliam.s...@gmail.com
on 25 Jun 2009 at 4:08
I still have some questions. What is the bottom line for the frontend? Is it
correct to assume
-- that no adaptation of any queries is needed,
-- that Person:8b2ee1de5cb741928b3839d0f6391ce3 will never be in a query result
together with Person:65,
-- that no contradictions/inconsistencies from (faulty) harmonizations can occur?
Also, what is the consequence for graph names? Could these be used as filters
for the scenarios listed above? Say, I only want to see harmonized persons.
Original comment by toon...@gmail.com
on 1 Jul 2009 at 7:45
The idea of harmonisation is to merge objects that are equal (e.g. equal
persons) so that the frontend doesn't need to construct additional queries.
Consider just persons for the sake of simplicity.
After merging two persons, all their metadata are mixed into a single entity.
For example, the new person would contain two or more person_names.
Then it doesn't matter which name you search for: the repository will always
find the URIs of both persons that formed the mixed entity.
In my previous example (Comment 2) about "gama:instants:main:Person:65", the
person was merged with "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"
(this is indicated using the http://www.w3.org/2002/07/owl#sameAs property).
You can see 3 person_names coming from 3 different graphs:
- Marcel Broodthaers (coming from the gama:argos:main: graph)
- Marcel Broodthaers (coming from the gama:instants:main: graph)
- Marcél Broodthaers (coming from the http://gama-gateway.eu/harmonisation/ graph)
Now, if you searched for the name "Marcel Broodthaers" using this query:
PREFIX gama: <http://gama-gateway.eu/schema/>
SELECT distinct ?person_uri WHERE {
?person_uri gama:person_name ?name.
FILTER (?name = "Marcel Broodthaers")
}
You would get 2 URIs.
- gama:instants:main:Person:65
- gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3
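To make the merge behaviour easier to reason about outside the repository, here is a minimal Python sketch. It is not the repository's implementation; the in-memory data layout, the function names, and the choice of which URI carries the harmonised spelling are all assumptions made for illustration. It groups URIs linked by owl:sameAs with a union-find and gives every URI in a cluster the union of the cluster's names, so a search by any of the names returns both URIs.

```python
from collections import defaultdict

ARGOS = "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"
INSTANTS = "gama:instants:main:Person:65"

# Hypothetical stand-in for the repository content: (uri, name) pairs.
# Which URI carries the harmonised spelling is an assumption.
NAMES = [
    (ARGOS, "Marcel Broodthaers"),
    (INSTANTS, "Marcel Broodthaers"),
    (INSTANTS, "Marcél Broodthaers"),
]
SAME_AS = [(INSTANTS, ARGOS)]  # owl:sameAs links


def find(parent, uri):
    """Return the root of the cluster that contains uri."""
    parent.setdefault(uri, uri)
    while parent[uri] != uri:
        parent[uri] = parent[parent[uri]]  # path halving
        uri = parent[uri]
    return uri


def merge_entities(names, same_as):
    """Give every URI the union of the names found in its sameAs cluster."""
    parent = {}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)
    cluster_names = defaultdict(set)
    uris = set(parent)
    for uri, name in names:
        uris.add(uri)
        cluster_names[find(parent, uri)].add(name)
    return {uri: set(cluster_names[find(parent, uri)]) for uri in uris}


def search(merged, name):
    """Return, sorted, all URIs whose merged entity carries the given name."""
    return sorted(uri for uri, known in merged.items() if name in known)


merged = merge_entities(NAMES, SAME_AS)
print(search(merged, "Marcél Broodthaers"))
```

In this sketch the names are kept as a set, so the two identical spellings from the argos and instants graphs collapse into one value; the repository itself reports them separately per graph.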
Also, the following "simple load" queries are equivalent; both will return 3 names:
SIMPLE LOAD
gama:instants:main:Person:65
PROPERTIES
gama:person_name
SIMPLE LOAD
gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3
PROPERTIES
gama:person_name
In the frontend you can then decide which name should be presented to the
end-user, preferably the name coming from the
http://gama-gateway.eu/harmonisation/ graph.
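As an illustration of that frontend decision, a small Python sketch follows. The function name, the (graph, name) input shape, and the fallback to the first available name are assumptions; only the harmonisation graph URI comes from this thread.

```python
HARMONISATION_GRAPH = "http://gama-gateway.eu/harmonisation/"


def pick_display_name(names_by_graph):
    """Pick one name from a list of (graph, name) pairs.

    The name from the harmonisation graph wins. Falling back to the first
    available name (or an empty string) is an assumption of this sketch.
    """
    for graph, name in names_by_graph:
        if graph == HARMONISATION_GRAPH:
            return name
    return names_by_graph[0][1] if names_by_graph else ""


names = [
    ("gama:argos:main:", "Marcel Broodthaers"),
    ("gama:instants:main:", "Marcel Broodthaers"),
    (HARMONISATION_GRAPH, "Marcél Broodthaers"),
]
print(pick_display_name(names))
```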
Original comment by viliam.s...@gmail.com
on 1 Jul 2009 at 8:25
Answers to your specific questions:
- All existing queries should work without changes (just some "DISTINCT"
keywords will be needed in specific situations).
- You just need to decide what should be rendered after you receive results
from the repository, based on the graph name.
- You can search for merged objects (such as harmonised persons) using the
owl:sameAs property.
- I'm a little bit surprised by your question "that
Person:8b2ee1de5cb741928b3839d0f6391ce3 will never both be in a query result
with Person:65", because the main reason for harmonisation is to merge objects
so that they act as a single object. Therefore query results will always return
both URIs (when applicable). But maybe you had some specific scenario in mind.
Notice that you can always distinguish between data sources using the graph name.
- Inconsistencies/contradictions can occur if the harmonisation tool produces
them. The repository is oblivious to the semantics of the harmonisation data;
it only merges URIs.
Original comment by viliam.s...@gmail.com
on 1 Jul 2009 at 8:39
If I understand this correctly, I am not happy with this solution. Again a lot
of work is placed on the front end, and now in the final phase of the project.
We have stated several times what we would like, but never got any satisfying
answers.
So here is what we need in the front end:
- Case: A user is added to a work as an artist
* For the preview I need *one* name, the correct name. I don't care about the
other names in this case. The choice here seems easy because of the graph name,
and maybe Toon can implement that relatively easily. This requires that there
is *always* a http://gama-gateway.eu/harmonisation/ value when there is more
than one choice. Always.
* For the detailed view things are more difficult. What data do I choose,
because not all of it is harmonised as far as I know? The biography is the most
notable example. I can handle multiple biographies, but I need them in one
person object, because otherwise it will be too confusing in the case of a work
with multiple artists.
Example: work x with artists y and z, both harmonised from 3 sources, would get
me 6 objects with zero or more biographies, taking different languages and
empty biographies into account. How do I know which are for which person unless
I start resolving relations?
- Case: You are searching for an artist
* The list of artists
I need a list of all correct names, possibly with harmonised alternatives.
Since the incorrect names are not used anymore, it would be strange to add them
here.
* The preview
In this case the biography problem is more important because in the preview I
can only choose one to show. It would be too much to give choices here.
* The detailed view
Same as the detailed view for works.
So basically: we need *one* object with the correct name and the other
corrected & harmonised values, and all the versions of the non-harmonised
values, like the biographies. The unharmonised data is not ideal, but I
understand that harmonising biographies might be editorially tricky, to say the
least.
Anyway, this is what harmonisation does in my world: create one valid
alternative for unharmonised data, not add yet another choice and throw all
possible values on a heap, saying "sort it all out yourself".
Original comment by charles....@kmt.hku.nl
on 1 Jul 2009 at 9:14
Ok, I'll try to simplify the work with harmonised data.
The first approach is a service which removes URIs that point to the same
entity after harmonisation.
Try this:
https://research.ciant.cz/gama/devel/GamaRepository/soa/?service=query/Remove_Identical_Uris&help
... more to come
Original comment by viliam.s...@gmail.com
on 1 Jul 2009 at 7:41
So how do you feel about this ... would such a service help? And what about
query performance? I have the feeling that this could slow things down even
more ... but maybe it is not relevant when the portal runs in the same local
network.
Would it generally be an option to delegate development of some web service on
top of the repository to AGH/Piotr? Mikolaj told me they would be willing.
Unfortunately Piotr is on vacation till mid-July.
Original comment by alu...@gmail.com
on 3 Jul 2009 at 9:20
The general problem with services is that they break the current front-end
retrieval method based on SPARQL, where data is received in tuples and
assembled into objects.
If we had various services that return data in different formats, the query
logic and sequence would become unmanageable. I am not even taking performance
into account. The bottom line is that the frontend middleware is built on a
SPARQL-based interface, where a query is a sequence of SPARQL queries, all
returning tuples.
So my main point is that any solution outside the RDF database would be almost
impossible to integrate in the current query process flow. In earlier
discussions in Prague I always understood that harmonization would be
transparent for the frontend, where the database would return harmonized yet
complete persons. As I see it, this is something that should remain completely
in the realm of RDF, because it is exactly what RDF data is intended for.
Adding new SPARQL filter functions or custom constructs such as SIMPLE LOAD
has solved similar problems earlier on, so I would expect this issue can also
be solved on that level.
But even more preferable is that harmonization results in complete persons that
(usually or always) 'shadow' their original persons and have all their
properties, something Charles has also suggested earlier.
Original comment by toon...@gmail.com
on 3 Jul 2009 at 9:47
On hiding harmonized data
======================
I hope to get reactions on this issue from the people involved in harmonization
about which path to take to solve it, since it is such a fundamental issue for
the harmonization efforts.
For now the immediate issue is that harmonized data shows up in the frontend as
double and empty persons. Therefore I propose for now to filter all harmonized
data out of the frontend. This could be done in 2 ways:
-- by temporarily deleting the complete graph from the database
-- by filtering it out in the frontend queries by adding a 'not graph HARM'
statement in SPARQL.
@Viliam, please advise which solution to apply here, such that it can be rolled
back easily.
Original comment by toon...@gmail.com
on 7 Jul 2009 at 7:43
If we need to disable harmonisation data just temporarily, I can turn off the
mechanism that merges entities based on the owl:sameAs property.
Or I can even ignore the whole harmonisation graph.
Anyway, I understand that there are multiple results in the database due to the
harmonisation. But I don't understand why you see empty persons. Harmonisation
only merges entities, so they should contain more data than before. Could you
provide an example (URI of the empty person)?
Original comment by viliam.s...@gmail.com
on 7 Jul 2009 at 8:10
You are correct, the issue is indeed the multiple occurrence of properties and
how to select one, especially for values that are not normalized. I mixed this
issue up with an earlier one causing empty works; empty persons are not the
issue.
The actual problem and a possible solution are already clearly described in
comment 6 by Charles.
@Viliam
Please let us know what you think of Charles' proposed solution in terms of RDF
extensions, and how long this would take you to implement. Depending on this,
we can then take action on hiding the harmonized data for the time being.
Original comment by toon...@gmail.com
on 7 Jul 2009 at 10:29
Shall we maybe have a call to discuss this? Otherwise I'll put it on the TelCo
agenda anyway ...
Original comment by alu...@gmail.com
on 9 Jul 2009 at 7:24
Yes, it is indeed an important issue, but we need not solve it completely in
the TelCo. I propose to discuss it briefly in the TelCo, see if there are any
new developments, and then decide afterwards if a second small TelCo on this
issue alone would be helpful. Or maybe just me/Charles and Viliam....
Original comment by toon...@gmail.com
on 9 Jul 2009 at 7:35
Colleagues,
May I propose to discuss this at the TelCo, but only after 15th July? Piotr
Romaniak, our harmonization expert, will be back in the office then. And I
think we could benefit from his participation in the TC.
Regards,
Mikołaj
Original comment by mikolaj....@gmail.com
on 9 Jul 2009 at 6:11
I'm working on a SPARQL function gama:noEquivalentDuplicates($var) that should
remove duplicate results created by the harmonisation mechanism.
Original comment by viliam.s...@gmail.com
on 20 Jul 2009 at 9:25
Viliam, could you be more explicit about the behaviour of the function? What
happens in Case A in comment 6 above?
These are the two solutions I was talking about. The first one isn't really
that great, but I added it for completeness.
1)
Every user gets a new field, nameHarmonised for example. That one is filled
with either the harmonised name or the original name. It is obligatory for each
person in the database.
This is the easiest solution to get singular names, but it is inferior to the
other solution because, for example, biographies are not gathered together. I
am not in favor of this solution.
2)
A new type or subtype is created, the harmonisedUser for example. The name is
not 100% fitting because it will also contain info for persons not harmonised.
The type is created for all (unique) persons, and is updated when new
harmonisation data is entered.
This person type will contain:
- for a harmonised person
* one harmonised name
* all years (although the front end will choose only one)
* all biographies (the front end will display all, depending on language)
* all the works, collectives and other connections of all the equivalent
persons (through harmonisation)
- for a person who is not harmonised yet, who stays the same, the following is
added:
* the original name
* the original biography
* the original year
* the original works and collectives
It is important that it exists for all unique persons, to avoid the same
problem where the middleware has to infer and interpret the data. This process
can be batch-processed each time during a harmonisation update.
Original comment by charles....@kmt.hku.nl
on 20 Jul 2009 at 9:42
The problem of harmonisation can actually be divided into at least two parts:
- duplicate objects (URIs) that have been merged
- how to get the right metadata (such as person name)
The first problem should now be solved by the new
gama:noEquivalentDuplicates(?var) function. After two URIs are merged, the
repository picks one of the URIs and makes it a representative of the new
equivalence cluster.
For example, if we merge URIs A, B, C, D and E, the repository can pick C as the
representative. Our function will restrict the query to the representatives
only and will ignore the other merged URIs (the duplicates).
Check out the difference between the following queries (the first has 6 results
while the second has just 5).
PREFIX gama: <http://gama-gateway.eu/schema/>
select * {
<gama:ars-electronica:main:Event:13017> gama:has_creator ?p.
}
PREFIX gama: <http://gama-gateway.eu/schema/>
select * {
<gama:ars-electronica:main:Event:13017> gama:has_creator ?p.
FILTER gama:noEquivalentDuplicates(?p)
}
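The filtering idea can be sketched in Python as follows. This only illustrates the concept, not the repository code: the function names and the "ex:" URIs are made up, and picking the lexicographically smallest URI as the representative is an assumption (the repository may pick any member of a cluster).

```python
def representatives(same_as):
    """Map every URI that appears in a sameAs pair to its representative."""
    clusters = {}  # uri -> set of all URIs in the same equivalence cluster
    for a, b in same_as:
        merged = clusters.get(a, {a}) | clusters.get(b, {b})
        for uri in merged:
            clusters[uri] = merged
    # Assumption: the smallest URI in a cluster acts as its representative.
    return {uri: min(members) for uri, members in clusters.items()}


def no_equivalent_duplicates(uris, same_as):
    """Keep only URIs that are representatives (or were never merged)."""
    rep = representatives(same_as)
    return [uri for uri in uris if rep.get(uri, uri) == uri]


# Six creators, two of which were merged by harmonisation.
creators = ["ex:p1", "ex:p2", "ex:p3", "ex:p4", "ex:p5", "ex:p6"]
same_as = [("ex:p2", "ex:p5")]
print(no_equivalent_duplicates(creators, same_as))
```

Because this is a pure filter, it relies on the representative being part of the result list. That holds for merged entities, where every member of a cluster shares the same properties and therefore shows up in the same results.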
Original comment by viliam.s...@gmail.com
on 20 Jul 2009 at 2:28
Note: the aforementioned function gama:noEquivalentDuplicates() should be used
when counting results or generating a list of objects. An example could be the
list of artists of a work.
Original comment by viliam.s...@gmail.com
on 21 Jul 2009 at 9:18
I cannot judge how well this function fits within the middleware, or if it
makes another query necessary; Toon is the one who does all the querying.
However, I really wonder why it is a better solution to add functions instead
of a clean subtype. What is the problem with the, in my eyes, straightforward
solution of using the harmonisation data to create a harmonised and correct
person type?
Original comment by charles....@kmt.hku.nl
on 21 Jul 2009 at 12:23
I think I can generate an XML file for harmonisedUser as proposed by Charles in
comment 17 2).
What I can generate from the current harmonization database is:
1. Preferred name of harmonisedUsers
2. All related persons with names and types of relation
3. Works of all related persons
4. Biographies of all related persons
5. Any other data related to persons which are available in the RDF repo. This
requires implementation but can be done soon.
What is more, I can provide information regarding related groups:
1. Preferred name of harmonisedGroup
2. Related groups with names and types of relation
3. harmonisedUsers that are related to (part of) groups
Viliam, could you propose a proper XML template? I believe that we can simply
add new attributes (work, biography, ...) to the current XML template:
<rdf:Description
rdf:about="gama:argos:main:Person:3b28bfbd5c5a408b9eab5660e6fe2961">
<rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
<gama:person_name>Jayce Salloum [Salloum, Jayce]</gama:person_name>
<owl:sameAs rdf:resource="gama:heure-exquise:main:Person:1518"/>
<owl:sameAs rdf:resource="gama:instants:main:Person:668"/>
<owl:sameAs rdf:resource="gama:montevideo:main:Person:62"/>
</rdf:Description>
<rdf:Description
rdf:about="gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3">
<rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
<gama:person_name>Marcel Broodthaers [Broodthears, Marcel]</gama:person_name>
</rdf:Description>
<rdf:Description rdf:about="gama:montevideo:main:Collective:3">
<rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
<gama:person_name>Abramovic / Ulay [Abramovic / Ulay]</gama:person_name>
<gama:has_member rdf:resource="gama:instants:main:Person:1496"/>
</rdf:Description>
<rdf:Description rdf:about="gama:montevideo:main:Collective:14896">
<rdf:type rdf:resource="http://gama-gateway.eu/schema/Person"/>
<gama:person_name>Lous America, David Garcia, Henk Wijnen &amp; Annie Wright [America, Lous, David Garcia, Henk Wijnen &amp; Annie Wright]</gama:person_name>
</rdf:Description>
Please let me know what you think about this solution.
Original comment by piotrrom...@gmail.com
on 22 Jul 2009 at 8:48
It seems that I finally solved the problem with harmonisation and languages. :)
My plan is to use the caching properties for this purpose. It means that the
repository will infer the information so that the query results are immediately
ready to be used in the frontend.
For example, there will be a property cache:work_title prepared by the
repository. It will be ensured that there is exactly one work title in every
language for every work.
Another example would be the cache:person_name property. Here, the repository
would ensure that there is exactly one person name based on the harmonisation
for every person.
The advantages are obvious: speed and simplicity.
The disadvantages are the additional caching properties and the storage space
used.
To summarise:
- some metadata from the GAMA Schema will be transformed to caching properties
- the repository will perform the transformation
- somebody should define the set of properties that need to be transformed and
how they should behave (e.g. person_name, work_title, biography, creator...)
- multiple languages can easily be handled
- harmonisation results will be used during the transformation
What is affected:
- Toon and Charles should define the set of properties for transformation
- I need to implement the transformation (caching) in the repository
- the Java middleware needs to use the new properties and slightly simplified
queries
- Frontend is not affected (I hope)
- Harmonisation mechanism is not affected
- DB-Adapters are not affected
- Indexing engine is not affected
Original comment by viliam.s...@gmail.com
on 23 Jul 2009 at 8:47
Moreover, if you agree with my proposal, it will also solve Issue 34.
Original comment by viliam.s...@gmail.com
on 23 Jul 2009 at 8:51
Sounds good to me. In case you need any help from my side (harmonization), just
let me know.
Original comment by piotrrom...@gmail.com
on 23 Jul 2009 at 9:52
I am on holiday so unable to check the details, but it looks like we will solve
lots of issues with this approach. Thus, as far as I understand it now, I think
we can agree on this solution. However, I have not discussed it with Charles
yet.
@Viliam:
- somebody should define the set of properties that need to be transformed and
how they should behave (e.g. person_name, work_title, biography, creator...)
I will be back in the office on Aug 10; then we can define this set, test the
adapted SPARQL queries and see how it works.
Original comment by toon...@gmail.com
on 27 Jul 2009 at 8:36
Here is the first caching property that solves the harmonisation problem of
person names:
"cache:person_name"
It satisfies the following conditions:
- for every person in the repository, there is always a single instance of the
name
- a person name coming from the harmonisation tool overrides all other names
- the name is also assigned correctly to persons marked as equivalent in the
harmonisation tool
- no languages are defined in this property, therefore the name is identical
for all languages
- if a name is missing in the source database, an empty string is used instead
- sorting values are also handled properly (similar to gama:person_name)
Example: Obtain a list of harmonised person names of someone called "Woody"
PREFIX gama: <http://gama-gateway.eu/schema/>
PREFIX cache: <http://gama-gateway.eu/cache/>
select * {
?person cache:person_name ?name.
FILTER gama:match("Woody", ?name)
}
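For readers who want to check their understanding of the conditions above, here is a small Python sketch of how such a cached name could be derived. It is only an illustration under assumptions: the function name, the input dictionaries and the "ex:" URIs are hypothetical, and sorting values are left out.

```python
ARGOS = "gama:argos:main:Person:8b2ee1de5cb741928b3839d0f6391ce3"
INSTANTS = "gama:instants:main:Person:65"


def cache_person_names(clusters, source_names, harmonised_names):
    """Return exactly one cached name per person URI.

    clusters: lists of equivalent URIs (a single-item list if not merged)
    source_names: uri -> name from the source database (may be missing)
    harmonised_names: uri -> name coming from the harmonisation tool
    """
    cache = {}
    for cluster in clusters:
        # A harmonised name on any member applies to the whole cluster.
        harmonised = next(
            (harmonised_names[u] for u in cluster if u in harmonised_names),
            None,
        )
        for uri in cluster:
            if harmonised is not None:
                cache[uri] = harmonised
            else:
                # A missing source name becomes an empty string.
                cache[uri] = source_names.get(uri, "")
    return cache


clusters = [[INSTANTS, ARGOS], ["ex:Person:plain"], ["ex:Person:unnamed"]]
source_names = {
    INSTANTS: "Marcel Broodthaers",
    ARGOS: "Marcel Broodthaers",
    "ex:Person:plain": "Jane Doe",  # made-up record
}
harmonised_names = {INSTANTS: "Marcél Broodthaers"}  # attachment is assumed
cache = cache_person_names(clusters, source_names, harmonised_names)
print(cache[ARGOS])
```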
Original comment by viliam.s...@gmail.com
on 10 Aug 2009 at 10:15
One thing still unclear to me is how to deal with the non-harmonized properties
of a person considered equal to another. There are still two persons, so what
happens when I query non-harmonized properties of the merged person? I get e.g.
2 bios, correct?
These bios also have languages, so one may be filtered out, but not always:
sometimes the frontend wants all of them, sometimes not.
This is good material to discuss in the coming Amsterdam meeting.
Original comment by toon...@gmail.com
on 14 Aug 2009 at 2:39
Original comment by alu...@gmail.com
on 19 Aug 2009 at 10:26
Original comment by alu...@gmail.com
on 29 Sep 2009 at 1:22
Original issue reported on code.google.com by
toon...@gmail.com
on 2 Jun 2009 at 2:42