vsimko / gama-gateway

Gama Gateway RDF Repository and GAMA data model

Make db adapters report on anomalies and unexpected data for the frontend #26

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Currently the quality control on the output of the database adapters is
very informal, based on quick visual inspection.

For a better quality assessment of the RDF content, more insight is needed into
the output of the database adapters. Just browsing in the frontend has
proven to be insufficient to assess the quality of the RDF data.

To address this situation, the following functionality is proposed:
After each ingest operation, the database adapters will report the
unexpected or incomplete data (artists, works, ...) encountered in the
archives. Examples are works without manifestations, artists without
works, works without titles, dates without years, works with > 5
manifestations, etc.

To achieve this, the following actions are proposed: 

- HKU to create a list of what it expects in the RDF data.

- HfK to update the database adapters to report on (count and
register) the cases where ingested data does not meet the expectations
of the frontend.

-- the reports are made available after an ingest operation.  
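A rough sketch of what such a post-ingest report could look like, using the example anomalies listed above. The `Work` record, its field names and the helper functions are hypothetical and only illustrate the counting-and-registering idea; they are not the actual adapter code.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    # Hypothetical in-memory view of an ingested work.
    uri: str
    title: str = ""
    year: Optional[int] = None
    manifestations: List[str] = field(default_factory=list)


def check_work(work, max_manifestations=5):
    """Return the names of all expectations this work violates."""
    problems = []
    if not work.title.strip():
        problems.append("work without title")
    if work.year is None:
        problems.append("date without year")
    if not work.manifestations:
        problems.append("work without manifestations")
    elif len(work.manifestations) > max_manifestations:
        problems.append(f"work with more than {max_manifestations} manifestations")
    return problems


def ingest_report(works):
    """Count each anomaly type and register every single incident."""
    counts = Counter()
    incidents = []
    for work in works:
        for problem in check_work(work):
            counts[problem] += 1
            incidents.append((work.uri, problem))
    return counts, incidents
```

The counts give the statistics, while the incident list is what a content provider would need to repair individual records.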

If the ingested data is seriously problematic for the functionality of the
frontend, measures need to be taken.
Possible actions at the database adapter level will be:
-- filter out the problematic data
-- insert default values or placeholders for missing data
-- repair data to follow the expected format when creating RDF for a
specific archive.
-- It may uncover a problem that is actually at the RDF level.
-- It could also be that the frontend expectations are outdated and need to
be adjusted to be able to handle the new batch of archive RDF content.

Because ingest will be in the development database, solving these problems
thoroughly is possible without causing problems for the Stable Gamma
version. Normally,  the db adapters will never directly create new imports
for the Stable version because then no quality control can take place.  

This issue will be enhanced with a list of expectations regarding archive
data inherent in the current implementation of the frontend.  

Original issue reported on code.google.com by toon...@gmail.com on 12 Mar 2009 at 9:47

GoogleCodeExporter commented 9 years ago
We finally came up with a list of recommended quality checks, on which the
implementation of a DB adapter quality checker can be based:

 Expectations of the Portal of the GAMA data.
---------------------------------------------

 Part of GAMA Quality Control.

 A. Completeness of the RDF structure and its values.
 ---------------------------------------------------

 1. Every Work has a title field. This will be by default the Displayable Title.

It is proposed to discuss with the archives what to do with works without a title.
Maybe <no title>? Or an empty title; empty title > no title?

 2. Every Work has a creator. Works without a creator may be invisible for the frontend.

 3. Each Work has at least one main manifestation. Works without a manifestation may be
invisible for the frontend.

A manifestation doesn't have to have a video counterpart. It can be only metadata.

4. Each work has a creation date  with at least the correct year.

5. If a manifestation is the main manifestation (idx_main=true), then it will have a
stream type of 1 or 2, and filmstrip shots for the film strip are available. The
shot similarity match data will also be present.

6. Every Artist has a name. This will be by default the Displayable Title.

7. Preferably, every artist has one or more works. If there are
large numbers of authors without works, it is hard to find works by author, so the
frontend may choose to filter these artists out. This filter may or may not be
adjustable by the end user.
The effect is that artists without works could be completely hidden from the end user.

8. Data produced during harmonization will be in a different graph so
the frontend can separate it from the non-harmonized data.

9. No so-called Superpersons will be created during harmonization
[pending the discussion between Ciant and AGH on the insertion method of harmonization data.
However, for the frontend Superpersons are problematic and should be avoided]

10. Every Class URI (not pointing to a simple property) contains one of
'Work_', 'Artist_', 'Collective_', 'Manifestation_'. This keeps URI type detection
simple.

11. Statistics of completeness
 -------------------------
 Note that not all these expectations are absolute. It is also about
 ratios: 1 % of artists without works is not an issue, but at
 25 % the frontend may have to change its display policy.
 Therefore, collected statistical data on the imported data is also
 very helpful for quality management.
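 To make the ratio idea concrete, here is a minimal sketch. Only the 1 % and 25 % figures come from the text above; treating everything in between as "monitor" and the function names are my own assumptions.

```python
def anomaly_percentage(anomalous, total):
    """Percentage of records showing a given anomaly (0.0 for an empty import)."""
    if total == 0:
        return 0.0
    return 100.0 * anomalous / total


def assess(anomalous, total, ok_up_to=1.0, alarm_from=25.0):
    """Classify an anomaly count; the thresholds are assumptions taken from the example."""
    pct = anomaly_percentage(anomalous, total)
    if pct <= ok_up_to:
        return "ok"
    if pct >= alarm_from:
        return "review display policy"
    return "monitor"
```

 For example, 240 artists without life_span out of 450 is well above the 25 % mark and would be flagged for review.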

 ---------------------------------------------------------------------

 I am looking forward to knowing how soon (some of) these checks can be implemented in
the database adapter, either for stats collecting or incident tracking. Here too,
completeness is not the goal for a first iteration of this quality feature.

Original comment by toon...@gmail.com on 7 Apr 2009 at 12:49

GoogleCodeExporter commented 9 years ago
I started the work on the first approach.
Take a look at the following service:

http://research.ciant.cz/gama/devel/GamaRepository/soa/?service=test/Test_Repository_Structure&help

Original comment by viliam.s...@gmail.com on 20 Apr 2009 at 1:36

GoogleCodeExporter commented 9 years ago
Request for an additional statistic to monitor in Ciant's Quality Measure Service:

 main manifs that have a worktype.

I noticed that only < 60 works with a main manif have a worktype. That is very
little. I would expect a much higher count. It suggests incomplete data / import
errors.

Use the query below to test:

PREFIX mysql: <http://www.mysql.com/>
PREFIX gama: <http://gama-gateway.eu/schema/>
PREFIX cache: <http://gama-gateway.eu/cache/>
# Count distinct works per work type, restricted to works with a main manifestation.
SELECT distinct ?worktype count_distinct(?workuri) as ?numworkuris
WHERE {
  ?workuri cache:fulltext_works ?cache.
  FILTER (?title = gama:dbAlias(?cache, "wtitle"))
  FILTER (?description = gama:dbAlias(?cache, "wdescr"))
  FILTER (?name = gama:dbAlias(?cache, "creator"))
  ?workuri gama:work_type ?worktype.
  ?workuri gama:has_manifestation ?manifuri.
  ?manifuri gama:idx_main "1".
} group by ?worktype order by desc(?numworkuris) limit 10

Original comment by toon...@gmail.com on 21 Apr 2009 at 8:08

GoogleCodeExporter commented 9 years ago
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------

Counts per provider 

I would like to see a count per provider, so we can see immediately that FF has only
persons imported, and that LBI has no manifestations.
For a good overview of the coverage per CP, my preference would be to display per
Content Provider:
-- # of works (and preferably split into Works, Events, Resources)
-- # of persons
-- # of manifs
-- # of manifs with idx_main=1 or stream-avail=1 or 2 ('main manifs')
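A small sketch of how these per-provider counts could be tallied, assuming each imported entity is available as a dict with `provider` and `kind` keys (plus `idx_main` / `stream_avail` for manifestations). These key names are hypothetical and not taken from the repository.

```python
from collections import defaultdict

# Maps the (assumed) entity kind to the column it is counted under.
KIND_TO_COLUMN = {
    "Work": "works",
    "Event": "events",
    "Resource": "resources",
    "Person": "persons",
    "Manifestation": "manifs",
}


def empty_row():
    return {"works": 0, "events": 0, "resources": 0,
            "persons": 0, "manifs": 0, "main_manifs": 0}


def counts_per_provider(entities):
    """Tally entities per content provider; unknown kinds are ignored."""
    report = defaultdict(empty_row)
    for entity in entities:
        row = report[entity["provider"]]
        column = KIND_TO_COLUMN.get(entity["kind"])
        if column is None:
            continue
        row[column] += 1
        # A 'main manif' has idx_main = 1 or a stream availability of 1 or 2.
        if column == "manifs" and (
            entity.get("idx_main") == 1 or entity.get("stream_avail") in (1, 2)
        ):
            row["main_manifs"] += 1
    return dict(report)
```

With a table like this, a provider that has only persons imported, or no manifestations at all, shows up as a row of zeros in the other columns.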

Original comment by toon...@gmail.com on 21 Apr 2009 at 8:15

GoogleCodeExporter commented 9 years ago
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------

Counts and/or display of suspect URIs:
A suspect URI is defined here as probably malformed because it has no type in it,
e.g. no Work:  Event:  Resource:  Person:  Manifestation:  in it.
This makes recent problems with malformed subwork URIs immediately visible. Examples
are gama:ars-electronica:main:13486-253 and
gama:ars-electronica:main:14059-414, already fixed. Malformed URIs like these may
cause bugs in the frontend.
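The check itself can be very small. A minimal sketch, assuming the type markers are exactly the five listed above and that a plain substring test is enough; the well-formed URI in the usage note is made up for illustration.

```python
TYPE_MARKERS = ("Work:", "Event:", "Resource:", "Person:", "Manifestation:")


def is_suspect(uri):
    """A URI is suspect when it contains none of the known type markers."""
    return not any(marker in uri for marker in TYPE_MARKERS)


def suspect_uris(uris):
    """Return the suspect URIs in input order, for counting or display."""
    return [uri for uri in uris if is_suspect(uri)]
```

Both example URIs above are reported as suspect, while a hypothetical gama:ars-electronica:Work:13486 would pass.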

Original comment by toon...@gmail.com on 21 Apr 2009 at 8:20

GoogleCodeExporter commented 9 years ago
Added "Main manifestations without WorkType" to the service
http://research.ciant.cz/gama/devel/GamaRepository/soa/?service=test/Test_Repository_Structure&help

Original comment by viliam.s...@gmail.com on 21 Apr 2009 at 10:11

GoogleCodeExporter commented 9 years ago
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------

Creation Dates

I noticed that some works do not have a creation date. To know how big this problem is
and how to solve it, the following statistics in the statistics list would help:
- # works/events without creation date
- # artists without creation date / life span (perhaps a smaller but similar
problem, especially for sorting and display in timelines).

Original comment by toon...@gmail.com on 22 Apr 2009 at 8:00

GoogleCodeExporter commented 9 years ago
I'll probably create a better UI for the tests.
Added:
- works/events without creation date
- artists without creation date / life span

http://research.ciant.cz/gama/devel/GamaRepository/endpoint/estimate.php#atests

Original comment by viliam.s...@gmail.com on 22 Apr 2009 at 3:44

GoogleCodeExporter commented 9 years ago
As discussed in the latest Tech telco, below is the mandatory/optional/unused list.
The focus here is on the Mandatory List because it has an impact on website
functionality. Some of these issues have already been mentioned in this issue, see
above for details. What follows is a list of what the frontend expects to be
available.

MANDATORY DATA FOR THE FRONTEND.

1. Work
  - title (if unknown, set to UNKNOWN)  (Assumed only 1)
  - created date , at least a year  (Assumed only 1 )
  - creator (Assumed only 1 )
  - work_type  ( still 900 missing!)  (Assumed only 1 )
  - has one main manifestation. (manif.idx_main=1)
  - description.  (if unknown, set to UNKNOWN)  (one or more, lang=X)
  - archive (one)

2.  Manifestation
  -- should have a similarity match calculated with all other main manifs

3. Artist: 
  -- has a name 
  -- life_span , at least a year of birth 
  -- has one country (person_country)
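As an illustration only, the plain-property part of this list could be checked as sketched below. Records are assumed to be dicts keyed by the property names used above; the main-manifestation and similarity-match requirements are left out, and the helper names are hypothetical.

```python
MANDATORY = {
    "Work": ("title", "created", "creator", "work_type", "description", "archive"),
    "Artist": ("name", "life_span", "person_country"),
}

# Properties that get the UNKNOWN placeholder instead of being reported as missing.
DEFAULT_TO_UNKNOWN = ("title", "description")


def missing_fields(kind, record):
    """List the mandatory properties that are absent or empty in a record."""
    return [name for name in MANDATORY[kind] if not record.get(name)]


def apply_defaults(record):
    """Return a copy with missing title/description set to UNKNOWN."""
    fixed = dict(record)
    for name in DEFAULT_TO_UNKNOWN:
        if not fixed.get(name):
            fixed[name] = "UNKNOWN"
    return fixed
```

Running `missing_fields` after `apply_defaults` leaves only the properties that really need action from the adapter or the content provider, such as work_type.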

@Ursula, Viliam
If you feel this presents problems for the DB adapters, the CPs, or both, please
provide a plan / workaround for how to overcome this.

Because this list defines certain properties that are currently still missing, the type of
this issue is now set to 'Defect'.

Only when all the MANDATORY issues are resolved does it make sense to look at the
optional / not used ones. These will be added here.

Original comment by toon...@gmail.com on 18 Jun 2009 at 1:29

GoogleCodeExporter commented 9 years ago
Correction: the number of missing work_types is 10 times larger,
so 9000 (!).
This is currently the biggest problem that *has* to be fixed, otherwise these 9000
works will never be found in the standard queries that all contain a work_type
filter, see issue 37. This makes the missing work_types a high priority issue.

Original comment by toon...@gmail.com on 18 Jun 2009 at 2:16

GoogleCodeExporter commented 9 years ago
It was agreed in the last technical TelCo that partners should react to issues within
3 days, which is over now ... As this issue seems to be quite urgent, could someone
from HfG (Ursula, Juergen ...) please comment on this and provide a plan for when and how
this will be fixed?

Original comment by alu...@gmail.com on 22 Jun 2009 at 8:52

GoogleCodeExporter commented 9 years ago

Reporting currently in place:
1. Ursula currently reports by email & Wiki on import statistics.
This will be automated, but that is not relevant for the September release.

2. Viliam & Co will make a new page on the wiki with all RDF statistics per CP
combined in a central page. This will replace the current endpoint statistics.

Original comment by toon...@gmail.com on 17 Aug 2009 at 10:08

GoogleCodeExporter commented 9 years ago
I would like to suggest that the generated report from the DB adapters be included in the
Update/dba directory on our GamaSync FTP site, so that it describes the XML files
there. When the repository is rebuilt, a backup is created, and it would be nice to
have the report on the XML files together with the XML files.

Original comment by viliam.s...@gmail.com on 29 Sep 2009 at 1:22

GoogleCodeExporter commented 9 years ago
I'll support the following reports:

_overview.csv
Gives some overview about given data.
Example (from C3):
"Number of artists";450
"Number of persons (gama:Person)";403
"Number of collectives (gama:Collective)";47
"Artists without name (set to ""UNKNOWN"")";0
"Artists without life_span";240
"Artists without person_country";0
"Artists without biography";260
"Artists without person_image";450
"Artists without person_url";306

_anomalies.csv  (for content partners)
All anomalies (=invalid data) will be listed there.
Sample entries (from C3):
workid;property;value;description
123;"worktype";"Experimental Film";"unknown worktype (Artwork)"
455;"creator";380;"artist doesn't exist"
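For reference, a file in this shape can be produced with Python's standard csv module, as sketched below. This is only an illustration of the format (semicolon-separated, text values quoted, numbers unquoted), not the adapter's actual implementation, and `write_anomalies` is a made-up name.

```python
import csv
import io

HEADER = ["workid", "property", "value", "description"]


def write_anomalies(rows, out):
    """Write anomaly rows as semicolon-separated CSV with quoted text values."""
    # The header is written unquoted, as in the sample above.
    csv.writer(out, delimiter=";", lineterminator="\n").writerow(HEADER)
    # QUOTE_NONNUMERIC quotes strings and leaves numbers bare.
    writer = csv.writer(out, delimiter=";", lineterminator="\n",
                        quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)


buffer = io.StringIO()
write_anomalies(
    [
        (123, "worktype", "Experimental Film", "unknown worktype (Artwork)"),
        (455, "creator", 380, "artist doesn't exist"),
    ],
    buffer,
)
```

After this, `buffer.getvalue()` holds the header line followed by the two sample entries shown above.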

What I still have to implement are reports/lists on missing data (only for mandatory
fields). But I don't know if it really makes sense, because in most cases more
than 50% is missing (and why should I list, for example, 800 of 1000 works without
creation date? It should be a known problem of the CP...)

Another overview I could give would be a list of booleans, like the following one:
workid;hasTitle;hasDate;hasLocation;hasDescr;hasImage;hasWebsite;hasWType;has_creator;has_curator;has_lecturer;has_contributor
387;1;1;1;0;0;0;1;0;0;1;0
54;1;1;1;0;0;0;1;0;0;1;0
389;1;1;1;0;0;0;1;0;0;1;0
338;1;1;1;0;0;0;1;0;0;1;0
.
.
.
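A sketch of how one row of such a boolean overview could be derived. The column names are the ones from the header above; the mapping to record keys and the dict-based work record are assumptions made for the example.

```python
# Column name in the overview -> (hypothetical) key in the work record.
COLUMNS = [
    ("hasTitle", "title"), ("hasDate", "date"), ("hasLocation", "location"),
    ("hasDescr", "description"), ("hasImage", "image"), ("hasWebsite", "website"),
    ("hasWType", "worktype"), ("has_creator", "creator"), ("has_curator", "curator"),
    ("has_lecturer", "lecturer"), ("has_contributor", "contributor"),
]


def header_row():
    return ";".join(["workid"] + [column for column, _ in COLUMNS])


def presence_row(workid, work):
    """Emit 1 for every property that is present and non-empty, else 0."""
    flags = ["1" if work.get(key) else "0" for _, key in COLUMNS]
    return ";".join([str(workid)] + flags)
```

A work that has only a title, date, location, worktype and lecturer comes out with the same flag pattern as the rows above.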

Original comment by u.kot...@gmail.com on 23 Oct 2009 at 10:40