Open GoogleCodeExporter opened 9 years ago
We finally came up with a list of recommended quality checks, on which the
implementation of a DB adapter quality checker can be based:
Expectations of the Portal of the GAMA data.
---------------------------------------------
Part of GAMA Quality Control.
A. Completeness of the RDF structure and its values.
---------------------------------------------------
1. Every work has a title field. This will be the displayable title by default.
The proposal is to discuss with the archives what to do with works without a title.
Maybe <no title>? Or an empty title; empty title > no title?
2. Every work has a creator. Works without a creator may be invisible for the frontend.
3. Each work has at least one main manifestation. Works without a manifestation may be
invisible for the frontend.
A manifestation doesn't have to have a video counterpart. It can be only
metadata.
4. Each work has a creation date with at least the correct year.
5. If a manifestation is the main manifestation (idx_main=true), then it will
have a stream type of 1 or 2, and shots for the film strip are available. The
shot similarity match data will also be present.
6. Every Artist has a name. This will be by default the Displayable Title.
7. Preferably, every artist has one or more works. If there are large numbers of
authors without works, it is hard to find works by author, so the frontend may
choose to filter these artists out. This filter may or may not be adjustable by
the end user. The effect is that artists without works could be completely
hidden from the end user.
8. Data produced during harmonization will be in a different graph so
the frontend can separate it from the non-harmonized data.
9. No so-called superpersons will be created during harmonization
[pending the Ciant-AGH discussion on the insertion method for harmonization
data. However, superpersons are problematic for the frontend and should be
avoided].
10. Every class URI (not pointing to a simple property) contains one of
'Work_', 'Artist_', 'Collective_', 'Manifestation_'. This keeps URI type
detection simple.
11. Statistics of completeness
-------------------------
Note that not all of these expectations are absolute. It is also about
ratios: 1 % of artists without works is not an issue, but at
25 % the frontend may have to change its display policy.
Therefore, statistical data collected on the imported data is also
very helpful for quality management.
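
The ratio idea above could be sketched roughly as follows. This is only an illustration: the record layout (a list of dicts with a `works` entry), the function names, and the default threshold of 25 % as a hard cut-off are assumptions, not part of the DB adapter.

```python
def ratio_without(records, field):
    """Return the fraction of records whose `field` is missing or empty."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if not record.get(field))
    return missing / len(records)


def needs_policy_change(records, field, threshold=0.25):
    """True when the share of incomplete records reaches the threshold."""
    return ratio_without(records, field) >= threshold


# Hypothetical sample: one of four artists has no works.
artists = [
    {"name": "A", "works": ["w1"]},
    {"name": "B", "works": ["w2", "w3"]},
    {"name": "C", "works": []},
    {"name": "D", "works": ["w4"]},
]
print(ratio_without(artists, "works"))
```

Reporting the ratio rather than a pass/fail flag leaves the decision about the display policy to the frontend.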
---------------------------------------------------------------------
I am looking forward to knowing how soon (some of) these checks can be implemented in
the database adapter, either for stats collecting or incident tracking. Here too,
completeness is not the goal for a first iteration of this quality feature.
Original comment by toon...@gmail.com
on 7 Apr 2009 at 12:49
I started the work on the first approach.
Take a look at the following service:
http://research.ciant.cz/gama/devel/GamaRepository/soa/?service=test/Test_Repository_Structure&help
Original comment by viliam.s...@gmail.com
on 20 Apr 2009 at 1:36
Request for an additional statistic to monitor in Ciant's Quality Measure
Service:
main manifs that have a worktype.
I noticed that fewer than 60 works with a main manif have a worktype. That is
very few; I would expect a much higher count. It suggests incomplete data or
import errors.
Use the query below to test:
PREFIX mysql: <http://www.mysql.com/>
PREFIX gama: <http://gama-gateway.eu/schema/>
PREFIX cache: <http://gama-gateway.eu/cache/>
SELECT distinct ?worktype count_distinct(?workuri) as ?numworkuris
WHERE {
  ?workuri cache:fulltext_works ?cache.
  FILTER (?title = gama:dbAlias(?cache, "wtitle"))
  FILTER (?description = gama:dbAlias(?cache, "wdescr"))
  FILTER (?name = gama:dbAlias(?cache, "creator"))
  ?workuri gama:work_type ?worktype.
  ?workuri gama:has_manifestation ?manifuri.
  ?manifuri gama:idx_main "1".
} group by ?worktype order by desc(?numworkuris) limit 10
Original comment by toon...@gmail.com
on 21 Apr 2009 at 8:08
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------
Counts per provider
I would like to see a count per provider, so we can see immediately that FF has
only persons imported, and that LBI has no manifestations.
For a good overview of the coverage per CP, my preference would be to display per
Content Provider:
-- # of works (and preferably split into Works, Events, Resources)
-- # of persons
-- # of manifs
-- # of manifs with idx_main=1 or stream-avail=1 or 2 ('main manifs')
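
A per-provider count like this could be sketched as below. The entity dicts, the `kind` values, and the key names `idx_main` / `stream_avail` are assumptions made for the illustration; the real service would read these from the repository.

```python
from collections import Counter, defaultdict


def counts_per_provider(entities):
    """Return {provider: Counter(kind -> count)} for a list of entity dicts."""
    table = defaultdict(Counter)
    for entity in entities:
        counts = table[entity["provider"]]
        counts[entity["kind"]] += 1
        # A 'main manif' as described above: idx_main = 1 or stream-avail 1 or 2.
        if entity["kind"] == "manifestation" and (
            entity.get("idx_main") == 1 or entity.get("stream_avail") in (1, 2)
        ):
            counts["main_manifestation"] += 1
    return table


# Hypothetical sample data, shaped after the FF and LBI cases named above.
sample = [
    {"provider": "FF", "kind": "person"},
    {"provider": "FF", "kind": "person"},
    {"provider": "LBI", "kind": "work"},
    {"provider": "C3", "kind": "work"},
    {"provider": "C3", "kind": "manifestation", "idx_main": 1},
    {"provider": "C3", "kind": "manifestation", "stream_avail": 0},
]
report = counts_per_provider(sample)
for provider in sorted(report):
    print(provider, dict(report[provider]))
```

Because a `Counter` returns 0 for a missing key, a provider with no manifestations shows up as a zero count instead of raising an error.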
Original comment by toon...@gmail.com
on 21 Apr 2009 at 8:15
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------
Counts and/or display of suspect URIs:
A suspect URI is defined here as probably malformed because it has no type in
it, e.g. none of Work: Event: Resource: Person: Manifestation: in it.
This makes recent problems with malformed subwork URIs immediately visible.
Examples are gama:ars-electronica:main:13486-253 and
gama:ars-electronica:main:14059-414, already fixed. Malformed URIs like these
may cause bugs in the frontend.
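
The suspect-URI rule above could be sketched as a simple substring test. The marker list is taken from the comment; the function names and the well-formed example URI are hypothetical.

```python
# Type markers listed in the comment above.
TYPE_MARKERS = ("Work:", "Event:", "Resource:", "Person:", "Manifestation:")


def is_suspect(uri, markers=TYPE_MARKERS):
    """A URI is suspect when it contains none of the known type markers."""
    return not any(marker in uri for marker in markers)


def suspect_uris(uris, markers=TYPE_MARKERS):
    """Return the URIs from `uris` that carry no type marker, in order."""
    return [uri for uri in uris if is_suspect(uri, markers)]


uris = [
    "gama:ars-electronica:main:13486-253",  # malformed example from the comment
    "gama:ars-electronica:Work:13486",      # hypothetical well-formed URI
]
print(suspect_uris(uris))
```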
Original comment by toon...@gmail.com
on 21 Apr 2009 at 8:20
Added "Main manifestations without WorkType" to the service
http://research.ciant.cz/gama/devel/GamaRepository/soa/?service=test/Test_Repository_Structure&help
Original comment by viliam.s...@gmail.com
on 21 Apr 2009 at 10:11
------------------------------------------------------------------------------------
Request for an additional statistic to monitor in Ciant's Quality Measure Service:
------------------------------------------------------------------------------------
Creation dates
I noticed that some works do not have a creation date. To know how big this
problem is and how to solve it, the following statistics in the statistics list
would help:
- # works/events without creation date
- # artists without creation date / life span (perhaps a smaller but similar
problem, especially for sorting and display in timelines)
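
Counting records without a usable creation date could look roughly like this. It treats a date as usable when it contains a four-digit year, in line with the "at least the correct year" expectation; the field name `created` and the date formats in the sample are assumptions.

```python
import re

# A standalone run of exactly four digits is taken to be a year.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def creation_year(value):
    """Return the first four-digit year found in `value`, or None."""
    if not value:
        return None
    match = _YEAR.search(str(value))
    return int(match.group(1)) if match else None


def count_without_year(records, field="created"):
    """Count the records whose `field` holds no recognisable year."""
    return sum(1 for record in records if creation_year(record.get(field)) is None)


# Hypothetical sample records.
works = [
    {"id": 1, "created": "1987-05-01"},
    {"id": 2, "created": "ca. 1990"},
    {"id": 3, "created": ""},
    {"id": 4},
]
print(count_without_year(works))
```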
Original comment by toon...@gmail.com
on 22 Apr 2009 at 8:00
I'll probably create a better UI for the tests.
Added:
- works/events without creation date
- artists without creation date / life span
http://research.ciant.cz/gama/devel/GamaRepository/endpoint/estimate.php#atests
Original comment by viliam.s...@gmail.com
on 22 Apr 2009 at 3:44
As discussed in the latest Tech telco, below is the mandatory/optional/unused list.
The focus here is on the mandatory list because it has an impact on website
functionality. Some of these issues have already been mentioned in this issue
(see above for details); what follows is a list of what the frontend expects to
be available.
MANDATORY DATA FOR THE FRONTEND.
1. Work
- title (if unknown, set to UNKNOWN) (assumed only 1)
- created date, at least a year (assumed only 1)
- creator (assumed only 1)
- work_type (still 900 missing!) (assumed only 1)
- has one main manifestation (manif.idx_main=1)
- description (if unknown, set to UNKNOWN) (one or more, lang=X)
- archive (one)
2. Manifestation
- should have a similarity match calculated with all other main manifs
3. Artist
- has a name
- life_span, at least a year of birth
- has one country (person_country)
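
A check against the mandatory list could be sketched as follows. The property names come from the list above, but the dict-based record layout, the key spellings (`created`, `archive`, and so on), and the function name are assumptions; the main-manifestation and similarity-match requirements are not covered by this sketch.

```python
# Mandatory properties per entity type, following the list above.
MANDATORY = {
    "work": ("title", "created", "creator", "work_type", "description", "archive"),
    "artist": ("name", "life_span", "person_country"),
}


def missing_fields(kind, record):
    """Return the mandatory fields that are absent or empty in `record`."""
    return [field for field in MANDATORY[kind] if not record.get(field)]


# Hypothetical records; "UNKNOWN" counts as present, as the list allows it.
work = {
    "title": "UNKNOWN",
    "created": "1999",
    "creator": "artist-1",
    "description": "UNKNOWN",
    "archive": "C3",
}
artist = {"name": "Some Artist"}
print(missing_fields("work", work))
print(missing_fields("artist", artist))
```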
@Ursula, Villiam
If you feel this presents problems for the DB adapters, the CPs, or both, please
provide a plan or workaround to overcome this.
Because this list defines certain properties that are currently still missing,
the type of this issue is now set to 'Defect'.
Only when all the MANDATORY issues are resolved does it make sense to look at
the optional / not used items. These will be added here.
Original comment by toon...@gmail.com
on 18 Jun 2009 at 1:29
Correction: the number of missing work_type values is 10 times larger,
so 9000 (!).
This is currently the biggest problem that *has* to be fixed, otherwise these
9000 works will never be found by the standard queries, which all contain a
work_type filter (see issue 37). This makes the missing work_types a
high-priority issue.
Original comment by toon...@gmail.com
on 18 Jun 2009 at 2:16
It was agreed in the last technical TelCo that partners should react to issues
within 3 days, which is over now ... As this issue seems to be quite urgent,
could someone from HfG (Ursula, Juergen ...) please comment on this and provide
a plan for when and how this will be fixed?
Original comment by alu...@gmail.com
on 22 Jun 2009 at 8:52
Reporting currently in place:
1. Ursula currently reports on import statistics by email and on the wiki.
This will be automated, but that is not relevant for the September release.
2. Villiam & Co will make a new page on the wiki with all RDF statistics per CP
combined in a central page. This will replace the current endpoint statistics.
Original comment by toon...@gmail.com
on 17 Aug 2009 at 10:08
I would like to suggest that the report generated by the DB adapters be included
in the Update/dba directory on our GamaSync FTP site, so that it describes the
XML files there. When the repository is rebuilt, a backup is created, and it
would be nice to have the report on the XML files together with the XML files.
Original comment by viliam.s...@gmail.com
on 29 Sep 2009 at 1:22
I'll support the following reports:
_overview.csv
Gives some overview about given data.
Example (from C3):
"Number of artists";450
"Number of persons (gama:Person)";403
"Number of collectives (gama:Collective)";47
"Artists without name (set to ""UNKNOWN"")";0
"Artists without life_span";240
"Artists without person_country";0
"Artists without biography";260
"Artists without person_image";450
"Artists without person_url";306
_anomalies.csv (for content partners)
All anomalies (=invalid data) will be listed there.
Sample entries (from C3):
workid;property;value;description
123;"worktype";"Experimental Film";"unknown worktype (Artwork)"
455;"creator";380;"artist doesn't exist"
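
For illustration, an anomalies file in the format shown above could be produced with Python's standard `csv` module as sketched below. The function name and the in-memory output are assumptions; the two rows are the sample entries from this comment.

```python
import csv
import io


def write_anomalies(rows, out):
    """Write anomaly rows as semicolon-separated values with quoted strings."""
    # The header is written unquoted, as in the sample above.
    out.write("workid;property;value;description\n")
    writer = csv.writer(
        out,
        delimiter=";",
        quoting=csv.QUOTE_NONNUMERIC,  # quote strings, leave numbers bare
        lineterminator="\n",
    )
    writer.writerows(rows)


rows = [
    (123, "worktype", "Experimental Film", "unknown worktype (Artwork)"),
    (455, "creator", 380, "artist doesn't exist"),
]
buffer = io.StringIO()
write_anomalies(rows, buffer)
print(buffer.getvalue())
```

The writer's default quote doubling also yields the `""UNKNOWN""` style seen in the overview sample.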
What I still have to implement are reports/lists on missing data (only for
mandatory fields). But I don't know if that really makes sense, because in most
cases more than 50% is missing (and why should I list, for example, 800 of 1000
works without a creation date? That should be a known problem of the CP).
Another overview I could give would be a list of booleans, like the following
one:
workid;hasTitle;hasDate;hasLocation;hasDescr;hasImage;hasWebsite;hasWType;has_creator;has_curator;has_lecturer;has_contributor
387;1;1;1;0;0;0;1;0;0;1;0
54;1;1;1;0;0;0;1;0;0;1;0
389;1;1;1;0;0;0;1;0;0;1;0
338;1;1;1;0;0;0;1;0;0;1;0
.
.
.
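
Such a boolean overview could be generated along these lines. The column names are the ones from the header above; the mapping from each column to a record key, and the record layout itself, are assumptions for the sketch.

```python
# Column name (from the header above) -> assumed key in the work record.
FLAG_FIELDS = {
    "hasTitle": "title",
    "hasDate": "created",
    "hasLocation": "location",
    "hasDescr": "description",
    "hasImage": "image",
    "hasWebsite": "website",
    "hasWType": "work_type",
    "has_creator": "creator",
    "has_curator": "curator",
    "has_lecturer": "lecturer",
    "has_contributor": "contributor",
}


def header_row():
    """Return the semicolon-separated header line."""
    return ";".join(["workid", *FLAG_FIELDS])


def flag_row(workid, record):
    """Return one line with 1/0 per column, depending on field presence."""
    flags = [str(int(bool(record.get(key)))) for key in FLAG_FIELDS.values()]
    return ";".join([str(workid), *flags])


# Hypothetical record that reproduces the first sample line above.
record = {
    "title": "Some title",
    "created": "1999",
    "location": "Karlsruhe",
    "work_type": "Lecture",
    "lecturer": "artist-7",
}
print(header_row())
print(flag_row(387, record))
```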
Original comment by u.kot...@gmail.com
on 23 Oct 2009 at 10:40
Original issue reported on code.google.com by
toon...@gmail.com
on 12 Mar 2009 at 9:47