neontribe / Linked_Development


Duplicates in get documents sparql #42

Closed: tobybatch closed this issue 11 years ago

tobybatch commented 11 years ago

Not sure if this is one for @practicalparticipation or @neil-dabson.

This:

select distinct ?article ?also ?dcidentifier ?dctype ?dctitle ?dcdate ?dcabstract ?dccreator ?dccoverage ?dcpublisher ?dclanguage ?theme
                    where {
                        ?article a <http://purl.org/ontology/bibo/Article> .
                        ?article <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?also .
                        ?article <http://purl.org/dc/terms/identifier> ?dcidentifier .
                        ?article <http://purl.org/dc/terms/type> ?dctype .
                        ?article <http://purl.org/dc/terms/title> ?dctitle  .
                        ?article <http://purl.org/dc/terms/date> ?dcdate .
                        ?article <http://purl.org/dc/terms/abstract> ?dcabstract .
                        ?article <http://purl.org/dc/terms/creator> ?dccreator .
                        ?article <http://purl.org/dc/terms/coverage> ?dccoverage .
                        ?article <http://purl.org/dc/terms/publisher> ?dcpublisher .
                        ?article <http://purl.org/dc/terms/language> ?dclanguage .
                        ?article <http://purl.org/dc/terms/subject> ?theme .
                        } limit 10

which is the same as this request to the endpoint:

http://ld.neontribe.org/sparql?default-graph-uri=&query=select+distinct+%3Farticle+%3Falso+%3Fdcidentifier+%3Fdctype+%3Fdctitle+%3Fdcdate+%3Fdcabstract+%3Fdccreator+%3Fdccoverage+%3Fdcpublisher++%3Fdclanguage+%3Ftheme++where+%7B+++++%3Farticle+a+%3Chttp%3A%2F%2Fpurl.org%2Fontology%2Fbibo%2FArticle%3E+.+++++%3Farticle+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23seeAlso%3E+%3Falso+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fidentifier%3E+%3Fdcidentifier+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Ftype%3E+%3Fdctype+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Ftitle%3E+%3Fdctitle++.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fdate%3E+%3Fdcdate+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fabstract%3E+%3Fdcabstract+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fcreator%3E+%3Fdccreator+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fcoverage%3E+%3Fdccoverage+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fpublisher%3E+%3Fdcpublisher+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Flanguage%3E+%3Fdclanguage+.+++++%3Farticle+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2Fsubject%3E+%3Ftheme+.+%7D+%0D%0A+limit+10&format=text%2Fhtml&debug=on&timeout=

produces a lot of duplicates: the same article comes back in many rows.

practicalparticipation commented 11 years ago

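This is because an article can have multiple authors (and multiple values for other fields). Any time a single field differs, we've got a different distinct solution, so each article comes back once per combination of its values.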
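
One way to check this against the data is to count the values per article. A minimal sketch, restricted to creators and subjects, and assuming the endpoint supports SPARQL 1.1 aggregates (the variable names ?creators and ?themes are illustrative):

```sparql
# Count how many creators and themes each article has.
SELECT ?article
       (COUNT(DISTINCT ?dccreator) AS ?creators)
       (COUNT(DISTINCT ?theme) AS ?themes)
WHERE {
  ?article a <http://purl.org/ontology/bibo/Article> .
  ?article <http://purl.org/dc/terms/creator> ?dccreator .
  ?article <http://purl.org/dc/terms/subject> ?theme .
}
GROUP BY ?article
LIMIT 10
```

An article with 3 creators and 4 themes matches the original pattern in at least 3 × 4 = 12 ways, and select distinct keeps all 12 rows because no two of them are identical.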

In these cases a CONSTRUCT query returns a graph, which deals with this duplication because each triple appears in the graph only once:

construct {
                        ?article a <http://purl.org/ontology/bibo/Article> .
                        ?article <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?also .
                        ?article <http://purl.org/dc/terms/identifier> ?dcidentifier .
                        ?article <http://purl.org/dc/terms/type> ?dctype .
                        ?article <http://purl.org/dc/terms/title> ?dctitle  .
                        ?article <http://purl.org/dc/terms/date> ?dcdate .
                        ?article <http://purl.org/dc/terms/abstract> ?dcabstract .
                        ?article <http://purl.org/dc/terms/creator> ?dccreator .
                        ?article <http://purl.org/dc/terms/coverage> ?dccoverage .
                        ?article <http://purl.org/dc/terms/publisher> ?dcpublisher .
                        ?article <http://purl.org/dc/terms/language> ?dclanguage .
                        ?article <http://purl.org/dc/terms/subject> ?theme .
}        where {
                        ?article a <http://purl.org/ontology/bibo/Article> .
                        ?article <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?also .
                        ?article <http://purl.org/dc/terms/identifier> ?dcidentifier .
                        ?article <http://purl.org/dc/terms/type> ?dctype .
                        ?article <http://purl.org/dc/terms/title> ?dctitle  .
                        ?article <http://purl.org/dc/terms/date> ?dcdate .
                        ?article <http://purl.org/dc/terms/abstract> ?dcabstract .
                        ?article <http://purl.org/dc/terms/creator> ?dccreator .
                        ?article <http://purl.org/dc/terms/coverage> ?dccoverage .
                        ?article <http://purl.org/dc/terms/publisher> ?dcpublisher .
                        ?article <http://purl.org/dc/terms/language> ?dclanguage .
                        ?article <http://purl.org/dc/terms/subject> ?theme .
                        } limit 10

Note that the limit here applies to solutions of the WHERE clause, not to articles. It does not get us 10 articles but 10 sets of facts about articles, so we might end up missing authors.

ToDo: Need to find the right way to page through results from this sort of query...

practicalparticipation commented 11 years ago

It seems a nested query, replacing the first line of the WHERE clause above with


  { SELECT ?article {
      ?article a <http://purl.org/ontology/bibo/Article> .
    }
    LIMIT 10
  }

sort of works. Although:

So -

But -

   FILTER(?article = <http://linked-development.org/eldis/output/A63724/> || ?article = <http://linked-development.org/eldis/output/A63619/>)

This select-then-construct approach is the one taken by the Puelia linked data API in place at http://education.data.gov.uk/doc/school (see the queries at the bottom of the page for how it fetches a list of items and then queries for their details).
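
A sketch of that two-step pattern against this dataset, assuming the client runs the first query, collects the URIs, and substitutes them into the second. The two are separate queries, shown in one block for convenience. The article URIs are the ones from the FILTER above, the property list is cut down to two for brevity, and the ORDER BY is an assumption added to keep the page of results stable.

```sparql
# Step 1: fetch one page of article URIs.
SELECT ?article WHERE {
  ?article a <http://purl.org/ontology/bibo/Article> .
}
ORDER BY ?article
LIMIT 10

# Step 2 (run as a separate query): describe only the articles from step 1.
CONSTRUCT {
  ?article <http://purl.org/dc/terms/title> ?dctitle .
  ?article <http://purl.org/dc/terms/creator> ?dccreator .
}
WHERE {
  ?article <http://purl.org/dc/terms/title> ?dctitle .
  ?article <http://purl.org/dc/terms/creator> ?dccreator .
  FILTER(?article = <http://linked-development.org/eldis/output/A63724/> ||
         ?article = <http://linked-development.org/eldis/output/A63619/>)
}
```

Because the second query has no LIMIT, every creator of the listed articles is returned, not just the first few rows.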

practicalparticipation commented 11 years ago

The nested query approach does seem to be the best to take. I'm not seeing the same performance issues on Virtuoso 6.1.6 for this.
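
Written out in full, the nested form might look like the sketch below. It is abbreviated to three of the twelve properties to keep it short. The PREFIX declarations, ORDER BY and OFFSET are additions that do not appear in the queries above; they are included on the assumption that a stable ordering is wanted for paging through articles ten at a time.

```sparql
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {
  ?article a bibo:Article .
  ?article dcterms:title ?dctitle .
  ?article dcterms:creator ?dccreator .
  ?article dcterms:subject ?theme .
}
WHERE {
  # Inner query: pick the page of articles first, so LIMIT counts
  # articles rather than rows of the joined result.
  {
    SELECT ?article WHERE {
      ?article a bibo:Article .
    }
    ORDER BY ?article
    LIMIT 10
    OFFSET 0
  }
  # Outer patterns: fetch every matching value for just those articles.
  ?article dcterms:title ?dctitle .
  ?article dcterms:creator ?dccreator .
  ?article dcterms:subject ?theme .
}
```

Raising OFFSET by 10 on each request would step through the articles page by page. An article in the inner page that lacks one of the outer properties drops out of the result, so a page can still describe fewer than ten articles; wrapping the outer patterns in OPTIONAL would be one way to avoid that.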