Open timrdf opened 10 years ago
The cron run currently takes 1.5 hours: cr-retrieve.sh takes 1:15, cr-full-dump takes 10 minutes, and cr-linksets takes about 4 minutes (3:33 in the log below).
cron-2014-Jan-28_22_57.log:
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#git-pull>
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#cr-mirror-ckan>
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#cr-retrieve>
"2014-01-29T00:12:38+00:00"^^xsd:dateTime <#cr-publish>
"2014-01-29T00:14:16+00:00"^^xsd:dateTime <#cr-full-dump>
"2014-01-29T00:24:39+00:00"^^xsd:dateTime <#cr-linksets>
"2014-01-29T00:28:12+00:00"^^xsd:dateTime <#cr-pingback>
https://github.com/timrdf/csv2rdf4lod-automation/issues/313 could be revived to get the retrieval TIC PROV.
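The per-step timings quoted above can be recomputed from the start times in the cron log. The following is a minimal sketch, not part of csv2rdf4lod-automation: `step_durations` is a hypothetical helper, and it assumes the steps run sequentially, so that each step ends when the next one starts. The last step (`cr-pingback`) has no successor in the log and is therefore left out.

```python
import re
from datetime import datetime

# Matches log lines such as: "2014-01-28T22:57:01+00:00"^^xsd:dateTime <#cr-retrieve>
LINE = re.compile(r'"([^"]+)"\^\^xsd:dateTime\s+<#([^>]+)>')

# Start times copied from cron-2014-Jan-28_22_57.log.
LOG = """
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#git-pull>
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#cr-mirror-ckan>
"2014-01-28T22:57:01+00:00"^^xsd:dateTime <#cr-retrieve>
"2014-01-29T00:12:38+00:00"^^xsd:dateTime <#cr-publish>
"2014-01-29T00:14:16+00:00"^^xsd:dateTime <#cr-full-dump>
"2014-01-29T00:24:39+00:00"^^xsd:dateTime <#cr-linksets>
"2014-01-29T00:28:12+00:00"^^xsd:dateTime <#cr-pingback>
"""


def step_durations(log_text):
    """Return [(step, timedelta)], taking each step's end as the next step's start."""
    events = [(name, datetime.fromisoformat(ts)) for ts, name in LINE.findall(log_text)]
    return [(name, nxt - start) for (name, start), (_, nxt) in zip(events, events[1:])]


for step, duration in step_durations(LOG):
    print(f"{step:16} {duration}")
```

Under that sequential assumption, `cr-retrieve` comes out at 1:15:37, which matches the 1:15 figure reported above.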
Long again, about 2.5 hours this time:
BEGIN cron ps --user prizms Fri Jan 31 16:21:01 UTC 2014
END cron Fri Jan 31 18:56:13 UTC 2014
tic's PROV shows that /retrieval/us/pr-spobal-ng is the culprit: it alone ran from 18:59:34 to 21:36:04.
cr-latest-logs.sh | xargs tic.sh
@base <5aa98d9812f3ae4adce9fde3183fbb4d/doc/logs/cron-2014-Jan-31_18_58.log> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
...
<#cr-full-dump>
a prov:Activity ;
prov:startedAtTime "2014-01-31T21:36:34+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cr-linksets>
a prov:Activity ;
prov:startedAtTime "2014-01-31T21:48:47+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cr-mirror-ckan>
a prov:Activity ;
prov:startedAtTime "2014-01-31T18:58:02+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cr-pingback>
a prov:Activity ;
prov:startedAtTime "2014-01-31T21:52:22+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cr-publish>
a prov:Activity ;
prov:startedAtTime "2014-01-31T21:36:11+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cr-retrieve>
a prov:Activity ;
prov:startedAtTime "2014-01-31T18:58:02+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<#cron>
sio:software-process-identifier "20596" ;
a prov:Activity ;
prov:endedAtTime "2014-01-31T21:52:22+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:01+00:00"^^xsd:dateTime .
<#git-pull>
a prov:Activity ;
prov:startedAtTime "2014-01-31T18:58:01+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cron> .
<../../../retrieval/opendap-org/opendap/svn>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:58:05+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:05+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/opendap-org/statsvn/2013-Dec-22>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:58:07+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:07+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/cr-isdefinedby>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:59:18+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:54+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/opendap-prov>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:58:27+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:26+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/pr-aggregate-pingbacks>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:58:51+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:32+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/pr-spobal-ng>
a prov:Activity ;
prov:endedAtTime "2014-01-31T21:36:04+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:59:34+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
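To pick the slow activity out of output like this without reading every timestamp, the start and end times can be paired per subject and sorted by duration. This is a rough sketch under stated assumptions: `activity_durations` is a hypothetical helper, it relies on the line-oriented layout shown above (a subject on its own line, then one predicate per line) rather than on a real Turtle parser, and it skips activities that lack either timestamp. The sample holds three of the retrieval activities from the output above.

```python
import re
from datetime import datetime

SUBJECT = re.compile(r'^<([^>]*)>$')
TIME = re.compile(r'prov:(startedAtTime|endedAtTime)\s+"([^"]+)"\^\^xsd:dateTime')

# Three activities copied from the tic.sh output above.
SAMPLE = """
<../../../retrieval/us/cr-isdefinedby>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:59:18+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:54+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/pr-aggregate-pingbacks>
a prov:Activity ;
prov:endedAtTime "2014-01-31T18:58:51+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:58:32+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
<../../../retrieval/us/pr-spobal-ng>
a prov:Activity ;
prov:endedAtTime "2014-01-31T21:36:04+00:00"^^xsd:dateTime ;
prov:startedAtTime "2014-01-31T18:59:34+00:00"^^xsd:dateTime ;
prov:wasInformedBy <#cr-retrieve> .
"""


def activity_durations(turtle_text):
    """Return [(subject, timedelta)] sorted longest first."""
    times = {}
    subject = None
    for raw in turtle_text.splitlines():
        line = raw.strip()
        subject_match = SUBJECT.match(line)
        if subject_match:
            subject = subject_match.group(1)  # a new subject block begins
            continue
        time_match = TIME.search(line)
        if time_match and subject is not None:
            key, stamp = time_match.groups()
            times.setdefault(subject, {})[key] = datetime.fromisoformat(stamp)
    durations = {
        s: t["endedAtTime"] - t["startedAtTime"]
        for s, t in times.items()
        if "startedAtTime" in t and "endedAtTime" in t
    }
    return sorted(durations.items(), key=lambda item: item[1], reverse=True)


for subject, duration in activity_durations(SAMPLE):
    print(duration, subject)
```

On the sample, `pr-spobal-ng` sorts first, well ahead of the other two, which each finish in under half a minute.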
A bug in csv2rdf4lod's NameFactory.java was returning the source-id of the SPARQL endpoint instead of the (ugly) URI of the named graph. That is fixed, and the 132 named graphs that we had fallen behind on are now being summarized.
The entire cron run is down to 20 minutes; pr-spobal-ng accounts for 5 of them.
How to tighten the response? https://github.com/tetherless-world/opendap/wiki/Use-case:-mockup-tracer#wiki-processing-data-from-opendap-using-http
Unfortunately, there will be a delay between the time that OPeNDAP reports the "has_provenance" and "pingback" URLs and the time that they are available for request. This is because Prizms uses cron and is not event based. As a stopgap, we'll try to tighten up Prizms' cron so that we can rerun it more often than its current nightly schedule. We'd be happy to hear any suggestions you may have for how to address this current technological limitation.
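One thing to watch when scheduling cron more often than nightly: a slow run (such as the multi-hour ones above) could still be going when the next one fires. The sketch below shows one common guard, a non-blocking file lock, so that an overlapping run skips instead of piling up. It is only an illustration and not something Prizms is known to do: `single_run` and the lock path are hypothetical, and `fcntl` makes it Unix-only.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_run(lock_path):
    """Yield True if this run obtained the lock, False if another run holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor releases the lock if we held it


if __name__ == "__main__":
    # Hypothetical lock location; any path writable by the cron user would do.
    lock_path = os.path.join(tempfile.gettempdir(), "prizms-cron.lock")
    with single_run(lock_path) as acquired:
        if acquired:
            print("no other run in progress; the cron steps would go here")
        else:
            print("previous run still in progress; skipping this one")
```

With a guard like this, the schedule interval can be shorter than the worst-case run time without two runs working on the same data at once.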