These queries can be managed with a jobs dataset: LDH users can create and schedule jobs from the UI, and the rdf.uploader will inspect the jobs queue, execute each job, and update its status. See spice-h2020/rdf.uploader#4
Job content will be CONSTRUCT queries, targeting a named graph (one which is not already in use by the default mirroring procedure).
Content of the Jobs JSON Document:
I'm building this under the SPARQL dataset feature tab on the left, so we'll have multiple top tabs under this feature for, say, 'Query', 'Construct graph', 'Job management' or similar. As such, perhaps it's an idea to rename the left feature tab from SPARQL to RDF?
Here's a sample job entry. I've included "job-type": "CONSTRUCT", as we may also want to use this facility for other RDF-related jobs too, such as rebuilding the entire RDF replica of the dataset, e.g. "job-type": "REBUILD". Is there anything else we might need in here?
{
  "_id": "23kl-321jk-gowqm8",
  "dataset": "datasetid",
  "job-type": "CONSTRUCT",
  "query": "CONSTRUCT GRAPH query here...",
  "target-namespace": "spice_datasetid",
  "target-named-graph": "some-graph-name",
  "status": "PENDING",
  "message": "some message 1",
  "history": [
    { "message": "some message 2", "timestamp": 893478298213 },
    { "message": "some message 3", "timestamp": 89347498211 },
    { "message": "some message 4", "timestamp": 893478299287 }
  ],
  "scheduled": 893478298213,
  "submitted-by": "jason.carvalho@open.ac.uk",
  "modified-by": "jason.carvalho@open.ac.uk",
  "_timestamp": 893478298213
}
We can look into cron-like schedules shortly. In the meantime, "scheduled" is a timestamp that indicates when the job should be run. It is currently set to 'now' (the timestamp at which the form was submitted), but we can add a date/time picker there too if required.
The message and message history can be managed by the RDF Uploader, so I'll leave that to be filled in as necessary.
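To make the round trip concrete, here's a rough sketch of how I imagine the rdf.uploader side handling a CONSTRUCT job: run the query against the dataset's Blazegraph namespace and write the result into the target named graph. This is not the actual rdf.uploader code; the endpoint URL, graph IRI and query are placeholders, and I'm assuming the standard SPARQL 1.1 protocol endpoints that Blazegraph exposes per namespace.

```python
import requests

# Illustrative values only: the real endpoint, namespace and query come from
# the rdf.uploader configuration and the job document, not from this sketch.
BLAZEGRAPH = "http://localhost:9999/blazegraph"
NAMESPACE = "spice_datasetid"
ENDPOINT = f"{BLAZEGRAPH}/namespace/{NAMESPACE}/sparql"

job = {
    "job-type": "CONSTRUCT",
    "query": "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 10",        # placeholder query
    "target-named-graph": "http://example.org/graph/some-graph-name",     # full graph IRI assumed
}

# 1. Run the CONSTRUCT query via the SPARQL 1.1 protocol, asking for N-Triples back.
response = requests.post(
    ENDPOINT,
    data={"query": job["query"]},
    headers={"Accept": "application/n-triples"},
)
response.raise_for_status()
triples = response.text

# 2. Write the constructed triples into the target named graph with SPARQL UPDATE.
update = "INSERT DATA { GRAPH <%s> { %s } }" % (job["target-named-graph"], triples)
requests.post(ENDPOINT, data={"update": update}).raise_for_status()
```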
@luigi-asprino I've a working interface for submitting CONSTRUCT jobs and then also listing the status of existing jobs. This is generating JSON job descriptions as per the example above. However, I have a question about the 'target-namespace' attribute that I suggested above.
Currently the RDF namespace is configured in two places. The backend API has config that, for example, tells it to use 'spice_' and the RdfUploader also has similar config. These two pieces of software must align in this respect. Since the interface I'm building at the moment is part of the front-end LDH, it doesn't have access to this config. I'd therefore have to have this RDF namespace configured in a third place, which doesn't seem ideal. Since the RdfUploader is already able to infer the correct RDF namespace from the dataset ID, would it be sufficient for the LDH interface to simply pass back the dataset ID? The RDFUploader/job-processor should be able to work out the RDF namespace from this in the same way it does when doing JSON->RDF replication. I can leave the attribute (blank, to start with) in the JSON job description if necessary and it can get populated by your system at the appropriate time, if needed.
Which values should we expect the status attribute above to take? I am currently developing the interface with all new jobs having a status of PENDING, but I will need to know what other values might appear here so I can have the job list interface respond and render appropriately.
SPARQL validation: I can perform some basic SPARQL validation and also query type classification before committing new jobs to the job store. Am I right in assuming that all new jobs submitted (for now) should consist of CONSTRUCT SPARQL queries only?
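For illustration, this is the kind of check I have in mind, sketched in Python with rdflib rather than the PHP the LDH actually uses, and assuming the 'CONSTRUCT only' policy from the question above (the rdflib attribute names are as I recall them):

```python
from rdflib.plugins.sparql import prepareQuery

def classify_query(query_string: str) -> str:
    """Parse the query and return rdflib's name for its type,
    e.g. 'ConstructQuery', 'SelectQuery', 'AskQuery', 'DescribeQuery'.
    Raises an exception if the string is not valid SPARQL."""
    return prepareQuery(query_string).algebra.name

def validate_job_query(query_string: str) -> None:
    """Reject anything that is not a syntactically valid CONSTRUCT query."""
    if classify_query(query_string) != "ConstructQuery":
        raise ValueError("Only CONSTRUCT queries are accepted for RDF jobs (for now)")

# Passes:
validate_job_query("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }")
# Raises ValueError:
# validate_job_query("SELECT * WHERE { ?s ?p ?o }")
```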
I'm building this under the SPARQL dataset feature tab on the left, so we'll have multiple top tabs under this feature for, say, 'Query', 'Construct graph', 'Job management' or similar. As such, perhaps it's an idea to rename the left feature tab from SPARQL to RDF?
I think SPARQL is fine. It is a sort of SPARQL interface to a dataset that allows you to query and update it.
Here's a sample job entry. I've included "job-type": "CONSTRUCT", as we may also want to use this facility for other RDF-related jobs too, such as rebuilding the entire RDF replica of the dataset, e.g. "job-type": "REBUILD". Is there anything else we might need in here?
If I'm not mistaken, we introduced REBUILD for rebuilding a document from scratch (clearing the graph, triplifying it again and uploading it to the namespace). Maybe the job entry should be slightly different (you don't need a query for that; the target namespace and graph are deduced from the dataset and document IDs).
{
  "_id": "23kl-321jk-gowqm8",
  "dataset": "datasetid",
  "job-type": "REBUILD",
  "document-id": "42",
  "status": "PENDING",
  "message": "some message 1",
  "history": [
    { "message": "some message 2", "timestamp": 893478298213 },
    { "message": "some message 3", "timestamp": 89347498211 },
    { "message": "some message 4", "timestamp": 893478299287 }
  ],
  "scheduled": 893478298213,
  "submitted-by": "jason.carvalho@open.ac.uk",
  "modified-by": "jason.carvalho@open.ac.uk",
  "_timestamp": 893478298213
}
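To spell out what a REBUILD job would do, here is a rough sketch only, not the actual uploader code: triplify() stands in for the JSON->RDF conversion the uploader already performs, and the graph IRI scheme is just an example.

```python
import requests

def triplify(dataset_id: str, document_id: str) -> str:
    """Placeholder for the uploader's existing JSON->RDF conversion;
    should return the document's triples as N-Triples."""
    raise NotImplementedError

def rebuild_document(endpoint: str, dataset_id: str, document_id: str) -> None:
    """Clear the document's graph, triplify the document again, upload it."""
    graph_iri = f"http://example.org/{dataset_id}/{document_id}"  # example scheme only

    # 1. Clear the named graph that mirrors this document.
    requests.post(endpoint, data={"update": f"DROP SILENT GRAPH <{graph_iri}>"}).raise_for_status()

    # 2. Re-triplify the JSON document.
    ntriples = triplify(dataset_id, document_id)

    # 3. Upload the fresh triples into the same graph.
    update = "INSERT DATA { GRAPH <%s> { %s } }" % (graph_iri, ntriples)
    requests.post(endpoint, data={"update": update}).raise_for_status()
```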
@luigi-asprino I've a working interface for submitting CONSTRUCT jobs and then also listing the status of existing jobs. This is generating JSON job descriptions as per the example above. However, I have a question about the 'target-namespace' attribute that I suggested above.
Currently the RDF namespace is configured in two places. The backend API has config that, for example, tells it to use 'spice_' and the RdfUploader also has similar config. These two pieces of software must align in this respect. Since the interface I'm building at the moment is part of the front-end LDH, it doesn't have access to this config. I'd therefore have to have this RDF namespace configured in a third place, which doesn't seem ideal. Since the RdfUploader is already able to infer the correct RDF namespace from the dataset ID, would it be sufficient for the LDH interface to simply pass back the dataset ID? The RDFUploader/job-processor should be able to work out the RDF namespace from this in the same way it does when doing JSON->RDF replication. I can leave the attribute (blank, to start with) in the JSON job description if necessary and it can get populated by your system at the appropriate time, if needed.
If I got your point, we can maybe let the front end communicate with the RDFUploader only in terms of dataset IDs (UUIDs) and let the RDFUploader add a prefix to the dataset ID in order to mint a Blazegraph namespace.
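Something as simple as this on the uploader side, for example (the 'spice_' prefix and the endpoint layout are only illustrations of the idea):

```python
def mint_namespace(dataset_id: str, prefix: str = "spice_") -> str:
    """Derive the Blazegraph namespace from the dataset UUID."""
    return prefix + dataset_id

def sparql_endpoint(dataset_id: str,
                    base: str = "http://localhost:9999/blazegraph") -> str:
    """SPARQL endpoint of the namespace that mirrors this dataset."""
    return f"{base}/namespace/{mint_namespace(dataset_id)}/sparql"

# e.g. sparql_endpoint("datasetid")
# -> "http://localhost:9999/blazegraph/namespace/spice_datasetid/sparql"
```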
Which values should we expect the status attribute above to take? I am currently developing the interface with all new jobs having a status of PENDING, but I will need to know what other values might appear here so I can have the job list interface respond and render appropriately.
COMPLETED and ERROR?
SPARQL validation: I can perform some basic SPARQL validation and also query type classification before committing new jobs to the job store. Am I right in assuming that all new jobs submitted (for now) should consist of CONSTRUCT SPARQL queries only?
I think so.
The UI expects to see values of:
PENDING
PROCESSING
ERROR
COMPLETE
I have added PROCESSING to the list since there are some jobs, such as rebuilding an entire namespace, that could take a long time to run. I guess the process here would be to first update the status of the job document to PROCESSING before commencing the work, then update it to COMPLETE afterward.
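In pseudocode terms, the processing loop I'm imagining looks roughly like this (a sketch only; update_status and run stand in for whatever the rdf.uploader already uses to write job documents back and to do the actual work):

```python
from typing import Callable

def process_job(job: dict,
                update_status: Callable[[str, str, str], None],
                run: Callable[[dict], None]) -> None:
    """Move one job through PENDING -> PROCESSING -> COMPLETE / ERROR.

    update_status(job_id, status, message) writes back to the job document;
    run(job) performs the actual work (CONSTRUCT, rebuild, ...).
    """
    update_status(job["_id"], "PROCESSING", "Job started")
    try:
        run(job)
        update_status(job["_id"], "COMPLETE", "Job finished")
    except Exception as exc:
        update_status(job["_id"], "ERROR", str(exc))
```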
Job types submitted from the UI will be one of the following:
CONSTRUCT
REBUILDGRAPH
REBUILDNAMESPACE
Note that, for REBUILDGRAPH jobs, the document ID/target graph will be populated in the same target-named-graph attribute as used for new graphs.
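For clarity, this is how the job description would need to be read per job type, as a sketch only (the reuse of target-named-graph for REBUILDGRAPH jobs is the point made just above):

```python
from typing import Optional

def resolve_target(job: dict) -> Optional[str]:
    """Return what the job operates on, depending on its job-type."""
    job_type = job["job-type"]
    if job_type == "CONSTRUCT":
        return job["target-named-graph"]   # the named graph to populate
    if job_type == "REBUILDGRAPH":
        return job["target-named-graph"]   # reused to carry the document ID
    if job_type == "REBUILDNAMESPACE":
        return None                        # whole namespace: no single target
    raise ValueError(f"Unknown job-type: {job_type}")
```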
The UI functionality should now be in a state that is ready to test with back-end rdf-uploader batch processing of RDF jobs. I may still make some small UI updates but these will be non-breaking changes that are mainly concerned with the layout.
As well as the latest version of the mkdf/mkdf-sparql module (currently v0.10.1), the changes also include updates to the mkdf/mkdf-stream module and also the mkdf/mkdf-core module. A 'composer update' within the SPICE LDH installation should take care of fetching all of these updated modules.
There is also a small change to the LDH which will need to be pulled from this repository. No changes to the code, but an addition to the config file to specify the name of the rdf-jobs dataset and key to be used. These do not need to be created in advance - they will be created on first use. Note that the repository specifies these in 'config/autoload/local.php.dist' but they will need to be copied into 'config/autoload/local.php' - this latter file is not version controlled here for obvious security reasons.
v0.10.2 of mkdf/mkdf-sparql released. Changes as follows:
New document-id job attribute.
New clear-graph job attribute, which will be set to true or false. This attribute is also displayed in the job details page, where it exists.
The job list displays the target-graph attribute. For REBUILDGRAPH jobs it will display the document-id attribute. For REBUILDDATASET jobs, it simply displays the string 'entire dataset'.
Issue https://github.com/spice-h2020/rdf.uploader/issues/4 has the following effect on the LDH API and UI: