spice-h2020 / linked-data-hub


Managing custom named graphs on the RDF mirror #29

Closed: enridaga closed this issue 2 years ago

enridaga commented 2 years ago

Issue https://github.com/spice-h2020/rdf.uploader/issues/4 has the following effect on the LDH API and UI:

enridaga commented 2 years ago

These queries can be managed with a jobs dataset: LDH users can create and schedule Jobs from the UI, and the rdf.uploader will inspect the jobs queue, execute each job, and update its status. See spice-h2020/rdf.uploader#4

Job content will be CONSTRUCT queries targeting a named graph (one not already in use by the default mirroring procedure).
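
For illustration, the query part of such a job could be an ordinary CONSTRUCT query along these lines (a hedged sketch only: the prefix and pattern are made up, and the assumption is that the rdf.uploader writes the resulting triples into the named graph specified alongside the job, rather than the query naming the target graph itself):

PREFIX dct: <http://purl.org/dc/terms/>
CONSTRUCT { ?doc dct:creator ?creator }
WHERE {
  GRAPH ?g { ?doc dct:creator ?creator }
}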

enridaga commented 2 years ago

Content of the Jobs JSON Document:

JaseMK commented 2 years ago

I'm building this under the SPARQL dataset feature tab on the left, so we'll have multiple top tabs under this feature for, say, 'Query', 'Construct graph', 'Job management' or similar. As such, perhaps it's worth renaming the left feature tab from SPARQL to RDF?

Here's a sample job entry. I've included "job-type": "CONSTRUCT", as we may also want to use this facility for other RDF-related jobs, such as rebuilding the entire RDF replica of the dataset, e.g. "job-type": "REBUILD". Is there anything else we might need in here?

{
    "_id": "23kl-321jk-gowqm8",
    "dataset": "datasetid",
    "job-type": "CONSTRUCT",
    "query": "CONSTRUCT GRAPH query here...",
    "target-namespace": "spice_datasetid",
    "target-named-graph": "some-graph-name",
    "status": "PENDING",
    "message": "some message 1",
    "history": [
        {
            "message": "some message 2",
            "timestamp": 893478298213
        },
        {
            "message": "some message 3",
            "timestamp": 89347498211
        },
        {
            "message": "some message 4",
            "timestamp": 893478299287
        }
    ],
    "scheduled": 893478298213,
    "submitted-by": "jason.carvalho@open.ac.uk",
    "modified-by": "jason.carvalho@open.ac.uk",
    "_timestamp": 893478298213
}

We can look into cron-like schedules shortly. In the meantime, "scheduled" is a timestamp indicating when the job should be run; it is currently set to 'now' (the timestamp at which the form was submitted), but we can add a date/time picker there too if required.
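
A minimal sketch of how the job processor might interpret the scheduled and status fields when polling the queue (assuming millisecond timestamps; the function is hypothetical and only illustrates the intended semantics):

# Hypothetical sketch: a job is due when it is still PENDING and its
# "scheduled" timestamp (assumed to be milliseconds) is not in the future.
import time

def is_due(job: dict) -> bool:
    now_ms = int(time.time() * 1000)
    return job.get("status") == "PENDING" and job.get("scheduled", 0) <= now_ms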

The message and message history can be managed by the RDF Uploader, so I'll leave that to be filled in as necessary.

JaseMK commented 2 years ago

@luigi-asprino I've a working interface for submitting CONSTRUCT jobs and then also listing the status of existing jobs. This is generating JSON job descriptions as per the example above. However, I have a question about the 'target-namespace' attribute that I suggested above.

Currently the RDF namespace is configured in two places: the backend API has config that, for example, tells it to use 'spice_', and the RdfUploader also has similar config. These two pieces of software must align in this respect. Since the interface I'm building at the moment is part of the front-end LDH, it doesn't have access to this config, so I'd have to configure the RDF namespace in a third place, which doesn't seem ideal.

Since the RdfUploader is already able to infer the correct RDF namespace from the dataset ID, would it be sufficient for the LDH interface to simply pass back the dataset ID? The RDFUploader/job-processor should be able to work out the RDF namespace from this in the same way it does when doing JSON->RDF replication. I can leave the attribute (blank, to start with) in the JSON job description if necessary and it can get populated by your system at the appropriate time, if needed.

JaseMK commented 2 years ago

Which values should we expect the status attribute above to take? I am currently developing the interface with all new jobs having a status of PENDING, but I will need to know what other values might appear here so I can have the job list interface respond and render appropriately.

JaseMK commented 2 years ago

SPARQL validation: I can perform some basic SPARQL validation and also query type classification, before committing new jobs to the job store. Am I right in assuming that all new jobs submitted (for now) should consist of CONSTRUCT SPARQL queries only?
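
As a sketch of the idea only (not the LDH's PHP front end), rdflib in Python can cover both steps: a parse failure means the query is invalid, and the parsed algebra reveals the query type:

# Illustrative sketch using rdflib; assumes validation means "parses as
# SPARQL" and classification means checking for the CONSTRUCT query form.
from rdflib.plugins.sparql import prepareQuery

def is_valid_construct(query_string: str) -> bool:
    try:
        query = prepareQuery(query_string)
    except Exception:
        return False                      # not syntactically valid SPARQL
    return query.algebra.name == "ConstructQuery"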

luigi-asprino commented 2 years ago

> I'm building this under the SPARQL dataset feature tab on the left, so we'll have multiple top tabs under this feature for, say, 'Query', 'Construct graph', 'Job management' or similar. As such, perhaps it's worth renaming the left feature tab from SPARQL to RDF?

I think SPARQL is fine. It is a sort of SPARQL interface to a dataset, which allows you to query and update it.

> Here's a sample job entry. I've included "job-type": "CONSTRUCT", as we may also want to use this facility for other RDF-related jobs, such as rebuilding the entire RDF replica of the dataset, e.g. "job-type": "REBUILD". Is there anything else we might need in here?

If I'm not mistaken, we introduced REBUILD for rebuilding a document from scratch (clearing the graph, triplifying the document again and uploading it to the namespace). Maybe the job entry should be slightly different (you don't need a query for that; the target namespace and graph are deduced from the dataset and document IDs).

{
    "_id": "23kl-321jk-gowqm8",
    "dataset": "datasetid",
    "job-type": "REBUILD",
        "document-id": "42",
    "status": "PENDING",
    "message": "some message 1",
    "history": [
        {
            "message": "some message 2",
            "timestamp": 893478298213
        },
        {
            "message": "some message 3",
            "timestamp": 89347498211
        },
        {
            "message": "some message 4",
            "timestamp": 893478299287
        }
    ],
    "scheduled": 893478298213,
    "submitted-by": "jason.carvalho@open.ac.uk",
    "modified-by": "jason.carvalho@open.ac.uk",
    "_timestamp": 893478298213
}

> @luigi-asprino I've a working interface for submitting CONSTRUCT jobs and then also listing the status of existing jobs. This is generating JSON job descriptions as per the example above. However, I have a question about the 'target-namespace' attribute that I suggested above.
>
> Currently the RDF namespace is configured in two places: the backend API has config that, for example, tells it to use 'spice_', and the RdfUploader also has similar config. These two pieces of software must align in this respect. Since the interface I'm building at the moment is part of the front-end LDH, it doesn't have access to this config, so I'd have to configure the RDF namespace in a third place, which doesn't seem ideal.
>
> Since the RdfUploader is already able to infer the correct RDF namespace from the dataset ID, would it be sufficient for the LDH interface to simply pass back the dataset ID? The RDFUploader/job-processor should be able to work out the RDF namespace from this in the same way it does when doing JSON->RDF replication. I can leave the attribute (blank, to start with) in the JSON job description if necessary and it can get populated by your system at the appropriate time, if needed.

If I got your point, we could let the front end communicate with the RDFUploader only in terms of dataset IDs (UUIDs), and leave it to the RDFUploader to add a prefix to the dataset ID in order to mint a Blazegraph namespace.
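
A minimal sketch of that convention (the 'spice_' prefix comes from the existing config mentioned above; the function name is hypothetical):

# Hypothetical sketch: the RDFUploader mints the Blazegraph namespace from
# the dataset ID alone, so the front end only ever passes dataset IDs.
def mint_namespace(dataset_id: str, prefix: str = "spice_") -> str:
    return prefix + dataset_id

# e.g. mint_namespace("datasetid") -> "spice_datasetid", matching the
# target-namespace shown in the sample job entry above.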

> Which values should we expect the status attribute above to take? I am currently developing the interface with all new jobs having a status of PENDING, but I will need to know what other values might appear here so I can have the job list interface respond and render appropriately.

COMPLETED and ERROR?

> SPARQL validation: I can perform some basic SPARQL validation and also query type classification, before committing new jobs to the job store. Am I right in assuming that all new jobs submitted (for now) should consist of CONSTRUCT SPARQL queries only?

I think so.

JaseMK commented 2 years ago

Job status

The UI expects to see the following values: PENDING, PROCESSING, ERROR, COMPLETE

I have added PROCESSING to the list since there are some jobs, such as rebuilding an entire namespace, that could take a long time to run. I guess the process here would be to first update the status of the job document to PROCESSING before commencing the work, then update it to COMPLETE afterward.
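
A minimal sketch of that lifecycle from the processor's side (run_job and save_job are hypothetical placeholders for the rdf-uploader's real execution and persistence logic):

import time
from typing import Callable

def process(job: dict, run_job: Callable[[dict], None],
            save_job: Callable[[dict], None]) -> None:
    now_ms = lambda: int(time.time() * 1000)
    job["status"] = "PROCESSING"          # mark as running before starting work
    job.setdefault("history", []).append(
        {"message": "job started", "timestamp": now_ms()})
    save_job(job)
    try:
        run_job(job)                      # execute the CONSTRUCT / rebuild
        job["status"] = "COMPLETE"
        job["message"] = "job completed"
    except Exception as exc:
        job["status"] = "ERROR"
        job["message"] = str(exc)
    job["history"].append({"message": job["message"], "timestamp": now_ms()})
    save_job(job)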

JaseMK commented 2 years ago

Job type

Job types submitted from the UI will be one of the following: CONSTRUCT, REBUILDGRAPH, REBUILDNAMESPACE

Note that, for the REBUILDGRAPH jobs, the document ID/target graph will be populated in the same target-named-graph attribute as used for new graphs.
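
For illustration, a REBUILDGRAPH job might then look something like the following (a sketch only; the IDs and timestamps are placeholders and the final set of fields is up to the implementations above):

{
    "_id": "23kl-321jk-gowqm9",
    "dataset": "datasetid",
    "job-type": "REBUILDGRAPH",
    "target-named-graph": "42",
    "status": "PENDING",
    "scheduled": 893478298213,
    "submitted-by": "jason.carvalho@open.ac.uk",
    "modified-by": "jason.carvalho@open.ac.uk",
    "_timestamp": 893478298213
}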

JaseMK commented 2 years ago

The UI functionality should now be in a state that is ready to test with back-end rdf-uploader batch processing of RDF jobs. I may still make some small UI updates but these will be non-breaking changes that are mainly concerned with the layout.

As well as the latest version of the mkdf/mkdf-sparql module (currently v0.10.1), the changes also include updates to the mkdf/mkdf-stream and mkdf/mkdf-core modules. A 'composer update' within the SPICE LDH installation should take care of fetching all of these updated modules.

There is also a small change to the LDH which will need to be pulled from this repository: no changes to the code, but an addition to the config file to specify the name of the rdf-jobs dataset and the key to be used. These do not need to be created in advance; they will be created on first use. Note that the repository specifies these in 'config/autoload/local.php.dist', but they will need to be copied into 'config/autoload/local.php', which is not version controlled here for obvious security reasons.

JaseMK commented 2 years ago

v0.10.2 of mkdf/mkdf-sparql released. Changes as follows: