"each pipeline would have to end with HTTP POST containing the resulting data to a "clearance" pipeline that could perform some safety checks, such as testing whether the dataset IRI is unique or whether the DCV integrity constraints are satisfied. This way we can offload some of the burden on ETL developers."
Actually, this is why we distinguish staging and production triplestores - all pipelines should output their data to the staging triplestore, where such quality checks can be performed by independent pipelines, and datasets passing the checks can then be pushed to the production triplestore. The HTTP POST is only meant for the OS data push process.
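For concreteness, a minimal sketch (not our actual infrastructure; the endpoint URL and dataset IRI are placeholders) of the kind of check such an independent pipeline could perform, e.g. asking whether a dataset IRI is already taken before promotion:

```python
# Minimal sketch of a "clearance"-style check: ask the production triplestore
# whether a qb:DataSet with the given IRI already exists, so a staged dataset
# is only promoted when its IRI is unique. Endpoint URL and dataset IRI are
# placeholders, not our actual deployment.
from SPARQLWrapper import SPARQLWrapper, JSON

PRODUCTION_ENDPOINT = "http://example.org/production/sparql"  # hypothetical endpoint

def dataset_iri_is_unique(dataset_iri: str) -> bool:
    """Return True if no qb:DataSet with this IRI exists in production yet."""
    sparql = SPARQLWrapper(PRODUCTION_ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX qb: <http://purl.org/linked-data/cube#>
        ASK {{ <{dataset_iri}> a qb:DataSet }}
    """)
    already_exists = sparql.query().convert()["boolean"]
    return not already_exists

# Example: check a staged dataset before pushing it to production
if dataset_iri_is_unique("http://example.org/dataset/budget-2016"):
    print("Dataset IRI is free; safe to push to production.")
```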
As to the comment in the other thread about developers being only human - while this is true and mistakes can happen, 1) the pipelines are first debugged on local machines, where the developer should notice such issues, and 2) the datasets are separated by named graphs, so an erroneous pipeline can only mess up one other preexisting dataset. When this happens, since we have all the production pipelines in place, the pipeline can simply be run again for both the destroyed and the new datasets. For me, it does not seem worth creating a special workflow and pipelines to avoid this, as it seems quite rare for now.
Since LP-ETL doesn't support named graphs, we'd need to enforce another convention: the named graph IRI is the same as the IRI of the qb:DataSet instance. Otherwise, ETL pipelines would have to push the named graph IRI alongside the actual data.
@jakubklimek: How would you trigger the quality checking pipeline if not via the HTTP POST component?
Since LP-ETL doesn't support named graphs, we'd need to enforce another convention: the named graph IRI is the same as the IRI of the qb:DataSet instance. Otherwise, ETL pipelines would have to push the named graph IRI alongside the actual data.
Yes, it should be either the same or deducible from it. Similar rules should apply, e.g. for metadata ("/metadata") and possibly even for component property definitions ("/vocabulary").
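To illustrate the convention, a sketch only - the exact suffixes and whether LP-ETL would derive graph IRIs exactly this way are assumptions:

```python
# Sketch of the naming convention discussed above: all named graph IRIs are
# deducible from the qb:DataSet IRI. The "/metadata" and "/vocabulary" suffixes
# follow the comment; treat them as an assumption, not a settled design.
def graph_iris(dataset_iri: str) -> dict:
    """Derive the named graph IRIs for a dataset from its qb:DataSet IRI."""
    return {
        "data": dataset_iri,                        # named graph IRI == dataset IRI
        "metadata": dataset_iri + "/metadata",      # dataset metadata
        "vocabulary": dataset_iri + "/vocabulary",  # component property definitions
    }

# Example
print(graph_iris("http://example.org/dataset/budget-2016"))
# {'data': 'http://example.org/dataset/budget-2016',
#  'metadata': 'http://example.org/dataset/budget-2016/metadata',
#  'vocabulary': 'http://example.org/dataset/budget-2016/vocabulary'}
```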
As to the triggering, the check does not have to be triggered automatically right after the pipeline finishes. Either the developer wants to do the check immediately and can therefore run the clearance pipeline themselves, or not, in which case such a pipeline can run e.g. every night.
Manual execution of the clearance pipeline requires reconfiguring it: its input needs to be changed, so the clearance pipeline fragment must be imported and reconfigured. Since this requires manual effort, it removes some of the benefits of automation. Moreover, the clearance pipeline can be the same for all datasets, but this approach would require having multiple copies of it.
Scheduling the clearance pipeline has several problems. First, as far as I know, LP-ETL currently doesn't have a scheduler. Second, you would need to be able to determine which datasets should be checked (e.g., the ones that changed since last midnight) in order not to check all datasets.
Manual execution of the clearance pipeline requires reconfiguring it: its input needs to be changed, so the clearance pipeline fragment must be imported and reconfigured. Since this requires manual effort, it removes some of the benefits of automation.
Sure, but it means changing 1 IRI in the local copy of such a pipeline, which is negligible in comparison to the effort invested in creating the pipeline.
Moreover, the clearance pipeline can be the same for all datasets, but this approach would require having multiple copies of it.
You can have one and reconfigure it for individual checks.
First, as far as I know, LP-ETL currently doesn't have a scheduler.
This is by design. Pipelines can be run using the HTTP API; therefore, the scheduling agenda can be left to existing sophisticated schedulers outside of LP-ETL, e.g. cron.
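For example, a minimal sketch of leaving the scheduling to cron (the execution endpoint path and parameter name below are placeholders, not the documented LP-ETL API; see the LP-ETL documentation for the actual call):

```python
# run_clearance.py -- sketch of triggering a pipeline execution over HTTP from
# an external scheduler such as cron. The endpoint path, parameter name and
# pipeline IRI are hypothetical; consult the LP-ETL docs for the real API.
import requests

ETL_EXECUTIONS = "http://localhost:8080/resources/executions"               # hypothetical
CLEARANCE_PIPELINE = "http://localhost:8080/resources/pipelines/clearance"  # hypothetical

response = requests.post(ETL_EXECUTIONS, params={"pipeline": CLEARANCE_PIPELINE})
response.raise_for_status()
print("Started execution:", response.text)
```

A cron entry along the lines of `0 2 * * * python3 /path/to/run_clearance.py` would then run the check every night.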
Second, you would need to be able to determine which datasets should be checked (e.g., the ones that changed since last midnight) in order not to check all datasets.
I think all the datasets should be checked every time, precisely in order to detect unintentional changes etc.
Overall, from the point of view of LP-ETL, ensuring automated data quality checks is a bigger topic that deserves more systematic support in LP-ETL than including a trigger directly in the pipeline itself (e.g. a more formal designation of output data to be checked and the assignment of a quality checking pipeline producing provenance information).
Sure, but it means changing 1 IRI in the local copy of such a pipeline, which is negligible in comparison to the effort invested in creating the pipeline.
What would this IRI point to? Dump of the dataset? The clearance needs more, including DSD and code lists.
You can have one and reconfigure it for individual checks.
Sure, but then you will have the problems of mutable data. If there are two ETL developers modifying a single pipeline, conflicts can arise.
This is by design. Pipelines can be run using the HTTP API; therefore, the scheduling agenda can be left to existing sophisticated schedulers outside of LP-ETL, e.g. cron.
I see.
I think all the datasets should be checked every time, precisely in order to detect unintentional changes etc.
It does not matter whether the changes were intentional or unintentional. All changed datasets should be checked. However, checking all datasets may quickly become too demanding as the number of datasets grows. We should detect which datasets changed (e.g., using a last modification date that could be set automatically for each execution in the metadata components) and only check those.
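A sketch of how such detection could look, assuming dcterms:modified is set automatically by the metadata components and is queryable in the staging store (the endpoint is a placeholder):

```python
# Sketch of change detection: ask the staging store for datasets whose
# dcterms:modified is newer than the last check, so only those are handed to
# the clearance pipeline. Endpoint and use of dcterms:modified are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

STAGING_ENDPOINT = "http://example.org/staging/sparql"  # hypothetical endpoint

def datasets_modified_since(timestamp: str) -> list:
    """Return IRIs of qb:DataSets modified after the given xsd:dateTime string."""
    sparql = SPARQLWrapper(STAGING_ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX qb: <http://purl.org/linked-data/cube#>
        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?dataset WHERE {{
            ?dataset a qb:DataSet ;
                     dcterms:modified ?modified .
            FILTER (?modified > "{timestamp}"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
        }}
    """)
    results = sparql.query().convert()
    return [b["dataset"]["value"] for b in results["results"]["bindings"]]

# Example: datasets changed since last midnight
print(datasets_modified_since("2016-07-12T00:00:00"))
```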
What would this IRI point to? Dump of the dataset? The clearance needs more, including DSD and code lists.
It would contain the dataset IRI; all other IRIs should be deducible. Originally, I thought these checks would be performed by the pipeline that would move a dataset from the staging to the production triplestore, so it would have all the info it needs. Is there any timeline for this since we just decided that staging = production for now and therefore this approach is not applicable? I think having a clearance pipeline even before staging is a bit overkill for OBEU.
Sure, but then you will have the problems of mutable data. If there are two ETL developers modifying a single pipeline, conflicts can arise.
Yes, conflicts can arise. All I am saying is that to me they seem so improbable and so cheap to repair that it is not worth investing effort in their prevention.
It does not matter whether the changes were intentional or unintentional. All changed datasets should be checked. However, checking all datasets may quickly become too demanding as the number of datasets grows. We should detect which datasets changed (e.g., using a last modification date that could be set automatically for each execution in the metadata components) and only check those.
If the change is unintentional, it does not have to be reflected in the dataset metadata's modified date. Also, from the discussions here it seems that the clearance pipeline would only be applied to the manually created pipelines; I don't expect their number to grow that much.
It would contain the dataset IRI; all other IRIs should be deducible.
This may be true for the metadata graph, but not for code lists. We can also think about adhering to the linked data approach and dereferencing all data needed for clearance. However, if we do the clearance before pushing to production, the IRIs will not be dereferenceable, so some IRI rewriting would have to be invented, which would further complicate things.
Originally, I thought these checks would be performed by the pipeline that would move a dataset from the staging to the production triplestore, so it would have all the info it needs.
Yes, that would be ideal. However, since we decided (for the moment) not to have this pipeline and to sync the Fuseki database files directly, it is not applicable.
Is there any timeline for this since we just decided that staging = production for now and therefore this approach is not applicable? I think having a clearance pipeline even before staging is a bit overkill for OBEU.
I think this is a nice-to-have at this moment. However, @HimmelStein might think otherwise.
If the change is unintentional, it does not have to be reflected in the dataset metadata's modified date.
If the last modified date is set automatically by the metadata components, then even an unintentional change is detected, unless the change is done directly in the files produced by the pipeline.
This may be true for the metadata graph, but not for code lists.
True. But then this can be parsed from the DSD (qb:codeList object + deduction), right?
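For illustration, a sketch of that deduction (the staging endpoint is a placeholder; data using qb:dimension directly instead of qb:componentProperty would need the property path extended):

```python
# Sketch of deducing code list IRIs from the DSD: follow qb:structure /
# qb:component / qb:componentProperty and collect qb:codeList values. The
# endpoint is a placeholder and the property path is an assumption about how
# the DSDs are modelled.
from SPARQLWrapper import SPARQLWrapper, JSON

STAGING_ENDPOINT = "http://example.org/staging/sparql"  # hypothetical endpoint

def code_lists_for(dataset_iri: str) -> list:
    """Return the code list IRIs referenced by the dataset's DSD."""
    sparql = SPARQLWrapper(STAGING_ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        PREFIX qb: <http://purl.org/linked-data/cube#>
        SELECT DISTINCT ?codeList WHERE {{
            <{dataset_iri}> qb:structure/qb:component/qb:componentProperty/qb:codeList ?codeList .
        }}
    """)
    results = sparql.query().convert()
    return [b["codeList"]["value"] for b in results["results"]["bindings"]]

print(code_lists_for("http://example.org/dataset/budget-2016"))
```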
We can also think about adhering to the linked data approach and dereferencing all data needed for clearance. However, if we do the clearance before pushing to production, the IRIs will not be dereferenceable, so some IRI rewriting would have to be invented, which would further complicate things.
Yeah, too complicated.
If the last modified date is set automatically by the metadata components, then even an unintentional change is detected, unless the change is done directly in the files produced by the pipeline.
Yes, I was thinking about errors caused directly or by new faulty pipelines (e.g. by merging an incorrect graph) rather than by an accidental run of an older pipeline.
So I think the question is what exactly is needed and when, as there are many "nice to have" things that are not really needed or save less effort than they require.
But then this can be parsed from the DSD (qb:codeList object + deduction), right?
Yes, but dereferencing the extracted IRIs might not work, as described above.
Thanks for the active feedback on case (2), RDF datasets developed by ETL developers. Let us simplify it -- general-purpose quality checking of RDF datasets is 'nice to have', but would cost too much effort to develop. The existing quality checking pipelines have to be configured before being applied, so this is a task for individual ETL developers. I will therefore continue to manually check the quality of the FDP2RDF pipeline by running concrete examples.
@badmotor @HimmelStein @jakubklimek @jindrichmynarz
I'm closing this. It must be out of date, and any specific issues can be raised as distinct issues.
@jindrichmynarz @jakubklimek @marek-dudas @akariv @mlukasch @larjohn @badmotor as we discussed at https://github.com/openbudgets/platform/issues/9#issuecomment-231854081, there will be RDF quality checking pipelines at the Fraunhofer server to guarantee the quality of the RDF datasets. As suggested by Jindrich: "each pipeline would have to end with HTTP POST containing the resulting data to a "clearance" pipeline that could perform some safety checks, such as testing whether the dataset IRI is unique or whether the DCV integrity constraints are satisfied. This way we can offload some of the burden on ETL developers."
We distinguish two kinds of RDF datasets: (1) RDF datasets transformed by the FDP2RDF pipeline; (2) RDF datasets developed by ETL developers. For case (1), the uniqueness of the graph name is guaranteed by the Openspending system, and all other quality checking is (or can be) done by appending relevant pipelines at the end; for case (2), we follow Jindrich's suggestion.
For the convenience of testing, all pipelines developed in Task 2.2 will be imported into the LinkedPipes ETL at the Fraunhofer server. @jakubklimek please update us on the current status of these pipelines. @akariv I am not sure when the uniqueness of an FDP datapackage is checked: at Step 4 of the wizard, or at the step of pushing it into the database.