Open jakubklimek opened 5 years ago
Examples are needed: queries, invalid data, etc.
For instance, this is an example of Turtle containing invalid IRIs, an invalid xsd:dateTime literal, and an invalid language tag:
<http://europeandataportal.eu/set/data/37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> a <http://www.w3.org/ns/dcat#Dataset>;
<http://xmlns.com/foaf/0.1/page> <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> ;
<http://purl.org/dc/terms/modified> "2011-04-26T20:15:00-23:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
<http://purl.org/dc/terms/title> "Vaarwegkenmerken in Nederland bunkerstations"@nederlands ;
<http://www.w3.org/ns/dcat#landingPage> <file://pandataIMAGISDONNEESImages.gdb>;
<http://xmlns.com/foaf/0.1/page> <http://geoportal.saarland.de/mapbender/php/mod_showMetadata.php/../wms.php%3Flayer_id%3D34720%26PHPSESSID%3D350bd85dbc227b19fffbea4100440b49%26INSPIRE%3D1%26REQUEST%3DGetCapabilities%26VERSION%3D1.1.1%26SERVICE%3DWMS> .
Apache Jena riot says:
klimek@KLIMEK-MFF-NTB:/mnt/c/Users/Kuba/Desktop$ /opt/apache-jena/bin/riot --validate test.ttl
11:00:38 WARN riot :: [line: 2, col: 36] Bad IRI: <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> Code: 28/NOT_DNS_NAME in HOST: The host component did not meet the restrictions on DNS names.
11:00:38 WARN riot :: [line: 2, col: 36] Bad IRI: <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> Code: 29/USE_PUNYCODE_NOT_PERCENTS in HOST: The host component used percent encoding, where punycode is preferred.
11:00:38 WARN riot :: [line: 3, col: 39] Lexical form '2011-04-26T20:15:00-23:00' not valid for datatype XSD dateTime
11:00:38 WARN riot :: [line: 4, col: 34] Language not valid: nederlands
11:00:38 WARN riot :: [line: 5, col: 43] Bad IRI: <file://pandataIMAGISDONNEESImages.gdb> Code: 57/REQUIRED_COMPONENT_MISSING in PATH: A component that is required by the scheme is missing.
11:00:38 WARN riot :: [line: 6, col: 36] Bad IRI: <http://geoportal.saarland.de/mapbender/php/wms.php%3Flayer_id%3D34720%26PHPSESSID%3D350bd85dbc227b19fffbea4100440b49%26INSPIRE%3D1%26REQUEST%3DGetCapabilities%26VERSION%3D1.1.1%26SERVICE%3DWMS> Code: 8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../ not at the beginning of a relative reference, or it contains a /./ These should be removed.
RDF4J reports nothing, see this pipeline.
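The riot warnings above follow a regular layout, so a component could turn them into structured records before deciding what to drop. The following is only a sketch: the pattern is inferred from the log lines shown in this issue (not from any documented riot output format), and `RiotWarning`, `parse_riot_line` and `flagged_lines` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional, Set


class RiotWarning(NamedTuple):
    level: str
    line: int
    col: int
    message: str


# Inferred from lines such as:
# "11:00:38 WARN riot :: [line: 2, col: 36] Bad IRI: <...> Code: ..."
_PATTERN = re.compile(
    r"^\S+\s+(?P<level>WARN|ERROR)\s+riot\s+::\s+"
    r"\[line:\s*(?P<line>\d+),\s*col:\s*(?P<col>\d+)\s*\]\s*(?P<message>.*)$"
)


def parse_riot_line(text: str) -> Optional[RiotWarning]:
    """Return a structured record, or None if the line does not match."""
    m = _PATTERN.match(text.strip())
    if m is None:
        return None
    return RiotWarning(m["level"], int(m["line"]), int(m["col"]), m["message"])


def flagged_lines(log: str) -> Set[int]:
    """Collect the input line numbers that riot complained about."""
    parsed = (parse_riot_line(entry) for entry in log.splitlines())
    return {w.line for w in parsed if w is not None}
```

The reported line numbers map directly to triples only in a line-oriented serialization such as N-Triples. In Turtle, one statement can span several lines, so dropping flagged lines there would break the syntax.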
There are a few things to consider:
Adding a validation component means that we need to be able to read invalid data from a source, store it, and load it inside a pipeline.
In particular, every input component (loader) must be able to handle invalid data without discarding it, since discarding is the task of the validation component. If any form of validation or auto-fix is introduced, it needs to be available to all components. This can greatly increase the already high code duplication, leading to poor maintainability.
It could be something that works only with RDF dump files for a start. Usually, we are able to produce such invalid dumps.
You mean produce outside of ETL?
Not outside, as a component. Nevertheless, our particular issue was caused by a \13 character (vertical tab) in a literal from the European Data Portal, which Apache Jena Riot accepts, but OpenLink Virtuoso does not.
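A check for this kind of character can be sketched without any RDF library. The snippet below scans lines of a dump for C0 control characters other than tab, line feed and carriage return. Which characters a given store actually rejects is an assumption here, and `find_control_chars` is a hypothetical helper name.

```python
import re
from typing import Iterable, Iterator, Tuple

# C0 control characters except tab (\x09), line feed (\x0a) and
# carriage return (\x0d). \x0b is the vertical tab, i.e. octal \13.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_control_chars(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line number, column, code point) for each control character."""
    for line_no, line in enumerate(lines, start=1):
        for m in _CONTROL.finditer(line):
            yield line_no, m.start() + 1, f"U+{ord(m.group()):04X}"
```

When feeding it a file, iterate over the file object directly rather than calling `str.splitlines()`, because `splitlines()` treats the vertical tab as a line boundary and would hide the very character being searched for.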
Many RDF dumps contain invalid triples (invalid IRIs, invalid literals, etc.), which cause problems when loading them into triplestores. For instance, the Apache Jena riot tool is able to report invalid triples. What we need is a component that omits those invalid triples, producing a valid RDF dump without them.
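A minimal sketch of such an omitting step, assuming the dump is in N-Triples (one triple per line) so that a bad triple can be dropped by dropping its line. It covers only two of the literal problems reported above (language tag and xsd:dateTime lexical form) and uses simplified patterns that do not replicate riot's checks: IRI validation is left out, and the dateTime pattern ignores days per month and the 24:00:00 form. All names are hypothetical.

```python
import re
from typing import Iterable, Iterator

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"

# Literal object at the end of an N-Triples line:
# "lexical"@lang or "lexical"^^<datatype>, followed by the final dot.
_LITERAL = re.compile(
    r'"(?P<lex>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dt>[^>]*)>)?\s*\.\s*$'
)

# Simplified language tag shape: 2-8 letters, then 1-8 character subtags.
_LANG = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")

# Simplified xsd:dateTime: time zone offset limited to +/-14:00.
_DATETIME = re.compile(
    r"-?\d{4,}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"T([01]\d|2[0-3]):[0-5]\d:[0-5]\d(\.\d+)?"
    r"(Z|[+-]((0\d|1[0-3]):[0-5]\d|14:00))?"
)


def is_valid_line(line: str) -> bool:
    """Return False if the line ends in a literal that fails a check."""
    m = _LITERAL.search(line)
    if m is None:
        return True  # object is not a literal; IRI checks are out of scope
    if m["lang"] is not None and not _LANG.fullmatch(m["lang"]):
        return False
    if m["dt"] == XSD_DATETIME and not _DATETIME.fullmatch(m["lex"]):
        return False
    return True


def filter_valid(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that pass the checks, preserving order."""
    for line in lines:
        if is_valid_line(line):
            yield line
```

With these patterns, the `@nederlands` title and the `-23:00` dateTime from the example would be dropped, while the remaining lines pass through unchanged.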