linkedpipes / etl

LinkedPipes ETL is an RDF based, lightweight ETL tool
https://etl.linkedpipes.com

Add possibility (component) to fix invalid RDF dumps #660

Open jakubklimek opened 5 years ago

jakubklimek commented 5 years ago

Many RDF dumps contain invalid triples (invalid IRIs, invalid literals, etc.), which cause problems when loading them into triplestores and elsewhere. For instance, the Apache Jena riot tool is able to report invalid triples. What we need is a component that omits those invalid triples and produces a valid RDF dump without them.

skodapetr commented 5 years ago

Examples are needed: queries, invalid data, etc.

jakubklimek commented 5 years ago

For instance, this is an example of Turtle containing invalid IRIs, an invalid literal, and an invalid language tag:

<http://europeandataportal.eu/set/data/37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> a <http://www.w3.org/ns/dcat#Dataset>;
  <http://xmlns.com/foaf/0.1/page> <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> ;
  <http://purl.org/dc/terms/modified> "2011-04-26T20:15:00-23:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
<http://purl.org/dc/terms/title> "Vaarwegkenmerken in Nederland bunkerstations"@nederlands ;
  <http://www.w3.org/ns/dcat#landingPage> <file://pandataIMAGISDONNEESImages.gdb>;
  <http://xmlns.com/foaf/0.1/page> <http://geoportal.saarland.de/mapbender/php/mod_showMetadata.php/../wms.php%3Flayer_id%3D34720%26PHPSESSID%3D350bd85dbc227b19fffbea4100440b49%26INSPIRE%3D1%26REQUEST%3DGetCapabilities%26VERSION%3D1.1.1%26SERVICE%3DWMS> .

Apache Jena riot says:

klimek@KLIMEK-MFF-NTB:/mnt/c/Users/Kuba/Desktop$ /opt/apache-jena/bin/riot --validate test.ttl
11:00:38 WARN  riot                 :: [line: 2, col: 36] Bad IRI: <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> Code: 28/NOT_DNS_NAME in HOST: The host component did not meet the restrictions on DNS names.
11:00:38 WARN  riot                 :: [line: 2, col: 36] Bad IRI: <http://gis.georgsmarienhuette.de%3A80/bpl/csw/ServePdfFile.action%3Fuuid%3D37d9420c-1aa9-43b3-8ecd-3bad592ac3a6> Code: 29/USE_PUNYCODE_NOT_PERCENTS in HOST: The host component used percent encoding, where punycode is preferred.
11:00:38 WARN  riot                 :: [line: 3, col: 39] Lexical form '2011-04-26T20:15:00-23:00' not valid for datatype XSD dateTime
11:00:38 WARN  riot                 :: [line: 4, col: 34] Language not valid: nederlands
11:00:38 WARN  riot                 :: [line: 5, col: 43] Bad IRI: <file://pandataIMAGISDONNEESImages.gdb> Code: 57/REQUIRED_COMPONENT_MISSING in PATH: A component that is required by the scheme is missing.
11:00:38 WARN  riot                 :: [line: 6, col: 36] Bad IRI: <http://geoportal.saarland.de/mapbender/php/wms.php%3Flayer_id%3D34720%26PHPSESSID%3D350bd85dbc227b19fffbea4100440b49%26INSPIRE%3D1%26REQUEST%3DGetCapabilities%26VERSION%3D1.1.1%26SERVICE%3DWMS> Code: 8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../ not at the beginning of a relative reference, or it contains a /./ These should be removed.
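A few of the conditions flagged above can be approximated with simple checks. The following is only a rough sketch in Python, not how riot implements its validation: the function names are hypothetical, the language-tag pattern is a simplified shape rather than full BCP 47, and the IRI checks cover only the specific problems seen in this example.

```python
import re
from urllib.parse import urlsplit

# Simplified language-tag shape (an assumption, not full BCP 47):
# a primary subtag of 2-8 letters, then optional 1-8 character subtags.
LANG_TAG = re.compile(r"^[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*$")

# Trailing timezone offset of an xsd:dateTime lexical form.
TZ_OFFSET = re.compile(r"([+-])(\d{2}):(\d{2})$")


def lang_tag_ok(tag: str) -> bool:
    """Check only the rough shape of a language tag."""
    return LANG_TAG.match(tag) is not None


def timezone_ok(lexical: str) -> bool:
    """XSD limits timezone offsets to the range -14:00 .. +14:00."""
    match = TZ_OFFSET.search(lexical)
    if match is None:
        return True  # no numeric offset (none at all, or "Z")
    hours, minutes = int(match.group(2)), int(match.group(3))
    return minutes < 60 and (hours < 14 or (hours == 14 and minutes == 0))


def iri_problems(iri: str) -> list[str]:
    """Report the kinds of IRI problems that appear in the example."""
    parts = urlsplit(iri)
    problems = []
    if "%" in parts.netloc:
        problems.append("percent-encoding in host")
    if "/../" in parts.path or "/./" in parts.path:
        problems.append("dot segment in path")
    if parts.scheme == "file" and not parts.path:
        problems.append("file IRI without a path")
    return problems
```

With the data above, `lang_tag_ok("nederlands")` and `timezone_ok("2011-04-26T20:15:00-23:00")` both return `False`, and `iri_problems` reports something for each of the three distinct IRIs in the Turtle.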

RDF4J reports nothing; see this pipeline.

skodapetr commented 5 years ago

There are a few things to consider:

Adding a validation component means that we need to be able to read invalid data from a source, store it, and load it inside a pipeline.

In particular, every input component (loader) must be able to handle invalid data without throwing it away, as that is the task of the validation component. If any form of validation or auto-fix is introduced, it needs to be available to all components. This can greatly increase the already high duplication in the code, leading to poor maintainability.

jakubklimek commented 5 years ago

It could be something that works only with RDF dump files for a start. Usually, we are able to produce such invalid dumps.
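To make the dump-only idea concrete, here is a minimal sketch in Python, assuming the dump is in a line-based serialization such as N-Triples (one statement per line). The function name `split_dump` and the `is_valid` callback are hypothetical, not part of LinkedPipes ETL; the actual validity check would be plugged in by the caller.

```python
from typing import Callable, Iterable, List, Tuple


def split_dump(
    lines: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Split a line-based dump into kept lines and dropped (line number, line) pairs."""
    kept: List[str] = []
    dropped: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Blank lines and comments pass through without validation.
        if not text or text.startswith("#") or is_valid(text):
            kept.append(line)
        else:
            dropped.append((number, line))
    return kept, dropped
```

The kept lines form the cleaned dump, and the dropped pairs can be written to a report so that nothing disappears silently. This only works when a statement never spans multiple lines, so Turtle like the example above would first need to be converted.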

skodapetr commented 5 years ago

You mean produce outside of ETL?

jakubklimek commented 5 years ago

Not outside, as a component. Nevertheless, our particular issue was caused by a \13 character (octal 13, i.e. U+000B, vertical tab) in a literal from the European Data Portal, which Apache Jena riot accepts but OpenLink Virtuoso does not.
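A check for this particular case could look roughly like the sketch below (Python, hypothetical names). It assumes the character appears raw in the text rather than as an escape sequence, and it flags every C0 control character except tab, line feed, and carriage return; exactly which characters Virtuoso rejects is an assumption here, not something verified.

```python
from typing import List, Tuple

# C0 control characters other than tab, line feed and carriage return.
# Which of these a given triplestore rejects is an assumption.
SUSPICIOUS = {chr(code) for code in range(0x20)} - {"\t", "\n", "\r"}


def control_chars(text: str) -> List[Tuple[int, str]]:
    """Return (offset, code point) pairs for suspicious control characters."""
    return [
        (offset, f"U+{ord(char):04X}")
        for offset, char in enumerate(text)
        if char in SUSPICIOUS
    ]
```

For example, `control_chars('"foo\x0bbar"')` flags the vertical tab at offset 4 as `U+000B`, while a string containing only ordinary whitespace yields an empty list.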