linkedpipes / etl

LinkedPipes ETL is an RDF based, lightweight ETL tool
https://etl.linkedpipes.com

t-tabular: Property IRIs escaped inconsistently #232

Open jakubklimek opened 7 years ago

jakubklimek commented 7 years ago

Property IRIs are escaped inconsistently. In the filename part, they are not escaped at all, which results in invalid IRIs (containing spaces). In the fragment part (after the #), they are percent-encoded as URIs, including non-ASCII UTF-8 characters, which is unnecessary with IRIs: Czech characters do not need to be escaped anymore.

_:node1au259a7px100 <file:///Bankovní spojení-input.xlsx.csv#%C4%8C%C3%ADslo+%C3%BA%C4%8Dtu> "7921371" ;
    <file:///Bankovní spojení-input.xlsx.csv#Bankovn%C3%AD+spojen%C3%AD> "Česká národní banka" ;
    <file:///Bankovní spojení-input.xlsx.csv#IBAN> "CZ8907100010110007921371" ;
    <file:///Bankovní spojení-input.xlsx.csv#K%C3%B3d+banky> "0710" ;
    <file:///Bankovní spojení-input.xlsx.csv#Oblast> "Důchodové pojištění OSVČ" ;
    <file:///Bankovní spojení-input.xlsx.csv#P%C5%99ed%C4%8D%C3%ADsl%C3%AD> "1011" ;
    <file:///Bankovní spojení-input.xlsx.csv#Pracovi%C5%A1t%C4%9B> "OSSZ Plzeň-sever" ;
    <file:///Bankovní spojení-input.xlsx.csv#SWIFT> "CNBACZPP" .

It would be best if 1) escaping were consistent and 2) UTF-8 characters were not escaped, e.g.:

_:node1au259a7px100 <file:///Bankovní+spojení-input.xlsx.csv#Číslo+účtu> "7921371" ;
    <file:///Bankovní+spojení-input.xlsx.csv#Bankovní+spojení> "Česká národní banka" ;
    <file:///Bankovní+spojení-input.xlsx.csv#IBAN> "CZ8907100010110007921371" ;
    <file:///Bankovní+spojení-input.xlsx.csv#Kód+banky> "0710" ;
    <file:///Bankovní+spojení-input.xlsx.csv#Oblast> "Důchodové pojištění OSVČ" ;
    <file:///Bankovní+spojení-input.xlsx.csv#Předčíslí> "1011" ;
    <file:///Bankovní+spojení-input.xlsx.csv#Pracoviště> "OSSZ Plzeň-sever" ;
    <file:///Bankovní+spojení-input.xlsx.csv#SWIFT> "CNBACZPP" .

skodapetr commented 5 years ago

It seems to be consistent now:

_:node1df0cpihix2 <file://Bankovní\u0020spojení-input.xlsx.csv#Bankovn%C3%AD+spojen%C3%AD>
    "Česká národní banka";
  <file://Bankovní\u0020spojení-input.xlsx.csv#K%C3%B3dbanky> "0710" .

We are using URLEncoder.encode(part, "UTF-8") for encoding, so we would need to find or write another function that leaves non-ASCII (UTF-8) characters unescaped.
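
One possible shape for such a function, as a minimal sketch and not the fix adopted in LinkedPipes ETL: percent-encode only the ASCII characters that are unsafe in an IRI component and pass non-ASCII characters through unchanged. The class name IriEscaper, the method escape, and the set of ASCII characters that are kept are assumptions made for illustration. RFC 3987 also excludes some non-ASCII ranges (for example private-use characters), which this sketch does not check.

```java
import java.nio.charset.StandardCharsets;

public class IriEscaper {

    // ASCII punctuation kept as-is (the RFC 3986 "unreserved" marks).
    // Assumption: a deliberately conservative set; all other ASCII is escaped.
    private static final String SAFE_MARKS = "-._~";

    // Percent-encodes one IRI component (file name or fragment), leaving
    // code points from U+00A0 upwards untouched.
    static String escape(String part) {
        StringBuilder out = new StringBuilder();
        part.codePoints().forEach(cp -> {
            boolean asciiSafe = (cp >= 'a' && cp <= 'z')
                    || (cp >= 'A' && cp <= 'Z')
                    || (cp >= '0' && cp <= '9')
                    || SAFE_MARKS.indexOf(cp) >= 0;
            if (asciiSafe || cp >= 0xA0) {
                out.appendCodePoint(cp);
            } else {
                // Escape each UTF-8 byte of the code point as %XX.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The escapes below spell two Czech names from the issue
        // so that the source file itself stays ASCII-only.
        System.out.println(escape("K\u00F3d banky"));
        System.out.println(escape("Bankovn\u00ED spojen\u00ED-input.xlsx.csv"));
    }
}
```

With this sketch, Kód banky becomes Kód%20banky and Bankovní spojení-input.xlsx.csv becomes Bankovní%20spojení-input.xlsx.csv. The space is written as %20 rather than +, because + for a space is a form-encoding convention; that choice is an assumption and differs from the + used in the desired output above.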

jakubklimek commented 5 years ago

That is still not consistent. Why is the space in the filename part encoded as \u0020, which is the Unicode escape used in Turtle, and not percent-encoded as %20, as it would be in the fragment part?
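
The difference between the two forms can be made concrete with a minimal sketch. As far as the Turtle grammar goes, a backslash-u escape inside <...> is a serialization-level escape that a parser decodes back to the character itself, so the resulting IRI still contains a literal space, whereas %20 is part of the IRI string. The class TurtleIriUnescape and its method are hypothetical names, and the sketch handles only the four-digit escape form, not the eight-digit one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TurtleIriUnescape {

    // Matches a literal backslash, the letter u, and four hex digits.
    private static final Pattern UCHAR =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Decodes four-digit Turtle numeric escapes; percent-encoded
    // sequences such as %20 are left exactly as written.
    static String unescape(String iriRef) {
        Matcher m = UCHAR.matcher(iriRef);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out,
                    Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Escaped form: decodes to an IRI containing a raw space.
        System.out.println(unescape("file://a\\u0020b.csv"));
        // Percent-encoded form: unchanged, no raw space appears.
        System.out.println(unescape("file://a%20b.csv"));
    }
}
```

Under these assumptions, the escaped form only hides the space from the Turtle tokenizer, while the percent-encoded form removes it from the IRI, which is why the two parts of the property IRI are still not treated the same way.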