File handling, how to read this file?

Hi @JvD007,

This dataset is a really tough case. It's neither a CSV nor a TSV file; it's closer to a completely invalid format.

"NETBEHEERDER   NETGEBIED   STRAATNAAM                      POSTCODE_VAN    POSTCODE_TOT    WOONPLAATS                      LANDCODE    PRODUCTSOORT    VERBRUIKSSEGMENT    AANSLUITINGEN_AANTAL    LEVERINGSRICHTING_PERC  FYSIEKE_STATUS_PERC SOORT_AANSLUITING_PERC  SOORT_AANSLUITING   SJV_GEMIDDELD   SJV_LAAG_TARIEF_PERC    SLIMME_METER_PERC"
"Liander NB ""LIANDER"" ""De Ruyterkade Steigers""      ""1011AA""      ""1011AB""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             48,00   100,00  43,75   35,42   ""3x25""            12735,00    56,25   37,50"
"Liander NB ""LIANDER"" ""De Ruyterkade""               ""1011AC""      ""1011AC""      ""AMSTERDAM""                   ""NL""      ""GAS""         ""KVB""             26,00   100,00  69,23   38,46   ""G6""              6921,00 0,00    42,31"
"Liander NB ""LIANDER"" ""De Ruyterkade""               ""1011AC""      ""1011AC""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             39,00   97,44   53,85   28,21   ""3x25""            15108,00    51,28   35,90"
"Liander NB ""LIANDER"" ""Oosterdokskade""              ""1011AD""      ""1011AE""      ""AMSTERDAM""                   ""NL""      ""GAS""         ""KVB""             11,00   100,00  9,09    81,82   ""G4""              1579,00 0,00    9,09"
"Liander NB ""LIANDER"" ""Oosterdokskade""              ""1011AD""      ""1011AD""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             19,00   100,00  0,00    57,89   ""3x25""            3919,00 47,37   0,00"

There are at least 3 problems with it:

1. Every row, including the header, is enclosed in double quotes, so a parser reads it as a single column instead of multiple columns.
2. Because of those outer quotes, every quote inside the values is doubled.
3. The values are padded with unnecessary spaces before the tab separators.
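
The first two problems are easy to reproduce with a standard CSV parser. Here is a small sketch using Python's csv module on a shortened, hypothetical row modeled on the sample above. The tab positions are written out as `\t` and are my assumption about where the real file has them. Because the row starts and ends with a quote, the parser should treat the whole row as one quoted field:

```python
import csv
import io

# Shortened, hypothetical row modeled on the sample above; the "\t"
# positions are an assumption about where the real file has its tabs.
raw_line = '"Liander NB\t""LIANDER""\t""De Ruyterkade""      \t""1011AC""\t26,00"\n'

# Parse it as a tab-separated file with the usual double-quote quoting.
rows = list(csv.reader(io.StringIO(raw_line), delimiter="\t", quotechar='"'))

# The outer quotes wrap the whole row, so the parser returns one field
# and collapses each doubled quote inside it into a single quote.
fields = rows[0]
print(len(fields))
print(repr(fields[0]))
```

The tabs and the padding spaces end up inside that single field, which is why the file needs to be cleaned up before a CSV reader can split it into columns.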

So overall it's a sad example of how data publishers don't do the most basic quality checks on data they expose, creating a lot more work for consumers.

I was able to get the data loaded by adding several preparation steps that use the sed command to replace the offending characters and "massage" the data into an actual TSV format:

apiVersion: 1
kind: DatasetSnapshot
content:
  id: liander.kleinverbruiksgegevens.01012021
  source:
    kind: root
    fetch:
      kind: url
      url: https://www.liander.nl/sites/default/files/210219%20Open%20KV-data%202021.zip
      # Use timestamp from the dataset's caching header as event_time column
      eventTimeSource:
        kind: fromMetadata
      # This dataset is non-temporal and we don't expect it to be ever updated
      cache:
        kind: forever
    prepare:
    - kind: decompress
      format: zip
      subPath: liandergegevens01012021.txt
    # Remove the quotes at the beginning and the end of each line
    - kind: pipe
      command:
      - 'sed'
      - 's/"\(.*\)"/\1/'
    # Replace the pairs of double quotes with a single double quote
    - kind: pipe
      command:
      - 'sed'
      - 's/""/"/g'
    # Remove the run of spaces that directly precedes each TAB character
    - kind: pipe
      command:
      - 'sed'
      - 's/ \+\t/\t/g'
    read:
      kind: csv
      separator: "\t"
      quote: '"'
      header: true
    merge:
      kind: append
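
If you want to sanity-check the three substitutions on a single line before running the full ingest, you can replay them outside of kamu. This is a minimal sketch in Python rather than sed, with the regular expressions translated by hand. The helper name `clean_line` and the sample row (including its tab positions) are hypothetical and not part of the manifest:

```python
import csv
import io
import re


def clean_line(line: str) -> str:
    """Replay the three sed substitutions from the manifest on one line."""
    # Step 1: drop the first and last double quote (sed: s/"\(.*\)"/\1/).
    line = re.sub(r'"(.*)"', r"\1", line, count=1)
    # Step 2: collapse doubled quotes into one (sed: s/""/"/g).
    line = line.replace('""', '"')
    # Step 3: remove spaces that sit directly before a tab (sed: s/ \+\t/\t/g).
    return re.sub(r" +\t", "\t", line)


# Hypothetical row modeled on the sample above; tab positions are assumed.
raw_line = '"Liander NB\t""LIANDER""\t""De Ruyterkade""      \t""1011AC""\t26,00"'

cleaned = clean_line(raw_line)

# After cleaning, a tab-separated reader should split the row into columns.
fields = next(csv.reader(io.StringIO(cleaned), delimiter="\t", quotechar='"'))
print(fields)
```

With this sample row the reader should return five separate fields with the quotes and padding gone, which is the shape the `read` step in the manifest expects.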

Hope this helps!

kamu-data / kamu-cli