kamu-data / kamu-cli

Next-generation decentralized data lakehouse and a multi-party stream processing network
https://kamu.dev

File handling, how to read this file? #44

Closed: JvD007 closed this issue 3 years ago

JvD007 commented 3 years ago

It's unclear to me how to fix this. The data comes from https://www.liander.nl/sites/default/files/210219%20Open%20KV-data%202021.zip

My YAML file is attached:

liander.kleinverbruiksgegevens.01012021.zip

$ kamu pull liander.kleinverbruiksgegevens.01012021
[4/7] Failed to update root dataset (liander.kleinverbruiksgegevens.01012021)
1 dataset(s) had errors

Maybe I'm doing something wrong

sergiimk commented 3 years ago

Hi @JvD007,

This dataset is a really tough case. It's neither a CSV nor a TSV file; it's more like a completely invalid format.

"NETBEHEERDER   NETGEBIED   STRAATNAAM                      POSTCODE_VAN    POSTCODE_TOT    WOONPLAATS                      LANDCODE    PRODUCTSOORT    VERBRUIKSSEGMENT    AANSLUITINGEN_AANTAL    LEVERINGSRICHTING_PERC  FYSIEKE_STATUS_PERC SOORT_AANSLUITING_PERC  SOORT_AANSLUITING   SJV_GEMIDDELD   SJV_LAAG_TARIEF_PERC    SLIMME_METER_PERC"
"Liander NB ""LIANDER"" ""De Ruyterkade Steigers""      ""1011AA""      ""1011AB""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             48,00   100,00  43,75   35,42   ""3x25""            12735,00    56,25   37,50"
"Liander NB ""LIANDER"" ""De Ruyterkade""               ""1011AC""      ""1011AC""      ""AMSTERDAM""                   ""NL""      ""GAS""         ""KVB""             26,00   100,00  69,23   38,46   ""G6""              6921,00 0,00    42,31"
"Liander NB ""LIANDER"" ""De Ruyterkade""               ""1011AC""      ""1011AC""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             39,00   97,44   53,85   28,21   ""3x25""            15108,00    51,28   35,90"
"Liander NB ""LIANDER"" ""Oosterdokskade""              ""1011AD""      ""1011AE""      ""AMSTERDAM""                   ""NL""      ""GAS""         ""KVB""             11,00   100,00  9,09    81,82   ""G4""              1579,00 0,00    9,09"
"Liander NB ""LIANDER"" ""Oosterdokskade""              ""1011AD""      ""1011AD""      ""AMSTERDAM""                   ""NL""      ""ELK""         ""KVB""             19,00   100,00  0,00    57,89   ""3x25""            3919,00 47,37   0,00"

There are at least 3 problems with it:

1. Every line is wrapped in an extra pair of double quotes, so a parser sees the entire row as a single quoted field.
2. All double quotes inside a line are doubled (""), i.e. quote-escaped to survive that outer quoting.
3. Values are padded with spaces before the TAB separators, leaving trailing whitespace in every field.

So overall it's a sad example of how data publishers don't do the most basic quality checks on the data they expose, creating a lot more work for consumers.
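To make this concrete, here is what the three sed substitutions used in the manifest below do to one shortened, synthetic sample line (just a sketch; GNU sed is assumed, since \+ is a GNU extension):

# Simulate one raw line: outer quotes, doubled inner quotes,
# and space padding before each TAB separator
printf '"Liander NB ""LIANDER""   \t""1011AC""   \t26,00"\n' \
  | sed 's/"\(.*\)"/\1/' \
  | sed 's/""/"/g' \
  | sed 's/ \+\t/\t/g'
# Prints: Liander NB "LIANDER"<TAB>"1011AC"<TAB>26,00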

I was able to get the data loaded using multiple preparation steps with the sed command to replace the offending characters and "massage" the data into an actual TSV format:

apiVersion: 1
kind: DatasetSnapshot
content:
  id: liander.kleinverbruiksgegevens.01012021
  source:
    kind: root
    fetch:
      kind: url
      url: https://www.liander.nl/sites/default/files/210219%20Open%20KV-data%202021.zip
      # Use timestamp from the dataset's caching header as event_time column
      eventTimeSource:
        kind: fromMetadata
      # This dataset is non-temporal and we don't expect it to be ever updated
      cache:
        kind: forever
    prepare:
    - kind: decompress
      format: zip
      subPath: liandergegevens01012021.txt
    # Remove the quotes at the beginning and the end of each line
    - kind: pipe
      command:
      - 'sed'
      - 's/"\(.*\)"/\1/'
    # Replace the pairs of double quotes with a single double quote
    - kind: pipe
      command:
      - 'sed'
      - 's/""/"/g'
    # Remove all spaces followed by the TAB character
    - kind: pipe
      command:
      - 'sed'
      - 's/ \+\t/\t/g'
    read:
      kind: csv
      separator: "\t"
      quote: '"'
      header: true
    merge:
      kind: append
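
For reference, the same cleanup is easy to reproduce outside of kamu to eyeball the result first; this is just a sketch that assumes you saved the archive locally as kv-data-2021.zip:

# Stream the TXT member out of the zip and apply the same three
# substitutions as the prepare steps above (GNU sed):
unzip -p kv-data-2021.zip liandergegevens01012021.txt \
  | sed 's/"\(.*\)"/\1/' \
  | sed 's/""/"/g' \
  | sed 's/ \+\t/\t/g' \
  | head -n 5

If the first printed line is the header row and the fields line up on single TABs, the read step above (CSV with a \t separator and header: true) should parse it cleanly.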

Hope this helps!