Closed JvD007 closed 3 years ago
Hi @JvD007,
This dataset is a really tough case. It's neither a CVS of a TSV file, more like a completely invalid format.
"NETBEHEERDER NETGEBIED STRAATNAAM POSTCODE_VAN POSTCODE_TOT WOONPLAATS LANDCODE PRODUCTSOORT VERBRUIKSSEGMENT AANSLUITINGEN_AANTAL LEVERINGSRICHTING_PERC FYSIEKE_STATUS_PERC SOORT_AANSLUITING_PERC SOORT_AANSLUITING SJV_GEMIDDELD SJV_LAAG_TARIEF_PERC SLIMME_METER_PERC"
"Liander NB ""LIANDER"" ""De Ruyterkade Steigers"" ""1011AA"" ""1011AB"" ""AMSTERDAM"" ""NL"" ""ELK"" ""KVB"" 48,00 100,00 43,75 35,42 ""3x25"" 12735,00 56,25 37,50"
"Liander NB ""LIANDER"" ""De Ruyterkade"" ""1011AC"" ""1011AC"" ""AMSTERDAM"" ""NL"" ""GAS"" ""KVB"" 26,00 100,00 69,23 38,46 ""G6"" 6921,00 0,00 42,31"
"Liander NB ""LIANDER"" ""De Ruyterkade"" ""1011AC"" ""1011AC"" ""AMSTERDAM"" ""NL"" ""ELK"" ""KVB"" 39,00 97,44 53,85 28,21 ""3x25"" 15108,00 51,28 35,90"
"Liander NB ""LIANDER"" ""Oosterdokskade"" ""1011AD"" ""1011AE"" ""AMSTERDAM"" ""NL"" ""GAS"" ""KVB"" 11,00 100,00 9,09 81,82 ""G4"" 1579,00 0,00 9,09"
"Liander NB ""LIANDER"" ""Oosterdokskade"" ""1011AD"" ""1011AD"" ""AMSTERDAM"" ""NL"" ""ELK"" ""KVB"" 19,00 100,00 0,00 57,89 ""3x25"" 3919,00 47,37 0,00"
There are at least 3 problems with it:
So overall it's a sad example of how data publishers don't do the most basic quality checks on data they expose, creating a lot more work for consumers.
I was able to get data loaded using multiple preparation steps with sed
command to replace the characters and "massage" the data into an actual TSV format:
apiVersion: 1
kind: DatasetSnapshot
content:
id: liander.kleinverbruiksgegevens.01012021
source:
kind: root
fetch:
kind: url
url: https://www.liander.nl/sites/default/files/210219%20Open%20KV-data%202021.zip
# Use timestamp from the dataset's caching header as event_time column
eventTimeSource:
kind: fromMetadata
# This dataset is non-temporal and we don't expect it to be ever updated
cache:
kind: forever
prepare:
- kind: decompress
format: zip
subPath: liandergegevens01012021.txt
# Remove the quotes at the beginning and the end of each line
- kind: pipe
command:
- 'sed'
- 's/"\(.*\)"/\1/'
# Replace the pairs of double quotes with a single double quote
- kind: pipe
command:
- 'sed'
- 's/""/"/g'
# Remove all spaces followed by the TAB character
- kind: pipe
command:
- 'sed'
- 's/ \+\t/\t/g'
read:
kind: csv
separator: "\t"
quote: '"'
header: true
merge:
kind: append
Hope this helps!
Unclear how to fix this, data is coming from https://www.liander.nl/sites/default/files/210219%20Open%20KV-data%202021.zip
My yaml file is attached
liander.kleinverbruiksgegevens.01012021.zip
$kamu pull liander.kleinverbruiksgegevens.01012021 [4/7] Failed to update root dataset (liander.kleinverbruiksgegevens.01012021) 1 dataset(s) had errors
Maybe I'm doing something wrong