pelias / geonames

Import pipeline for geonames into Pelias
https://pelias.io
MIT License

download_metadata fails to download #404

Closed: louis-h-p closed this issue 2 years ago

louis-h-p commented 2 years ago

I'm trying to install geonames. I've done this dozens of times in the past, but I can't get it to download any data now. I've tried using the geonames Docker image, an AWS instance, and a local Vagrant image.

During the postinstall steps (npm run download_metadata), nothing is downloaded; it throws the error below immediately. Note that I can download the entire AU.zip (or other dumps) from geonames without issue.

Another question: is it possible to use geonames to import from a local file, or do I have to rely on download_metadata etc.?

vagrant@ubuntu-focal:~/geonames$ npm run download_metadata

> pelias-geonames@0.0.0-development download_metadata /home/vagrant/geonames
> mkdirp metadata && node bin/updateMetadata.js

internal/streams/legacy.js:61
      throw er; // Unhandled stream error in pipe.
      ^

CsvError: Invalid Record Length: columns length is 19, got 1 on line 1
    at Parser.__onRecord (/home/vagrant/geonames/node_modules/csv-parse/lib/index.js:792:9)
    at Parser.__parse (/home/vagrant/geonames/node_modules/csv-parse/lib/index.js:668:38)
    at Parser._transform (/home/vagrant/geonames/node_modules/csv-parse/lib/index.js:474:22)
    at Parser.Transform._read (_stream_transform.js:191:10)
    at Parser.Transform._write (_stream_transform.js:179:12)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at Parser.Writable.write (_stream_writable.js:318:11)
    at Request.ondata (internal/streams/legacy.js:19:31)
    at Request.emit (events.js:314:20) {
  code: 'CSV_RECORD_DONT_MATCH_COLUMNS_LENGTH',
  bytes: 36,
  comment_lines: 0,
  empty_lines: 0,
  invalid_field_length: 0,
  lines: 1,
  records: 0,
  columns: [
    { name: 'ISO' },
    { name: 'ISO3' },
    { name: 'ISO_Numeric' },
    { name: 'fips' },
    { name: 'Country' },
    { name: 'Capital' },
    { name: 'Area' },
    { name: 'Population' },
    { name: 'Continent' },
    { name: 'tld' },
    { name: 'CurrencyCode' },
    { name: 'CurrencyName' },
    { name: 'Phone' },
    { name: 'Postal_Code_Format' },
    { name: 'Postal_Code_Regex' },
    { name: 'Languages' },
    { name: 'geonameid' },
    { name: 'neighbours' },
    { name: 'EquivalentFipsCode' }
  ],
  error: undefined,
  header: false,
  index: 1,
  column: 'ISO3',
  quoting: false,
  record: [ '# ================================' ]
}
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! pelias-geonames@0.0.0-development download_metadata: `mkdirp metadata && node bin/updateMetadata.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the pelias-geonames@0.0.0-development download_metadata script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/vagrant/.npm/_logs/2022-02-28T23_38_26_524Z-debug.log
missinglink commented 2 years ago

Hi @louis-h-p, I'm not 100% sure what's going on here, but it seems to be due to a change in the file format, specifically in how the '#' character is being used for comments.

Using a totally unrelated CSV tool I'm able to reproduce this error, which makes me more confident the issue isn't in our codebase:

curl -s http://download.geonames.org/export/dump/countryInfo.txt | sed '/^#/d' | xsv cat rows
AD  AND 020 AN  Andorra Andorra la Vella    468 77006   EU  .ad EUR Euro    376 AD###   ^(?:AD)*(\d{3})$    ca  3041565 ES,FR
CSV error: record 1 (line: 1, byte: 110): found record with 6 fields, but the previous record has 2 fields

That said, we can hopefully work around it. I will open a PR which implements my own handling of CSV comments, which seems to work fine; I'm still not completely clear on why mine works and these other ones don't 🤔
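For illustration only, and not the actual PR: the general idea is to drop only lines whose first character is '#' before handing the text to a CSV/TSV parser, since '#' also appears inside legitimate fields (see the AD### postal code format in the row above), so a comment option that strips everything after a '#' could truncate real data. A minimal Node sketch along those lines, using the countryInfo.txt URL from the reproduction above:

// Sketch only: fetch countryInfo.txt and keep just the data rows by
// dropping blank lines and lines that *start* with '#'. The URL and the
// tab delimiter come from the geonames dump format shown above.
const http = require('http');

const source = 'http://download.geonames.org/export/dump/countryInfo.txt';

http.get(source, (res) => {
  let body = '';
  res.setEncoding('utf8');
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const rows = body
      .split('\n')
      .filter((line) => line.trim() !== '' && !line.startsWith('#'));
    // Each remaining row is a tab-separated record; print a quick sanity check.
    console.log('data rows:', rows.length);
    console.log('first row:', rows[0].split('\t').slice(0, 5));
  });
});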

orangejulius commented 2 years ago

The Geonames servers are pretty notorious for changing file formats or hosting broken files for quite some time. Usually they change it back after a while.

But I checked and found the same thing as @missinglink. It looks like the countryInfo.txt file has a bunch of comments at the start. Pruning those out might help prevent issues like this.
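As a purely hypothetical example of that kind of pruning (the file paths are illustrative, not part of the project): read a locally downloaded countryInfo.txt, strip the '#' comment header, and write the cleaned copy into the metadata directory that the npm script above creates.

// Hypothetical helper: prune the '#' comment header from a local copy of
// countryInfo.txt and write the cleaned file into metadata/.
const fs = require('fs');

const raw = fs.readFileSync('countryInfo.txt', 'utf8');
const cleaned = raw
  .split('\n')
  .filter((line) => !line.startsWith('#'))
  .join('\n');

fs.writeFileSync('metadata/countryInfo.txt', cleaned);
console.log('wrote metadata/countryInfo.txt without the comment header');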

missinglink commented 2 years ago

Thanks for the bug report @louis-h-p, this issue seems to be due to the geonames files changing to include a CSV comment header prefixed with # characters.

Since it's a non-standard format, things broke, but we're handling it in our codebase now, so please try again.

louis-h-p commented 2 years ago

Thanks @missinglink & @orangejulius. That works now.