pelias / geonames

Import pipeline for Geonames into Pelias
https://pelias.io
MIT License

`Error: invalid signature` when unzipping Geonames data file #26

Open heffergm opened 8 years ago

heffergm commented 8 years ago

This occurs sporadically.

root@worker1:/mnt/pelias/logs# cat /mnt/pelias/logs/geonames_all.err
[Error: invalid signature: 0xc7cf971f]
_writableState.buffer is deprecated. Use _writableState.getBuffer() instead.
events.js:85
      throw er; // Unhandled 'error' event
            ^
Error: invalid signature: 0xc7cf971f
    at /mnt/pelias/pelias-geonames/releases/20160108001248/node_modules/geonames-stream/node_modules/unzip/lib/parse.js:63:13
    at processImmediate [as _immediateCallback] (timers.js:358:17)

orangejulius commented 8 years ago

Since this ticket was opened, we've completely revamped the Geonames importer. This looks like a transient issue caused by an invalid download file. If it happens again, I will take another look.

orangejulius commented 7 years ago

Hello, two years later: this issue has happened again, and it is an issue in the unzip utility we use. Fortunately, there is a replacement.

orangejulius commented 7 years ago

This does not necessarily appear to be an issue with the unzip NPM package itself. Something about the files downloaded to disk leaves them slightly corrupt, in a way that the unzip command-line program handles fine but the unzip NPM package does not. It's unclear whether our downloader causes the corruption or the Geonames server distributes the files in this corrupted state.
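
The hex value in the error message is the four bytes the parser found where it expected a zip record signature. As a rough diagnostic, the start of a downloaded file can be compared against the signatures defined by the ZIP format. This is a minimal sketch in Python, not the importer's Node.js code; `leading_signature` and `starts_like_zip` are hypothetical helper names. It only inspects the first four bytes, so it would not catch a bad signature that appears later in the file.

```python
import struct

# Record signatures defined by the ZIP format (stored little-endian on disk).
LOCAL_FILE_HEADER = 0x04034B50    # bytes b"PK\x03\x04"
CENTRAL_DIRECTORY = 0x02014B50    # bytes b"PK\x01\x02"
END_OF_CENTRAL_DIR = 0x06054B50   # bytes b"PK\x05\x06"


def leading_signature(path):
    """Return the first four bytes of `path` as an unsigned 32-bit integer,
    or None if the file is shorter than four bytes. (Hypothetical helper.)"""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return None
    return struct.unpack("<I", head)[0]


def starts_like_zip(path):
    """True if the file begins with a local file header or, for an
    empty archive, an end-of-central-directory record."""
    return leading_signature(path) in (LOCAL_FILE_HEADER, END_OF_CENTRAL_DIR)
```

Printing `hex(leading_signature(path))` for a suspect download gives a value in the same form as the one in the error message.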

orangejulius commented 7 years ago

This issue is still occurring in our builds, despite the attempts in #154 and https://github.com/pelias/geonames/pull/171 to solve or work around it. This has been happening periodically since the very creation of this repository (it is in fact a dupe of the very first issue in the repo).

We need to consider an alternate unzip method, such as a more robust command-line unzip, or some other solution.

asdfasdafas commented 6 years ago

Does anyone know if there is a workaround for this? I'm currently encountering this issue, and I'm unable to complete the import.

orangejulius commented 6 years ago

Hey @asdfasdafas, we have somewhat of a workaround, but it's not great. Since the problem is (we think) inherent in the zipfile as published by Geonames, we get around it for Mapzen Search by caching old, valid zipfiles.
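
The caching approach could be sketched roughly as follows: check the fresh download and fall back to a cached copy if the check fails. This is an illustration in Python, not the code Mapzen Search uses; `is_readable_zip` and `choose_archive` are hypothetical names. One caveat: Python's `zipfile` reads the archive through its central directory, much like the command-line `unzip`, so it may accept files that the streaming NPM parser rejects. The check catches truncated or garbage downloads, not necessarily the specific corruption discussed here.

```python
import zipfile


def is_readable_zip(path):
    """Return True if `path` opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def choose_archive(fresh_path, cached_path):
    """Prefer the fresh download; fall back to the cached known-good copy."""
    if is_readable_zip(fresh_path):
        return fresh_path
    if is_readable_zip(cached_path):
        return cached_path
    raise RuntimeError("neither the fresh download nor the cached archive is readable")
```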

One possible alternative workaround would be to change our code to avoid using the Node.js zip library and use a standard command-line unzip instead. This would require some reorganizing of the code in this importer, but if you were interested in taking a look at it, I'd be happy to help point you in the right direction. We would gladly accept a PR that does that :)

asdfasdafas commented 6 years ago

Ah I probably wouldn't be much help on the node.js code, but would you happen to know where I could download copies of the known-good geonames files?

orangejulius commented 6 years ago

No worries. This is the one we have cached for Mapzen Search: https://s3.amazonaws.com/pelias-data/geonames/allCountries.zip

Its modification time is Nov 18, 2017 7:05:50 PM GMT-0500, so it's not TOO old.

orangejulius commented 6 years ago

An update here: as it turns out, there is no correct way to stream a zip file without loading it into memory. This makes sense, as you can't pipe to or from unzip on the command line.
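
One way to see why: the ZIP format keeps its authoritative index, the central directory, at the end of the file, and a reader finds it through the end-of-central-directory (EOCD) record. A reader that consumes the file front to back never consults that index. The sketch below, in Python with the hypothetical helper name `read_eocd`, locates the EOCD record by scanning the tail of the file. It assumes a single-disk archive and does not handle ZIP64.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22        # fixed-size part of the EOCD record
MAX_COMMENT = 0xFFFF      # the trailing archive comment can be at most this long


def read_eocd(path):
    """Locate the end-of-central-directory record and return
    (total_entries, central_directory_offset). No ZIP64 support."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # seek to the end to learn the file size
        size = f.tell()
        tail_len = min(size, EOCD_MIN_SIZE + MAX_COMMENT)
        f.seek(size - tail_len)
        tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIGNATURE)      # the record sits near the very end
    if pos < 0 or len(tail) - pos < EOCD_MIN_SIZE:
        raise ValueError("no end-of-central-directory record found")
    # Fields: signature, disk number, disk with central directory,
    # entries on this disk, total entries, directory size,
    # directory offset, comment length.
    (_sig, _disk, _cd_disk, _disk_entries, total_entries,
     _cd_size, cd_offset, _comment_len) = struct.unpack(
        "<4sHHHHIIH", tail[pos:pos + EOCD_MIN_SIZE])
    return total_entries, cd_offset
```

Because the first step is a seek to the end of the file, this kind of lookup needs a complete, seekable file and cannot run on a pipe.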

We have two options: switch to a library like yauzl, which implements a non-streaming API for reading zip files, or extract zip files after download to expose the underlying text file, which IS stream-able.

My vote is for the second approach, since it would have the added benefit of removing code, whereas adjusting our existing code to use yauzl may be a bit of tedious work.
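
The second approach could look roughly like this, sketched in Python rather than the importer's Node.js: extract the archive to disk once, then stream the plain text file line by line. The function names are hypothetical, and the member name (for example `allCountries.txt`) is an assumption the caller passes in. Geonames dumps are tab-delimited UTF-8 text, which is what `iter_rows` relies on.

```python
import zipfile


def extract_member(zip_path, member, dest_dir):
    """Extract one member into dest_dir and return the path of the extracted file."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.extract(member, path=dest_dir)


def iter_rows(txt_path):
    """Yield each line of the extracted file as a list of tab-separated fields.
    The file is read lazily, so it is never loaded into memory as a whole."""
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")
```

A caller would extract once after download and then iterate, for example `for row in iter_rows(extract_member(zip_path, "allCountries.txt", data_dir)): ...`.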

In either case, https://github.com/pelias/geonames/issues/297 is effectively a prerequisite.

orangejulius commented 6 years ago

Update: a possible workaround here is to download the broken Geonames zip file, extract the data with unzip, and then re-compress it with zip. This seems to create archives that the importer can successfully read.
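
A rough Python stand-in for that extract-and-recompress step is sketched here, using the standard `zipfile` module instead of the `unzip` and `zip` command-line tools. `repack_zip` is a hypothetical name. Whether an archive rewritten this way is readable by the importer has not been verified; the observation above is only about archives produced by the command-line tools.

```python
import shutil
import zipfile


def repack_zip(src_path, dst_path):
    """Copy every file member of src_path into a freshly written archive at dst_path."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # Copy in chunks so a large member is never held in memory at once.
            # Members larger than 2 GiB would need force_zip64=True in dst.open().
            with src.open(info) as reader, dst.open(info.filename, "w") as writer:
                shutil.copyfileobj(reader, writer)
```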