pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License
26 stars 42 forks source link

Corrupt WOF records are not skipped during import #405

Open nilsnolde opened 5 years ago

nilsnolde commented 5 years ago

Hi, there were quite a few corrupt WOF postal code records during download, e.g.

error downloading whosonfirst-data-postalcode-ca-latest.tar.bz2 bundle: Error: Command failed: curl https://dist.whosonfirst.org/bundles/whosonfirst-data-postalcode-ca-latest.tar.bz2 | tar -xj --strip-components=1 --exclude=README.txt -C /srv/pelias_importer_ext/data/wof && mv /srv/pelias_importer_ext/data/wof/whosonfirst-data-postalcode-ca-latest.csv /srv/pelias_importer_ext/data/wof/meta

That was a few days ago. Weirdly, curl https://dist.whosonfirst.org/bundles/whosonfirst-data-postalcode-ca-latest.tar.bz2 | tar -xj --strip-components=1 --exclude=README.txt works now, if done manually. Even though the timestamp of that record didn't change since July..

There's a bunch of bzip2 and tar errors in the logs for the above command. Same for Japan, Portugal and GB postal codes. When downloading is finished and he's trying to inject them to ES, the following fatal error occurs:

2018-09-27T19:29:31.322Z - ESC[32minfoESC[39m: [whosonfirst] Loading whosonfirst-data-postalcode-ca-latest.csv records from /srv/pelias_importer_ext/data/wof/meta
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: ENOENT: no such file or directory, open '/srv/pelias_importer_ext/data/wof/meta/whosonfirst-data-postalcode-ca-latest.csv'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! pelias-whosonfirst@0.0.0-development start: `node import.js`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the pelias-whosonfirst@0.0.0-development start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/ubuntu/.npm/_logs/2018-09-27T19_29_31_479Z-debug.log

Then the WOF importer ends on error instead of skipping the Canada postal codes. Any chance of improving that behavior?

nilsnolde commented 5 years ago

And if this would be a good first issue, let me know. I'm serious about contributing. At some point I wanna have another Pelias importer for custom user data anyways. Getting to know the stack with smaller issues would be helpful. But then a quick pointer in the right direction would be appreciated:)

missinglink commented 5 years ago

Hi @nilsnolde. What do you think the issue is here?

Do you think maybe it was just that the server was having a bad day and now it works, or is there a difference between executing the commands manually on the CLI vs. using the Pelias scripts?

nilsnolde commented 5 years ago

Well, it has to be a network issue. I'm not saying there's a flaw in the code, obviously I executed the exact same command manually without problems. I does happen consistently though: canada won't get processed when using the importer, it's always the above error over the last 2-3 weeks. The real problem is likely curl and its GnuTLS. So, yes, I have to dig in there.

My question here is: the importer could also skip files, in case it doesn't find them right. Instead of failing entirely.

orangejulius commented 5 years ago

We had a bug a while back where one layer's data was being downloaded twice, and the two downloads interacted badly together, so something like that could happen still.

However, in general I think all our importers need to support a missingFilesAreFatal flag everywhere, like this importer already does in some places.

Sometimes, you want a build to fail if anything at all goes wrong. Sometimes you want to ignore failures to download/parse data. Most Pelias code was written to fail if anything goes wrong, since that's what we wanted for Mapzen Search, but I think we need to support both modes everwhere.

nilsnolde commented 5 years ago

Ah, damn it, there's the switch right there!! And of course it's set true in our pelias.json..

Thanks @orangejulius again!

And for the record: I agree, having a switch is really good. At least for importers importing heaps of file like wof and oa.

orangejulius commented 5 years ago

Oh, I didn't know it would actually work :) There's a lot of ways to download files in this repo, and I bet in at least some of the code paths, failures are still not ignored. So if you do see that, let us know.

nilsnolde commented 5 years ago

Hm ok, sorry, apparently setting missingFilesAreFatal=false didn't fix the issue:

0 info it worked if it ends with ok
1 verbose cli [ '/usr/bin/node', '/usr/bin/npm', 'start' ]
2 info using npm@5.6.0
3 info using node@v8.11.3
4 verbose run-script [ 'prestart', 'start', 'poststart' ]
5 info lifecycle pelias-whosonfirst@0.0.0-development~prestart: pelias-whosonfirst@0.0.0-development
6 info lifecycle pelias-whosonfirst@0.0.0-development~start: pelias-whosonfirst@0.0.0-development
7 verbose lifecycle pelias-whosonfirst@0.0.0-development~start: unsafe-perm in lifecycle true
8 verbose lifecycle pelias-whosonfirst@0.0.0-development~start: PATH: /usr/lib/node_modules/npm/node_modules/npm-lifecycle/node-gyp-bin:/home/ubuntu/whosonfirst/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
9 verbose lifecycle pelias-whosonfirst@0.0.0-development~start: CWD: /home/ubuntu/whosonfirst
10 silly lifecycle pelias-whosonfirst@0.0.0-development~start: Args: [ '-c', './bin/start' ]
11 silly lifecycle pelias-whosonfirst@0.0.0-development~start: Returned: code: 1  signal: null
12 info lifecycle pelias-whosonfirst@0.0.0-development~start: Failed to exec start script
13 verbose stack Error: pelias-whosonfirst@0.0.0-development start: `./bin/start`
13 verbose stack Exit status 1
13 verbose stack     at EventEmitter.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/index.js:285:16)
13 verbose stack     at emitTwo (events.js:126:13)
13 verbose stack     at EventEmitter.emit (events.js:214:7)
13 verbose stack     at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/lib/spawn.js:55:14)
13 verbose stack     at emitTwo (events.js:126:13)
13 verbose stack     at ChildProcess.emit (events.js:214:7)
13 verbose stack     at maybeClose (internal/child_process.js:925:16)
13 verbose stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:209:5)
14 verbose pkgid pelias-whosonfirst@0.0.0-development
15 verbose cwd /home/ubuntu/whosonfirst
16 verbose Linux 4.4.0-28-generic
17 verbose argv "/usr/bin/node" "/usr/bin/npm" "start"
18 verbose node v8.11.3
19 verbose npm  v5.6.0
20 error code ELIFECYCLE
21 error errno 1
22 error pelias-whosonfirst@0.0.0-development start: `./bin/start`
22 error Exit status 1
23 error Failed at the pelias-whosonfirst@0.0.0-development start script.
23 error This is probably not a problem with npm. There is likely additional logging output above.
24 verbose exit [ 1, true ]

Any idea?

jeremy-rutman commented 5 years ago

I also hit a download error (in a file that downloads with wget in a few seconds) and so set missingFileAreFatal to false, after which the downloads seem to be fine - either due to the switch or a serendipitously faster connection but this didnt help matters . I did see that the tar took a long time (more than 5 min) once I had the file with wget . SO possibly the line in download_data_all.js

`curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
`${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;

should be broken into several lines. I don;t know if wget is more reliable than curl

jeremy-rutman commented 5 years ago

ok by changing to wget and local files this runs to completion , I will do a pr

  function generateCommand(bundle, directory) {
    const csvFilename = bundle.replace(/-\d{8}T\d{6}-/, '-latest-') // support timestamped downloads
                              .replace('.tar.bz2', '.csv');
//    return `curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
//      `${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;
     var command=`wget ${wofDataHost}/bundles/${bundle} && tar -xvj --strip-components=1 --exclude=README.txt -C ${directory} -$
     console.log(command);
     return command;
  }