Closed cldellow closed 2 months ago
Hmmm. The Ubuntu Makefile build has failed twice in a row, on the "Generate mbtiles with Github action" workflow step. It looks like a segfault happens:
2024-09-21T03:19:08.3604627Z Reading .pbf liechtenstein.osm.pbf
2024-09-21T03:19:08.3967057Z
2024-09-21T03:19:08.3970871Z (Scanning for ways used in relations: 0%) (23 ms)
2024-09-21T03:19:08.4895426Z /home/runner/work/_temp/b7889b5c-995f-44e1-8026-c41c35ac1e82.sh: line 3: 2472 Segmentation fault (core dumped) ./tilemaker liechtenstein.osm.pbf --config=resources/config-openmaptiles.json --process=resources/process-openmaptiles.lua --output=liechtenstein.mbtiles --verbose --store /tmp/store
2024-09-21T03:19:08.4906982Z ##[error]Process completed with exit code 139.
That's pretty coincidental! I think the point of that workflow step is to verify that https://github.com/systemed/tilemaker/blob/master/action.yml is a valid GitHub action file, in the event that the PR has modified it and broken it, for example.
What's odd is that this PR doesn't change that file, and the action itself uses the docker://ghcr.io/systemed/tilemaker:master
image, which (a) isn't an artifact of this PR--it's the previously built image from the last push to master and (b) ought to be the same image on the Ubuntu Makefile and Ubuntu CMake builds, so it's odd that the Makefile build failed twice in a row, and the CMake build worked twice in a row. Maybe just random luck.
Thank you!
The CI seems to have become very flakey recently, I think largely due to dependency issues I don't fully understand. It needs sorting at some point but I don't think erratic partial failures are currently a reason not to merge stuff.
:+1: Yup, not blocking merges makes a lot of sense.
I think this specific failure--a segfault running a previously-published docker image--hints at a bug in tilemaker vs a bug in the CI scripts, unfortunately. (Separately, the CI may also be flakey, of course!)
I did some futzing locally and it looks like I'm able to repro if I run tilemaker in a loop under gdb (following the instructions at https://stackoverflow.com/a/6546674/1011680). For me, it seems to occur about 1% of the time. It occurs with or without luajit, and with or without store mode. I haven't yet seen it happen with --threads 1
, so I think we're probably not protecting a data structure appropriately against concurrent access, causing undefined behaviour. The stack traces in the faulting thread are all over the place, which I think supports this -- something read/wrote into a spot it shouldn't have, corrupted whatever happened to be nearby, and then the world ends in a random location shortly thereafter when something steps on the landmine that just got created.
The corruption must happen relatively early, which is handy. I hacked tilemaker just to do the read node phase, and then exit, so that I could get more iterations of the test. I'll keep poking around and see if I can figure out what's going on.
This fixes two issues:
Hat tip @oobayly for reporting this.
Fixes #750.