t-rex-tileserver / t-rex

t-rex is a vector tile server specialized in publishing MVT tiles from your own data
https://t-rex.tileserver.ch/
MIT License

`generate` never seems to exit #240

Closed chris-aeviator closed 3 years ago

chris-aeviator commented 3 years ago

When running t_rex generate --maxzoom 20 --progress on a config that loads a .gpkg file with around 240,000 hexagons (MULTIPOLYGON, around 10 meters wide), I get a cache folder with all zoom levels. I made sure to set the maxquery value to something ridiculously high (100 million) until I no longer saw any warnings.
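
The config has roughly the following shape. This is only a sketch: the file, table, and field names are placeholders, and the key names (in particular the per-query feature limit I call maxquery above) should be double-checked against the t-rex configuration docs.

```toml
# Rough sketch with placeholder names; verify key names against the t-rex docs.
[[datasource]]
name = "hexagons"
path = "hexagons.gpkg"        # GDAL datasource pointing at the GeoPackage

[grid]
predefined = "web_mercator"

[[tileset]]
name = "hexagons"

[[tileset.layer]]
name = "hexagons"
table_name = "hexagons"       # layer name inside the .gpkg (placeholder)
geometry_field = "geom"       # placeholder
geometry_type = "MULTIPOLYGON"
query_limit = 100000000       # the "maxquery" limit set ridiculously high (key name assumed)

[cache.file]
base = "/var/cache/t-rex"

# generated with: t_rex generate --config hexagons.toml --maxzoom 20 --progress
```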

However, I have to manually cancel the process, even after waiting 30 minutes after the last tile has been created:

Level 20: 58166 / 58166 [===============================================]

Otherwise the process does not seem to finish. During cache creation I can see 12 cores being utilized at 100% (yay, nice job). After the last tile finishes, CPU usage drops but still stays high on almost all cores. The progress bar shows no more updates and I'm unsure whether something important is still running.

It might be another issue, but I keep experiencing "holes" in the resulting tiles at certain zoom levels. The max zoom level always shows the complete dataset. I can see this issue both in the included viewer and in my (https://deck.gl/ based) product.

incorrect (zoom 13, zoom 12, zoom 10, ...): [screenshots]

correct (zoom 14): [screenshot]

chris-aeviator commented 3 years ago

@pka any idea why the process doesn't exit?

pka commented 3 years ago

I think it's a logic error like the one described in https://rust-lang.github.io/wg-async-foundations/vision/status_quo/aws_engineer/solving_a_deadlock.html
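
Roughly the pattern, as a minimal generic tokio sketch of that class of deadlock (not the article's exact scenario and not t-rex code): one side awaits a send into a bounded channel that is full, while the receiver only starts draining after that sender has finished, so both wait on each other forever.

```rust
// Generic illustration of a bounded-channel async deadlock (not t-rex code).
// Requires tokio with the "macros", "rt" and "sync" features.
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1); // bounded: capacity 1

    let producer = tokio::spawn(async move {
        for i in 0..10 {
            // Once the buffer is full this send().await never completes,
            // because the receiver below is only drained after the join.
            tx.send(i).await.unwrap();
        }
    });

    // Deadlock: we wait for the producer to finish before draining the
    // channel, while the producer waits for the channel to be drained.
    producer.await.unwrap(); // hangs forever

    while let Some(i) = rx.recv().await {
        println!("got {i}");
    }
}
```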

Could you make the GPKG available for reproducing the problem (my email is in my GitHub info)? I would like to solve this before releasing 0.14.

chris-aeviator commented 3 years ago

Happy to provide it; the smallest one that reproduces the issue, though, has

Feature Count: 2452803

and is roughly 800 MB.

Are you willing to run this?

pka commented 3 years ago

If I can reproduce the deadlock with it, sure!

chris-aeviator commented 3 years ago

Sending it to you via email with a link.

So I can see that my 786 MB file gets processed into a folder of 985 MB (zoom 12-20). After the "major work" has finished I can see a drop in the nodes' network communication (I'm running with --nodes 2, one process with --nodeno 0 and the other with --nodeno 1).

[screenshot: network traffic]

I say the major work seems to have finished because, although I can still see unwrap errors in the t-rex log, the cache folder no longer grows in size. My CPU is still utilized, but I can hear the fans go down significantly at the point the network communication stops (it is only around 2.5 MB/s, so that can't be the bottleneck).

EDIT: Maybe I have been too impatient; at least today, with this dataset, t-rex seems to pick up its work again, though with about one core less. I'm waiting to see whether I can still reproduce the previous issue (which I had with a 10 GB GPKG).

EDIT 2: After another 30 minutes, CPU usage dropped by another 3 cores, the folder stopped growing at 1.7 GB, there is no network communication, and generate has not exited. Being patient this time :)

chris-aeviator commented 3 years ago

I can now confirm that layers which do not raise the error described in https://github.com/t-rex-tileserver/t-rex/issues/243 cleanly exit the generate step.

pka commented 3 years ago

Fixed in 01108a6. Connection timeouts (#243) are now properly handled.