marklit opened this issue 1 year ago
I don't immediately know what is going on here, but I have been making other improvements to Tippecanoe in https://github.com/felt/tippecanoe, including some that are meant to reduce memory consumption, so I would suggest trying with that version.
If you can share your data file, I can try to reproduce the problem myself.
I'm going to run that fork you mentioned and I'll report back with my results.
The dataset itself is the last 14 releases of the FCC's 'without satellite' 477 data.
https://www.fcc.gov/general/broadband-deployment-data-fcc-form-477
The 14 CSV files were converted into JSONL. At one point this data lived in my client's BigQuery instance and `ST_GEOGFROMWKB` was used to convert the WKB into a text-based version. I had to run that column through `shapely` and `geojson` before it was ready for tippecanoe.
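The end result of that conversion is one GeoJSON Feature per JSONL line. Here is a minimal stdlib-only sketch of that shape; the field names and coordinates are hypothetical, and in the real pipeline the `geometry` member comes from the WKT column via shapely (`shapely.wkt.loads` followed by `shapely.geometry.mapping`):

```python
import json

# Hypothetical record and geometry; in practice the geometry is produced
# from the WKT column by shapely rather than written by hand.
record = {"BlockCode": "060372087001000", "Provider_Id": 9999}
geometry = {"type": "Point", "coordinates": [-118.24, 34.05]}

# tippecanoe's JSONL input wants one GeoJSON Feature per line.
feature = {"type": "Feature", "geometry": geometry, "properties": record}
print(json.dumps(feature))
```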
I don't have a good way to share the ~20 GB compressed version of this dataset. It might be quicker to download the latest release, convert it to JSONL and then duplicate it 14 times.
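The duplication suggestion is a short script; a sketch, assuming one converted release in a file called `latest.jsonl` (both filenames here are hypothetical):

```python
# Approximate the 14-release dataset by repeating one converted release 14 times.
# Write a one-line stand-in for the real download so the sketch is self-contained.
with open("latest.jsonl", "w") as f:
    f.write('{"type": "Feature"}\n')

with open("latest.jsonl") as src:
    data = src.read()
with open("x14.jsonl", "w") as dst:
    for _ in range(14):
        dst.write(data)
```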
Just to report back, I tried that fork and the job was killed after some time. The VM I ran it on has 64 GB.
I partitioned the 60M records on their H3 resolution 1 values, which broke the records up into 44 files. The files weren't evenly sized, but this was the quickest way I could think of to break up the dataset.
GeoJSON Size | Filename |
---|---|
16G | 812abffffffffff |
14G | 81267ffffffffff |
11G | 8126fffffffffff |
7.8G | 8144fffffffffff |
7.0G | 81263ffffffffff |
6.2G | 81277ffffffffff |
5.1G | 812a3ffffffffff |
4.7G | 81447ffffffffff |
4.2G | 8129bffffffffff |
3.6G | 8126bffffffffff |
3.6G | 8148bffffffffff |
3.2G | 81283ffffffffff |
2.8G | 8128bffffffffff |
2.8G | 8148fffffffffff |
2.2G | 812bbffffffffff |
1.9G | 8128fffffffffff |
1.5G | 8127bffffffffff |
1.4G | 812afffffffffff |
1.2G | 8144bffffffffff |
1.1G | 81443ffffffffff |
1.1G | 814cfffffffffff |
521M | 812b3ffffffffff |
409M | 81273ffffffffff |
392M | 8112fffffffffff |
90M | 81293ffffffffff |
85M | 81467ffffffffff |
81M | 810c7ffffffffff |
61M | 815d3ffffffffff |
41M | 810d7ffffffffff |
34M | 81487ffffffffff |
18M | 814f7ffffffffff |
14M | 810c3ffffffffff |
13M | 8112bffffffffff |
12M | 8113bffffffffff |
11M | 810d3ffffffffff |
11M | 810cfffffffffff |
4.3M | 819a3ffffffffff |
1.9M | 810cbffffffffff |
1.2M | 8122fffffffffff |
1.1M | 811d3ffffffffff |
1.1M | 810dbffffffffff |
448K | 819bbffffffffff |
286K | 814e7ffffffffff |
71K | 81227ffffffffff |
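The partitioning step can be sketched like this (stdlib only; the `h3_r1` property is a hypothetical precomputed field — in practice the index would come from the `h3` library, e.g. `h3.latlng_to_cell(lat, lng, 1)`):

```python
import json
from collections import defaultdict

def partition_by_h3(lines):
    """Bucket JSONL features by a precomputed H3 resolution-1 index."""
    buckets = defaultdict(list)
    for line in lines:
        feature = json.loads(line)
        buckets[feature["properties"]["h3_r1"]].append(line)
    return buckets

# Each bucket would then be written out as <index>.geojson,
# e.g. 812abffffffffff.geojson.
lines = [
    '{"type": "Feature", "properties": {"h3_r1": "812abffffffffff"}}',
    '{"type": "Feature", "properties": {"h3_r1": "81267ffffffffff"}}',
    '{"type": "Feature", "properties": {"h3_r1": "812abffffffffff"}}',
]
buckets = partition_by_h3(lines)
```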
I then ran tippecanoe on them one file at a time, as there were RAM usage spikes and I didn't want to suffer any OOM issues. Usually ~13 GB of RAM was in use on my 64 GB system, though this would spike at odd times. The process took 3 weeks to complete.
```bash
$ ls 8*.geojson \
    | xargs -P1 \
            -n1 \
            -I% \
            bash -c 'HEXVAL=`echo % | sed "s/.geojson//g"`; tippecanoe --coalesce-densest-as-needed -zg --extend-zooms-if-still-dropping -e fcc_477_$HEXVAL $HEXVAL.geojson'
```
The process produced 4.6 GB of PBF data across 168,754 files.
I took a 100K record sample (~179 MB in GeoJSON) and ran it through strace and produced a FlameGraph. On an e2-highmem-4 with 4 vCPUs and 32 GB of RAM in GCP's LA zone the following runs in 115 seconds and produces 31.7K PBFs totalling 147 MB in size. This is one PBF for roughly every 3 records.
```bash
$ tippecanoe \
    --coalesce-densest-as-needed \
    -zg \
    --extend-zooms-if-still-dropping \
    -e fcc_477 \
    out_100k.geojson
```
There are 30K `write` calls, 37K `read` calls and ~2K `openat` calls. Around 96% of the time is spent waiting on `futex`.
```
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.61   34.837147      757329        46         3 futex
  1.66    0.600337          20     29388           write
  1.16    0.417305          11     37360           read
  0.32    0.116778          59      1963           close
  0.10    0.035662          18      1965         1 openat
  0.03    0.011153          12       927           fcntl
  0.02    0.008630          18       467           unlink
  0.02    0.006390         114        56           munmap
  0.02    0.006108          11       517           fstat
  0.02    0.005534          11       467           getpid
  0.02    0.005515          50       110           madvise
  0.01    0.004425          41       107           clone
  0.00    0.001751          17        98           mmap
  0.00    0.001383          17        78           brk
  0.00    0.000258          15        17           mprotect
  0.00    0.000223         223         1           execve
  0.00    0.000143          17         8           ftruncate
  0.00    0.000086          86         1           mkdir
  0.00    0.000069           8         8           pread64
  0.00    0.000035          17         2         2 stat
  0.00    0.000035          17         2           getdents64
  0.00    0.000032          10         3           rt_sigaction
  0.00    0.000023          11         2         1 arch_prctl
  0.00    0.000018           8         2           prlimit64
  0.00    0.000017          17         1           sysinfo
  0.00    0.000015          15         1           fstatfs
  0.00    0.000015          14         1         1 access
  0.00    0.000014          14         1           lseek
  0.00    0.000010           9         1           set_tid_address
  0.00    0.000009           9         1           set_robust_list
  0.00    0.000007           7         1           rt_sigprocmask
------ ----------- ----------- --------- --------- ----------------
100.00   36.059126                 73602         8 total
```
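As a sanity check on the summary above, the `futex` share of total syscall time matches the ~96% figure:

```python
# Figures taken directly from the strace -c summary above.
futex_seconds = 34.837147
total_seconds = 36.059126

futex_share = futex_seconds / total_seconds * 100
print(round(futex_share, 2))  # matches strace's 96.61 in the "% time" column
```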
This operation also has around 9.6K context switches and 43K page faults.
```
     9,664      context-switches          #   81.434 /sec
       105      cpu-migrations            #    0.885 /sec
    43,284      page-faults               #  364.735 /sec
```
Below is a FlameGraph:
The main overhead appears to be the sheer number of files that need to be written out. With fewer PBFs this process should run a lot quicker. It also leads to the small-file problem, where a lot of filesystem overhead comes simply from having too many files. Is there a way to cut down the number of PBFs being produced?
If I output to a single `.mbtiles` file it takes substantially longer, so I'm not sure that alone would be an answer for a 60M-record dataset that already takes 3 weeks to convert to PBFs.
I don't have much more to report in terms of RAM consumption, but if it can be kept down I should be able to run more tippecanoe commands in parallel with one another. The peak-RAM-to-process ratio is very high at the moment, and RAM is the most expensive hardware per GB on GCP.
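To make the RAM point concrete, here is a back-of-envelope count of how many runs the box could host, using the ~13 GB steady-state figure observed above (the spikes mean a real deployment would need a safety margin):

```python
# 64 GB box, ~13 GB steady-state per tippecanoe run (figures from above).
total_gb = 64
per_job_gb = 13

parallel_jobs = total_gb // per_job_gb
print(parallel_jobs)  # 4 concurrent runs at steady state, before allowing for spikes
```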
I'm running the following on a system with 64 cores and 64 GB of RAM. After a few hours of running, the application appears to exhaust all available memory and is terminated by the kernel.
Is there any workaround for this?
Here is an example record from the 102 GB JSONL file: