protomaps / go-pmtiles

Single-file executable tool for working with PMTiles archives
BSD 3-Clause "New" or "Revised" License
350 stars 48 forks source link

Slow query for totalTiles in MBTiles -> PMTiles conversion #127

Closed lseelenbinder closed 6 months ago

lseelenbinder commented 7 months ago

We're using pmtiles convert to do some large conversion jobs (including global data). Along the way, I realized that there's a select count(*) from tiles that only exists to produce accurate progress bars.

What would your preferred approach to eliminating this query (it can take 10s of minutes on large enough archives for essentially no real result)? I could see coordinating this with #117 or adding a flag something along the lines of --impreciseProgress. I'm happy to do either, but would like to know what you'd prefer.

bdon commented 7 months ago

Is running in quiet mode with no logged output acceptable for your use case?

lseelenbinder commented 7 months ago

As long as the side-effect means the query doesn't happen, it's definitely an improvement over the status quo.

Ideally we'd also have a solution that allows for some progress reporting in CI—but of course we'd face similar problems mentioned in #117 related to thousands of log lines. Ideally, I'd advocate for a more CI-friendly style of progress reporting behind a flag as a good final state.

mem48 commented 6 months ago

I'm not sure if this is related but I'm experiencing really slow performance on converting mbtiles to pmtiles.

My dataset is a 158 GB .mbtiles containing 2m resolution raster data saved as webp images produced from rio rgbify.

I'm currently on Pass 2: writing tiles with the progress bar saying:

11% | (405094/3638545, 54 it/min) [57h56m55s:989h46m23s]

I know it is a large dataset, but surely it shouldn't take 40 days on a high-spec desktop. The PC doesn't appear to be working hard (10% CPU, 8% Memory, 1% Disk)

bdon commented 6 months ago

That sounds like a separate issue, can you upload your 158GB .mbtiles somewhere so others can reproduce?

mem48 commented 6 months ago

@bdon A share link to my mbtiles file https://leeds365-my.sharepoint.com/:f:/g/personal/earmmor_leeds_ac_uk/EnCVP-D0VXtAvHdz14qlI7MBSeX-DSwe_BE3yagjt9yWJg?e=6E2wzK it is still uploading so you will have to wait a while. Currently uploading at 9 MB/s so will take about 5 hours to upload. Edit: Upload complete and will be available for 30 days

bdon commented 6 months ago

@mem48 did you try running with convert --no-deduplication?

bdon commented 6 months ago

@mem48 your mbtiles is missing an index:

CREATE UNIQUE INDEX tile_index on tiles (zoom_level, tile_column, tile_row);
bdon commented 6 months ago

@lseelenbinder are your mbtiles missing the index too? it's not a MUST via the spec but the convert as a whole will be slow without it

bdon commented 6 months ago

@mem48 on my laptop converting your file with the index added and --no-deduplication took 27 minutes

lseelenbinder commented 6 months ago

@lseelenbinder are your mbtiles missing the index too? it's not a MUST via the spec but the convert as a whole will be slow without it

Yes, there's an index, though the mbtiles uses the tiles view with a mapping + data table to deduplicate, so it's a bit slower than a simple setup. It looks like #142 fully fixes this for our purposes though.

mem48 commented 6 months ago

Thanks @bdon Adding the index fixed my problem and only took 45 minutes even with deduplication. Is it worth a check and warning in go-pmtiles given the enormous effect on performance?

In case anybody else wants to reproduce my fix on Ubunut 22 was:

sudo apt-get install sqlite3 libsqlite3-dev
sqlite3 DSM_England_2m.mbtiles
sqlite> CREATE UNIQUE INDEX tile_index on tiles (zoom_level, tile_column, tile_row);
sqlite> .exit
./pmtiles convert DSM_England_2m.mbtiles DSM_England_2m.pmtiles

Adding an index with rio rgbify is possible but requires using the --co NAME=VALUE argument according to ChatGPT --co SPATIAL_INDEX=YES or --co INIT_WITH_SQL="CREATE UNIQUE INDEX tile_index on tiles (zoom_level, tile_column, tile_row);" should work but I have not tested this yet.

bdon commented 6 months ago

can you find a way to detect the missing index in SQL?

bdon commented 6 months ago

1.18 removes the query for count because there's not much point in executing the same thing twice (table scan) for two passes. Looks like this is improved if there's a unique index on tiles per the MBTiles spec.

Related to the missing index, we can't make any assumptions about whether the tiles table is a view or a table, either we can find some portable internal SQLite heuristic, or we point out in the docs that your tiles table should be indexed.

voncannon commented 5 months ago

Appreciate these changes... I don't have timings from before, but it is way way faster working with my last conversion of 300 GB.

Only took 1 hour (macOS M2 Max) writing on an NVMe

(base) bvc@bvc gstore % ls -lh usgstopo.mbtiles
-rw-r--r--@ 1 bvc  staff   293G Feb 29 22:32 usgstopo.mbtiles

(base) bvc@bvc gstore % time ./pmtiles convert usgstopo.mbtiles usgstopo.pmtiles 
2024/02/29 22:41:06 convert.go:260: Pass 1: Assembling TileID set
2024/02/29 22:41:10 convert.go:291: Pass 2: writing tiles
100% |████████████████████████████████████████████████████████████████████████████████████████████████████████████████| (20332169/20332169, 5866 it/s)            
2024/02/29 23:38:56 convert.go:345: # of addressed tiles:  20332169
2024/02/29 23:38:56 convert.go:346: # of tile entries (after RLE):  19768407
2024/02/29 23:38:56 convert.go:347: # of tile contents:  19594193
2024/02/29 23:38:59 convert.go:362: Root dir bytes:  9385
2024/02/29 23:38:59 convert.go:363: Leaves dir bytes:  42486391
2024/02/29 23:38:59 convert.go:364: Num leaf dirs:  3501
2024/02/29 23:38:59 convert.go:365: Total dir bytes:  42495776
2024/02/29 23:38:59 convert.go:366: Average leaf dir bytes:  12135
2024/02/29 23:38:59 convert.go:367: Average bytes per addressed tile: 2.09
2024/02/29 23:41:09 convert.go:340: Finished in  1h0m3.023499209s
./pmtiles convert usgstopo.mbtiles usgstopo.pmtiles  572.24s user 611.80s system 32% cpu 1:00:03.89 total

(base) bvc@bvc gstore % ls -lh usgstopo.pmtiles
-rw-r--r--  1 bvc  staff   278G Feb 29 23:41 usgstopo.pmtiles