mapbox / rio-mbtiles

A plugin command for the Rasterio CLI that exports a raster dataset to an MBTiles 1.1 SQLite file
MIT License

skip processing empty tiles #34

Closed. jdmcbr closed this 6 years ago.

jdmcbr commented 7 years ago

This is a work in progress, but I've reached a point where I think feedback would be very useful.

I am regularly converting rasters that have significant areas of nodata into mbtiles, which motivates my desire to reduce the time spent processing nodata regions. The change submitted here works well in terms of producing the right end results, at least in every case I've tested: it successfully skips over regions of nodata while producing an mbtiles file that looks the same. For tifs without any nodata, this appears to be 10-20% slower, with the real time difference somewhat smaller than the user time difference. As areas of nodata become more significant, this PR starts to become faster, with a turnover point around 30-40% nodata, unless the --image-dump option is set, in which case the turnover comes earlier. It also depends somewhat on the --zoom-levels range.

The check for nodata in a tile currently happens outside the process_tile call, which leads to some duplicated logic for computing tile bounds. I wasn't sure of a clean way to skip processing a tile after entering process_tile, whereas putting the check outside the call made it easy to skip over; maybe I'm missing something obvious? I also wondered whether there might be a faster method of checking whether there is any data within specified bounds of a raster.
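
To illustrate the kind of check I mean, here is a simplified sketch rather than the actual diff. It assumes rasterio >= 1.0 and mercantile, and tile_is_empty is just an illustrative name:

import mercantile
from rasterio.warp import transform_bounds
from rasterio.windows import from_bounds

def tile_is_empty(src, tile):
    """Return True if the source raster has no valid pixels under this tile."""
    # Tile bounds in web mercator, reprojected to the source CRS.
    left, bottom, right, top = mercantile.xy_bounds(tile)
    bounds = transform_bounds("EPSG:3857", src.crs, left, bottom, right, top)
    window = from_bounds(*bounds, transform=src.transform)
    # A mask that is zero everywhere means the window is all nodata.
    mask = src.read_masks(1, window=window, boundless=True)
    return not mask.any()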

I expect any implementation of this is probably going to be a little slower on tifs without nodata anywhere. Would an extra argument to enable this logic be reasonable, so that the user can decide whether to skip processing nodata regions? On the other hand, it still seems possible that implementing this differently would make the speed difference negligible on a raster without any nodata, and it is obviously desirable to avoid adding an argument whose purpose would be obscure to most users, and whose fastest setting would probably not be obvious to anyone.

sgillies commented 7 years ago

@jdmcbr at first read, this seems to me to do too much work per tile. I could be wrong about that. Do you have any benchmarks?

Would you be willing to consider checking the image bytes at https://github.com/mapbox/rio-mbtiles/blob/master/mbtiles/__init__.py#L59 to see if the tiles are empty? It seems like all-0 JPEGs or PNGs should have a signature we could use.

It would also be possible to read the mask of the vsimem temp file and see if that has any valid pixels.
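
Something like this rough sketch, using rasterio's MemoryFile as a stand-in for the vsimem temp file (illustrative only, not code from this PR):

from rasterio.io import MemoryFile

def has_valid_pixels(img_bytes):
    """Return True if the encoded tile contains any valid (unmasked) pixels."""
    with MemoryFile(img_bytes) as memfile:
        with memfile.open() as tile_src:
            # dataset_mask() combines the band masks; all zeros means empty.
            return tile_src.dataset_mask().any()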

jdmcbr commented 7 years ago

@sgillies My intuition, possibly off base, was that any check that waited until after the reproject call would likely have minimal value in decreasing run time. I didn't test that, though, and certainly could try. I agree that the work per tile is too high as written now, which was the main motivation for checking in to see if you had ideas about where this check might better go.

All my benchmarking experience is basically with ipython tools, so for the rough results I reported above I was just using the unix time utility.
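
That is, timing full CLI runs along these lines (hypothetical paths; the --zoom-levels syntax is assumed from the benchmark labels below):

$ time rio mbtiles input.tif output.mbtiles --zoom-levels 10..20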

jdmcbr commented 7 years ago

Evidently I forgot to run the test suite after making this change. I'm not seeing at the moment what the issue is; running rio mbtiles on the image in the test data folder produces fine-looking results, including a tile image at the location of the tile passed in for the test. And all the test files I ran this on were produced appropriately. I'll look at this again tomorrow.

Anyway, this is a fairly small change from my original version, compared to the others you suggested. I couldn't see how to read the vsimem temp file to get the answer without going through steps similar to the approach I started with. This small change was enough for it to do pretty well across the tests I ran, though I'm sure there are test datasets out there on which this branch would do worse than master. For my test images, this branch did at worst about as well as master, and at best (for one of the files that really motivated me to look at this, the one labeled 80% nodata below) it shaved ~30% off the run time. For that raster, the lower zoom levels didn't actually include many all-nodata tiles, so the speeds were fairly comparable, but once I got to zoom 17 and especially 18, many of the tiles were all nodata and the skip started to produce noticeable gains. Consistent with that, another raster had ~50% nodata, and I actually got a ~40% reduction in run time because I went over a wider range of zoom levels (down to 19).

master, 0% nodata raster, zoom 10..20
real    0m24.095s
user    0m28.818s
sys     0m1.004s

real    0m24.337s
user    0m28.973s
sys     0m1.088s

skip-v2, 0% nodata raster, zoom 10..20
real    0m23.988s
user    0m29.300s
sys     0m1.008s

real    0m24.026s
user    0m29.923s
sys     0m1.125s

-------------------------------------------------

master, 80% nodata raster, zoom 12..14
real    0m35.410s
user    1m39.214s
sys     0m2.562s

skip-v2, 80% nodata raster, zoom 12..14
real    0m28.367s
user    1m21.301s
sys     0m1.813s

-------------------------------------------------

master, 80% nodata raster, zoom 12..15
real    1m7.456s
user    3m15.152s
sys     0m4.691s

skip-v2, 80% nodata raster, zoom 12..15
real    0m59.370s
user    2m51.536s
sys     0m3.985s

-------------------------------------------------

master, 80% nodata raster, zoom 12..16
real    2m18.094s
user    6m37.580s
sys     0m12.858s

skip-v2, 80% nodata raster, zoom 12..16
real    2m12.722s
user    6m21.153s
sys     0m11.699s

-------------------------------------------------

master, 80% nodata raster, zoom 12..17
real    5m36.701s
user    14m46.881s
sys     0m45.438s

skip-v2, 80% nodata raster, zoom 12..17
real    4m45.447s
user    13m26.167s
sys     0m37.342s

-------------------------------------------------

master, 80% nodata raster, zoom 12..18
real    17m55.388s
user    35m10.248s
sys     2m50.693s

skip-v2, 80% nodata raster, zoom 12..18
real    11m44.133s
user    32m10.696s
sys     2m22.486s

-------------------------------------------------

master, 50% nodata raster, zoom 10..19
real    0m22.587s
user    0m22.241s
sys     0m1.013s

real    0m22.699s
user    0m22.238s
sys     0m1.060s

skip-v2, 50% nodata raster, zoom 10..19
real    0m13.593s
user    0m21.957s
sys     0m0.913s

real    0m13.679s
user    0m21.750s
sys     0m0.828s

jdmcbr commented 6 years ago

@sgillies Apart from the tile size discussion, I'd appreciate a sanity check on one thing before merging. For test_process_tile, you made a change in #36 to the number of expected output tiles, from 6 to 5, based on the nodata check. I did not make that change here, because this approach still generates 6 tiles in that test. It seemed like it was actually correct that there should be windows at the edges that have data, but would be all nodata in the tile generated from the reproject call.

Once I'm convinced that I'm not missing anything, I'd also like to add a second test like test_process_tile, with an additional zoom level for the RGB data set, in which some of the tiles are skipped; see the sketch below.
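
Roughly along these lines; this is an illustrative sketch rather than the final test, with the import path assumed from the package layout and the data path, zoom range, and assertion as placeholders:

import sqlite3
from click.testing import CliRunner
from mbtiles.scripts.cli import mbtiles

def test_skip_empty(tmpdir):
    outputfile = str(tmpdir.join("export.mbtiles"))
    result = CliRunner().invoke(
        mbtiles,
        ["tests/data/RGB.byte.tif", outputfile, "--zoom-levels", "4..7"])
    assert result.exit_code == 0
    # All-nodata tiles should be skipped, but the file should still
    # contain tiles wherever there is valid data.
    with sqlite3.connect(outputfile) as conn:
        count = conn.execute("select count(*) from tiles").fetchone()[0]
    assert count > 0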

jdmcbr commented 6 years ago

@sgillies As it stands, this change requires rasterio>=1.0a10, due to the change in the boundless handling in windows.from_bounds. Should I update the requirements file? Or should this be modified to work with older rasterio versions while boundless is phased out?
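
Concretely, updating the pin would presumably mean something like this line in requirements.txt:

rasterio>=1.0a10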

sgillies commented 6 years ago

Update the requirement, please.

jdmcbr commented 6 years ago

@sgillies Thanks for all your time on this! From my perspective, this looks ready to go.

jdmcbr commented 6 years ago

@sgillies The build failure indicates this might not be ready after all. The Python 3 build runs fine, but on Python 2 the test I added takes 10+ minutes to run, eventually triggering a failure. I cannot reproduce this on Python 2 on my machine, and I cannot see any plausible explanation for why this test would fail. Do you have any ideas?

jdmcbr commented 6 years ago

I seem to be stuck trying to reproduce the Python 2 error in the test I added. I've got a different version of Ubuntu and a slightly different Python version (2.7.14 vs 2.7.13), but all packages are the same (aside from numpy, whose version wasn't listed in the Travis build, though I tried a few recent versions without getting a failure). There's no local resource required, nor anything else that seems like it might cause this process to hang indefinitely, other than the code simply not working. I just tried another test (perhaps better than the original in some sense anyway, since it exercises a tif without any valid data) and am getting the same behavior. I'm out of ideas that don't involve logging into the Travis build or using it in debug mode, which are tools I don't think I have access to.

sgillies commented 6 years ago

On Travis, I count 11 dots:

$ py.test --cov mbtiles --cov-report term-missing
Test data present.
============================= test session starts ==============================
platform linux2 -- Python 2.7.13, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/travis/build/mapbox/rio-mbtiles, inifile:
plugins: cov-2.5.1
collected 13 items                                                              
tests/test_cli.py ...........
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

On my laptop with -v (below), the twelfth test collected is tests/test_cli.py::test_skip_empty, so with 11 dots passing on Travis, that looks like the one that hangs.

platform darwin -- Python 3.5.4, pytest-3.2.2, py-1.4.34, pluggy-0.4.0 -- /Users/sean/envs/pydotorg35/bin/python
cachedir: .cache
rootdir: /Users/sean/code/rio-mbtiles, inifile:
plugins: logging-2015.11.4, cov-2.5.1, hypothesis-3.30.3
collected 13 items

tests/test_cli.py::test_cli_help PASSED
tests/test_cli.py::test_nodata_validation PASSED
tests/test_cli.py::test_export_metadata PASSED
tests/test_cli.py::test_export_overwrite PASSED
tests/test_cli.py::test_export_metadata_output_opt PASSED
tests/test_cli.py::test_export_tiles PASSED
tests/test_cli.py::test_export_zoom PASSED
tests/test_cli.py::test_export_jobs PASSED
tests/test_cli.py::test_export_src_nodata PASSED
tests/test_cli.py::test_export_dump PASSED
tests/test_cli.py::test_export_bilinear PASSED
tests/test_cli.py::test_skip_empty PASSED
tests/test_mod.py::test_process_tile PASSED

I can make a few minutes to look into this today, but not much more.

sgillies commented 6 years ago

@jdmcbr I restarted the build to see what would happen. This time it returned, but with this error: https://travis-ci.org/mapbox/rio-mbtiles/jobs/311680662#L561.

The Google is full of hits for OSError(12, 'Cannot allocate memory').

I wonder if we have a memory leak, but only on Python 2.7?

jdmcbr commented 6 years ago

Oh yes, sorry, I should have mentioned that I was confident it was the test I'd added that was hanging. I didn't notice that the latest version of that test had actually failed last night with the same error you got when you just restarted the build (I saw the build at 5+ minutes with no return, then saw it had failed a few minutes later, and incorrectly assumed it was the same timeout). I suppose this is progress of some sort, though I'm not sure what to make of the memory allocation error either. I appreciate your time on this, and understand that with everything you're working on this isn't the highest priority!

jdmcbr commented 6 years ago

I did a very basic check on memory usage with psutil while running the tests. I didn't see any weird memory behavior on either 3.6 or 2.7; all forms of memory were pretty stable, as would be expected.
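
The check was roughly along these lines (a sketch, not the exact commands I ran):

import os
import psutil

mem = psutil.Process(os.getpid()).memory_info()
# rss is resident memory, vms is virtual; both stayed flat across the tests.
print("rss %.1f MB, vms %.1f MB" % (mem.rss / 1e6, mem.vms / 1e6))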

sgillies commented 6 years ago

@jdmcbr I searched the Python tracker for sqlite3 issues + Python 2.7 and turned up one that sounds related. At the very least, it has a workaround for a problem with libsqlite3 on OS X that we may want to add: https://bugs.python.org/issue20353#msg266397. Could you give that a read and see if you see any similarities to our problem?

jdmcbr commented 6 years ago

@sgillies Thanks for the suggestion! Given that the issue sounded specific to OS X, while the build is happening on Ubuntu, I wasn't very optimistic this would help, but I gave it a shot (I think I put the suggested fix in the right place...). Interestingly, it did result in a change: the build now fails at test_export_zoom rather than test_skip_empty, and it is back to timing out. I'm still not sure what to make of this, but at least it's a new avenue to investigate, so I'll give it some more time later.

sgillies commented 6 years ago

@jdmcbr okay, I think that was a red herring.

Check this out: https://travis-ci.org/mapbox/rio-mbtiles/builds. Our master is broken now with no changes other than the addition of mercantile. I think this is only about Travis's Python 2.7.

💩 💩 💩 💩 💩 😂

jdmcbr commented 6 years ago

@sgillies Uf! I should have looked at the build for that before, but didn't even think about it since it was just a requirements change.

jdmcbr commented 6 years ago

@sgillies I'm guessing there's no desire to make any moves on this, or #33, until the Travis build issues are settled? I've spent a little time looking at the build logs without finding anything remotely interesting.

sgillies commented 6 years ago

@jdmcbr it looks to me like there was an unacknowledged bug in Python 2.7.14. Your PR passes with 2.7.13 and 2.7.15. This will be included in 1.4.0.

jdmcbr commented 6 years ago

Interesting. Glad to have it incorporated now!