Closed jdmcbr closed 6 years ago
@jdmcbr at first read, this seems to me to do too much work per tile. I could be wrong about that. Do you have any benchmarks?
Would you be willing to consider either checking the image bytes at https://github.com/mapbox/rio-mbtiles/blob/master/mbtiles/__init__.py#L59 to see whether the tiles are empty (it seems like all-0 JPEGs or PNGs should have a signature we could use), or reading the mask of the vsimem temp file to see whether it has any valid pixels?
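The mask-style check could look roughly like the sketch below. This is a simplified illustration, not rio-mbtiles' actual API: `tile_is_empty` is a name I made up, and in the real code the array would come from reading the vsimem temp file back with rasterio rather than being passed in directly.

```python
import numpy as np

def tile_is_empty(data, nodata=0):
    """True when every pixel in every band equals the nodata value.

    `data` stands in for the (bands, rows, cols) array read back from
    the vsimem temp file; with rasterio one would instead read the mask
    and test whether any valid pixel remains.
    """
    return bool((np.asarray(data) == nodata).all())
```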
@sgillies My intuition, possibly off base, was that any check that waited until after the `reproject` call would likely have minimal value in decreasing run time. I didn't test that, though, and certainly could try. I agree that the work per tile is too high as written now, which was the main motivation for checking in to see whether you had ideas about where this check might better go.
All my benchmarking experience is basically with ipython tools, so for the rough results I reported above, I just used the unix `time` utility.
Evidently I forgot to run the test suite after making this change. I'm not seeing at the moment what the issue is; running `rio mbtiles` on the image in the test data folder produces fine-looking results, including a tile image at the location of the tile passed in for the test. And all the test files I ran this on were produced appropriately. I'll look at this again tomorrow.
Anyway, this is a fairly small change from my original version, compared to the others you suggested. I couldn't see how to read the vsimem temp file to get the answer without going through steps similar to the approach I started with. This small change was enough to get it to do pretty well over the tests I ran, though I'm sure there are test datasets out there for which the version in this branch would do less well compared to master. For my test images (the ones labeled 80% nodata raster below), this branch did at worst about as well as master, and at best (for one of the files that really motivated me to look at this), it shaved ~30% off run time. For that raster, the lower zoom levels didn't actually include many tiles with nodata, so the speeds were fairly comparable, but once I got to zoom 17 and especially 18, many of the tiles were nodata, and the skip started to produce noticeable gains. Consistent with that, another raster had ~50% nodata, and I actually got a ~40% reduction in run time because I went over a wider range of zoom levels (down to 19).
```
master, 0% nodata raster, zoom 10..20
real	0m24.095s
user	0m28.818s
sys	0m1.004s

real	0m24.337s
user	0m28.973s
sys	0m1.088s

skip-v2, 0% nodata raster, zoom 10..20
real	0m23.988s
user	0m29.300s
sys	0m1.008s

real	0m24.026s
user	0m29.923s
sys	0m1.125s
-------------------------------------------------
master, 80% nodata raster, zoom 12..14
real	0m35.410s
user	1m39.214s
sys	0m2.562s

skip-v2, 80% nodata raster, zoom 12..14
real	0m28.367s
user	1m21.301s
sys	0m1.813s
-------------------------------------------------
master, 80% nodata raster, zoom 12..15
real	1m7.456s
user	3m15.152s
sys	0m4.691s

skip-v2, 80% nodata raster, zoom 12..15
real	0m59.370s
user	2m51.536s
sys	0m3.985s
-------------------------------------------------
master, 80% nodata raster, zoom 12..16
real	2m18.094s
user	6m37.580s
sys	0m12.858s

skip-v2, 80% nodata raster, zoom 12..16
real	2m12.722s
user	6m21.153s
sys	0m11.699s
-------------------------------------------------
master, 80% nodata raster, zoom 12..17
real	5m36.701s
user	14m46.881s
sys	0m45.438s

skip-v2, 80% nodata raster, zoom 12..17
real	4m45.447s
user	13m26.167s
sys	0m37.342s
-------------------------------------------------
master, 80% nodata raster, zoom 12..18
real	17m55.388s
user	35m10.248s
sys	2m50.693s

skip-v2, 80% nodata raster, zoom 12..18
real	11m44.133s
user	32m10.696s
sys	2m22.486s
-------------------------------------------------
master, 50% nodata raster, zoom 10..19
real	0m22.587s
user	0m22.241s
sys	0m1.013s

real	0m22.699s
user	0m22.238s
sys	0m1.060s

skip-v2, 50% nodata raster, zoom 10..19
real	0m13.593s
user	0m21.957s
sys	0m0.913s

real	0m13.679s
user	0m21.750s
sys	0m0.828s
```
@sgillies Apart from the tile size discussion, I'd appreciate a sanity check on one thing before merging. For `test_process_tile`, you made a change in #36 to the number of expected output tiles, from 6 to 5, based on the nodata check. I did not make such a change here, because this approach still generates 6 tiles in that test. It seemed correct that there should be windows at the edges that have data but produce all-nodata tiles from the `reproject` call.

Once I was convinced that I wasn't missing anything, I also wanted to add a second test like `test_process_tile`, with an additional zoom level for the RGB data set, in which some of the tiles are skipped.
@sgillies As it stands, this change requires rasterio>=1.0a10, due to the change in the `boundless` handling in `windows.from_bounds`. Should I update the requirements file? Or should this be modified to work with older rasterio versions while `boundless` is phased out?
Update the requirement, please.
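Concretely, that's just a one-line bump in the requirements file:

```
rasterio>=1.0a10
```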
@sgillies Thanks for all your time on this! From my perspective, this looks ready to go.
@sgillies The build failure indicates this might not be ready after all. The Python 3 build runs fine, but in Python 2, the test I added takes 10+ minutes to run, eventually triggering a failure. I cannot reproduce this in Python 2 on my machine, and I can't see any plausible explanation for why this test would fail. Do you have any ideas?
I seem to be stuck trying to reproduce the Python 2 error in the test I added. I've got a different version of Ubuntu and a slightly different Python version (2.7.14 vs 2.7.13), but all packages are the same (aside from numpy, whose version wasn't listed in the Travis build, though I tried a few recent versions without getting a failure). There's no local resource required, or anything else that seems like it might cause the process to hang indefinitely, outside of the code not working. I just tried another test (perhaps better than the original in some sense anyway, since it covers a tif without any valid data), and got the same behavior. I'm out of ideas that don't involve logging into the Travis build or using it in debug mode, which are tools I don't think I have access to.
On Travis, I count 11 dots:

```
$ py.test --cov mbtiles --cov-report term-missing
Test data present.
============================= test session starts ==============================
platform linux2 -- Python 2.7.13, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /home/travis/build/mapbox/rio-mbtiles, inifile:
plugins: cov-2.5.1
collected 13 items

tests/test_cli.py ...........

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received
```
On my laptop with `-v` it looks like `tests/test_cli.py::test_skip_empty` is the one that hangs.
```
platform darwin -- Python 3.5.4, pytest-3.2.2, py-1.4.34, pluggy-0.4.0 -- /Users/sean/envs/pydotorg35/bin/python
cachedir: .cache
rootdir: /Users/sean/code/rio-mbtiles, inifile:
plugins: logging-2015.11.4, cov-2.5.1, hypothesis-3.30.3
collected 13 items

tests/test_cli.py::test_cli_help PASSED
tests/test_cli.py::test_nodata_validation PASSED
tests/test_cli.py::test_export_metadata PASSED
tests/test_cli.py::test_export_overwrite PASSED
tests/test_cli.py::test_export_metadata_output_opt PASSED
tests/test_cli.py::test_export_tiles PASSED
tests/test_cli.py::test_export_zoom PASSED
tests/test_cli.py::test_export_jobs PASSED
tests/test_cli.py::test_export_src_nodata PASSED
tests/test_cli.py::test_export_dump PASSED
tests/test_cli.py::test_export_bilinear PASSED
tests/test_cli.py::test_skip_empty PASSED
tests/test_mod.py::test_process_tile PASSED
```
I can make a few minutes to look into this today, but not much more.
@jdmcbr I restarted the build to see what would happen. This time it returned, but with this error: https://travis-ci.org/mapbox/rio-mbtiles/jobs/311680662#L561.
Google is full of hits for `OSError(12, 'Cannot allocate memory')`. I wonder if we have a memory leak, but only on Python 2.7?
Oh yes, sorry, I should have added that I was confident it was the test I'd added that was hanging. I didn't notice that the latest version of the test I'd added had actually failed last night with the same error as when you just restarted the build (I saw the build at 5+ minutes with no return, then saw it had failed a few minutes later, and incorrectly assumed it was the same timeout). I suppose this is progress of some sort, though I'm not sure what to make of the memory allocation error either. I appreciate your time on this, and understand that with everything you're working on this isn't the highest priority!
I did a very basic check on memory usage with `psutil` while running the tests. I didn't see any weird memory behavior in either 3.6 or 2.7; all forms of memory were pretty stable, as would be expected.
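For the record, the actual check used psutil, but the same idea can be sketched with only the stdlib (Unix-only, via `resource`): sample peak resident memory before and after the work and compare across runs. The workload below is a placeholder, not rio-mbtiles code.

```python
import resource

def peak_rss():
    # Peak resident set size of this process so far; reported in
    # kilobytes on Linux and in bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
_work = [bytearray(1024) for _ in range(1000)]  # stand-in for tile processing
after = peak_rss()
# With a leak, `after` would keep climbing run over run; ru_maxrss is
# monotonic within a process, so after >= before always holds.
```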
@jdmcbr I searched the Python tracker for sqlite3 issues + Python 2.7 and turned up one that sounds related. At the very least, it has a workaround for a problem with libsqlite3 on OS X that we may want to add: https://bugs.python.org/issue20353#msg266397. Could you give that a read and see if you see any similarities to our problem?
@sgillies Thanks for the suggestion! Given that the issue sounded specific to OS X, and the build runs on Ubuntu, I wasn't very optimistic that this would help, but I gave it a shot (I think I put the suggested fix in the right place...). Interestingly, this did result in a change: now the build fails at `test_export_zoom`, rather than `test_skip_empty`, and it is back to timing out. I'm still not sure what to make of this, but at least it's another avenue to investigate, so I'll give it some more time later.
@jdmcbr okay, I think that was a red herring.
Check this out: https://travis-ci.org/mapbox/rio-mbtiles/builds. Our master is broken now with no changes other than the addition of mercantile. I think this is only about Travis's Python 2.7.
💩 💩 💩 💩 💩 😂
@sgillies Uf! I should have looked at the build for that before, but didn't even think about it since it was just a requirements change.
@sgillies I'm guessing there's no desire to make any moves on this, or #33, until the Travis build issues are settled? I've spent a little time looking at the build logs without finding anything remotely interesting.
@jdmcbr it looks to me like there was an unacknowledged bug in Python 2.7.14. Your PR passes with 2.7.13 and 2.7.15. This will be included in 1.4.0.
Interesting. Glad to have it incorporated now!
This is a work in progress, but I've reached a point where I think feedback would be very useful.

I regularly convert rasters that have significant areas of nodata into mbtiles, which motivates my desire to reduce the time spent processing nodata regions. The change submitted here works well enough in terms of producing the right end results, at least on all the cases I've tested: it successfully skips regions of nodata while producing an mbtiles file that looks the same. For tifs without any area of nodata, this appears to be 10-20% slower, with the `real` time difference somewhat smaller than the `user` time difference. As areas of nodata become more significant, this PR starts to become faster, with a turnover point around 30-40% nodata, unless the `--image-dump` option is set, in which case the turnover is earlier. It also depends somewhat on the `--zoom-level` range.

The check for nodata in a tile currently happens outside the `process_tile` call. This leads to some duplicated logic in getting tile bounds. I wasn't sure of a clean way to skip processing a tile after entering `process_tile`, whereas putting the check outside the call made it easy to skip over, but maybe I'm missing something obvious? I also wondered whether there might be a faster method of checking whether there is data within specified bounds of a raster.

I expect any implementation of this will probably be a little slower on tifs without nodata anywhere. Would an extra argument to enable this logic be reasonable, so the user can decide whether to skip processing nodata regions? On the other hand, it still seems possible that a different implementation would make the speed difference negligible on rasters without any nodata, and it is obviously desirable to avoid adding an argument whose purpose would be obscure to most users, and for which the fastest setting would probably not be obvious to anyone.
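The bounds check described above can be sketched roughly as below. This is a simplified stand-in, not the PR's actual code: `window_has_data` is a name I made up, `dataset_mask` plays the role of rasterio's `DatasetReader.dataset_mask()` (0 = nodata, 255 = valid), and the window offsets would in practice come from `rasterio.windows.from_bounds` for the tile's bounds.

```python
import numpy as np

def window_has_data(dataset_mask, row_off, col_off, height, width):
    """Return True if any valid pixel falls inside the window.

    Windows that extend past the raster edge are clipped first, so a
    tile lying entirely outside the raster counts as empty and can be
    skipped before any reprojection work is done.
    """
    rows, cols = dataset_mask.shape
    r0, r1 = max(0, row_off), min(rows, row_off + height)
    c0, c1 = max(0, col_off), min(cols, col_off + width)
    if r0 >= r1 or c0 >= c1:
        return False  # window entirely off the raster
    return bool(dataset_mask[r0:r1, c0:c1].any())
```

Reading only the mask for the window, rather than the full pixel data, is what keeps this cheaper than doing the reprojection and inspecting the result.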
though, whereas putting it outside the call made it easy to skip over, but maybe I'm missing something obvious? I also wondered whether there might be a faster method of checking whether there is data within specified bounds of a raster.I expect any implementation of this is probably going to be a little slower on tifs without nodata anywhere. Would an extra argument to enable this logic be reasonable, such that the user can decide whether to skip processing nodata regions? On the other hand, it still seems possible that implementing this differently would make the speed difference negligible on a raster that doesn't have any region of nodata, and it is obviously desirable to avoid adding an argument whose purpose would be obscure to most users, and for which the fastest setting would probably not be obvious to anyone.