perrygeo / python-rasterstats

Summary statistics of geospatial raster datasets based on vector geometries.
BSD 3-Clause "New" or "Revised" License

Segmentation Fault (not necessarily caused by memory usage) #286

Open adamleemiller opened 1 year ago

adamleemiller commented 1 year ago

Describe the bug Presumably on large datasets, my script is segfaulting. Initially we saw the script killed by the OOM killer due to memory consumption, so we increased physical memory to 32GB and added a 10GB swap file. The script then ran successfully for some data sources but is now segfaulting again. The current TIFF being parsed is 4.38GB in size. There is only one polygon, as we are pulling stats for an entire farm field of approximately 82 acres.

Machine Type: VPS
Memory Allocation: 32GB
SWAP File Allocation: 10GB
Operating System: Ubuntu 22.04.2
Python Version: 3.10.6
PIP Version: 22.0.2
Virtual Environment: Yes
GDAL Version: 3.4.3
rasterstats Version: 0.18.0
Other Packages:

zonal_stats(filepath_shapefile, filename_image, nodata=255, band=band["index"], stats=["min", "max", "mean", "count", "sum", "std", "median", "majority", "minority", "unique", "range", "nodata", "percentile_25", "percentile_75"])

Actual Error / Output

Fatal Python error: Segmentation fault

Current thread 0x00007f1b3793d000 (most recent call first):
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/io.py", line 356 in read
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 175 in gen_zonal_stats
  File "/data/zonal/venv/lib/python3.10/site-packages/rasterstats/main.py", line 40 in zonal_stats
  File "/data/zonal/main.py", line 287 in do_get_statistics
  File "/data/zonal/main.py", line 130 in main
  File "/data/zonal/main.py", line 354 in <module>

Extension modules: _mysql_connector, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyproj._compat, pyproj._datadir, pyproj._network, pyproj._geod, pyproj.list, pyproj._crs, pyproj._transformer, pyproj.database, pyproj._sync, shapely.lib, shapely._geos, shapely._geometry_helpers, osgeo._gdal, osgeo._gdalconst, osgeo._ogr, osgeo._osr, fiona._err, fiona._geometry, fiona._shim, fiona._env, fiona.schema, fiona.ogrext, fiona._crs, rasterio._version, rasterio._err, rasterio._filepath, rasterio._env, rasterio._transform, rasterio._base, rasterio.crs, rasterio._features, rasterio._warp, rasterio._io, simplejson._speedups (total: 89)
Segmentation fault (core dumped)

The script downloads the files from our storage provider, then grabs the number of bands and their descriptions, and then loops through the bands, grabbing the stats for each band using the same shapefile. The shapefile was created earlier from a GeoJSON object using geopandas:

        geojson = gpd.read_file(filepath_geojson)
        # set_crs returns a new GeoDataFrame rather than modifying in
        # place, so the result must be assigned back
        geojson = geojson.set_crs("EPSG:4326", allow_override=True)
        reprojected = geojson.to_crs(3857)
        reprojected.to_file(filename)

It is possible that the coordinate reference system (CRS) is incorrect; however, I have not had an issue with other datasets using this same code (different fields, but the same GeoJSON workflow).
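To rule that out, the raster CRS and vector CRS can be compared directly with pyproj (already loaded per the extension-module list above). A minimal sketch with the EPSG codes hard-coded for illustration; in the real script the two values would come from the Rasterio dataset and the GeoDataFrame:

```python
# Compare the raster CRS and the vector CRS before running zonal_stats;
# a mismatch makes the polygon miss the raster entirely, or cover a
# huge pixel window, either of which can look like a memory problem.
from pyproj import CRS

raster_crs = CRS.from_epsg(3857)               # e.g. from rasterio src.crs
vector_crs = CRS.from_user_input("EPSG:3857")  # e.g. from geopandas gdf.crs
print(raster_crs == vector_crs)
```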

I am still new to Python, but I have 20 years of development experience in other languages, so I am not new to programming. If anyone can lend a hand as to where to go from here, I would appreciate it. I need to get this working flawlessly because I have nearly 2000 datasets that need to be processed in our production environment.

Thank you.

perrygeo commented 1 year ago

One important note is that it's (practically) impossible to crash the Python interpreter without calling into C code. In other words, pure Python code will never crash, but C extensions definitely can! In this case, the segfault happens when reading the raster dataset filename_image.

Line 356 of io.py is calling the Rasterio dataset's read method, which itself calls into the shared C library libgdal. I would put a debug breakpoint right before it (import pdb; pdb.set_trace()) and see exactly what parameters are being passed. That should allow you to reproduce the segfault using a direct Rasterio function call (rather than an indirect call via zonal stats).

Some possibilities:

  1. Your Rasterio and/or GDAL installation is broken in some way - libgdal is a dynamically linked C library, so anything is possible (i.e. it could be a bug in your system setup)
  2. Rasterio/libgdal is legitimately segfaulting on your input data (i.e. it could be a bug in GDAL)

While it's theoretically possible that this library could be passing the wrong inputs to rasterio/libgdal, it's pure Python code, so it won't segfault on its own. In other words, if there were a bug in this library, it would manifest as incorrect results or exceptions, never a segfault. For a segfault, you need to go down a layer to the C dependencies.
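As a practical aid when digging into that layer: the Python-level traceback shown above is the kind produced by the interpreter's built-in faulthandler module, and enabling it explicitly near the top of the script guarantees such a traceback on the next crash. A stdlib-only sketch:

```python
# Enable Python's built-in segfault handler so a crash inside a C
# extension (libgdal etc.) dumps the Python-level traceback to stderr
# instead of the process dying silently.
import faulthandler

faulthandler.enable()
print(faulthandler.is_enabled())  # True

# ... then run zonal_stats(...) as usual; on SIGSEGV the traceback of
# every thread is written to stderr before the process exits.
```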

adamleemiller commented 1 year ago


Thank you for getting back to me, and I apologize for the delayed response. While working on the script locally, I tried adding a break / pause using PyCharm, but it would always fail. I then added an exit() and a print(self.band) above line 356, but the print output was empty. Next, using the Rasterio command line interface (rio), I tried running "rio zonalstats --raster vari-cog.tiff", but after about 4 hours it still had not finished, so I killed it. What's weird is that while running the Python script, it finally showed an error when it segfaulted: out of memory. I am not sure how that can be, considering the file is 3GB in size and this computer (macOS / Mac Pro trashcan) has 128GB of physical memory. Plus, the servers I have been running the script on to generate stats for our current dataset have 30GB of memory plus a 10GB swap file.

I am not really sure where to go from here, but I really need to figure this out. It would be better if I could simply trap the segfault and skip the offending file so the script continues instead of dying, but as far as I know that isn't possible from within the process.
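One common pattern that does give that skip-and-continue behavior is isolating each file in a child process, so a segfault kills only the child while the parent logs it and moves on. A stdlib-only sketch; process_one and run_isolated are placeholder names, with process_one standing in for the real per-file zonal_stats call:

```python
# Run each file's stats in a child process; a segfault in libgdal then
# only takes down the child, which the parent detects via exitcode.
import multiprocessing as mp

def process_one(path, queue):
    # Placeholder for the real work: zonal_stats(...) on `path`.
    queue.put({"path": path, "ok": True})

def run_isolated(path, timeout=3600):
    """Process one file in a child process; return None if it crashed."""
    queue = mp.Queue()
    proc = mp.Process(target=process_one, args=(path, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():        # timed out: kill the child and skip
        proc.terminate()
        proc.join()
        return None
    if proc.exitcode != 0:     # negative exitcode = killed by a signal,
        return None            # e.g. -11 for SIGSEGV: log and skip
    return queue.get()

if __name__ == "__main__":
    print(run_isolated("vari-cog.tiff"))
```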

When I run gdalinfo on the file, I see the bands:

Driver: GTiff/GeoTIFF
Files: vari-cog.tiff
       vari-cog.tiff.msk
Size is 131072, 131072
Coordinate System is:
PROJCRS["WGS 84 / Pseudo-Mercator",
    BASEGEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["Popular Visualisation Pseudo-Mercator",
        METHOD["Popular Visualisation Pseudo Mercator",
            ID["EPSG",1024]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["False easting",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Web mapping and visualisation."],
        AREA["World between 85.06°S and 85.06°N."],
        BBOX[-85.06,-180,85.06,180]],
    ID["EPSG",3857]]
Data axis to CRS axis mapping: 1,2
Origin = (-10199757.054373919963837,5070526.708325451239944)
Pixel Size = (0.018661383858686,-0.018661383858686)
Metadata:
  AREA_OR_POINT=Area
  TIFFTAG_SOFTWARE=Pix4Dfields 1.9.0 (b7495734c)
Image Structure Metadata:
  COMPRESSION=LZW
  INTERLEAVE=PIXEL
  LAYOUT=COG
Tiling Scheme:
  ALIGNED_LEVELS=10
  NAME=GoogleMapsCompatible
  ZOOM_LEVEL=23
Corner Coordinates:
Upper Left  (-10199757.054, 5070526.708) ( 91d37'33.52"W, 41d23'35.86"N)
Lower Left  (-10199757.054, 5068080.723) ( 91d37'33.52"W, 41d22'36.51"N)
Upper Right (-10197311.069, 5070526.708) ( 91d36'14.41"W, 41d23'35.86"N)
Lower Right (-10197311.069, 5068080.723) ( 91d36'14.41"W, 41d22'36.51"N)
Center      (-10198534.062, 5069303.716) ( 91d36'53.96"W, 41d23' 6.19"N)
Band 1 Block=256x256 Type=Byte, ColorInterp=Red
  Overviews: 65536x65536, 32768x32768, 16384x16384, 8192x8192, 4096x4096, 2048x2048, 1024x1024, 512x512, 256x256
  Mask Flags: PER_DATASET ALPHA
Band 2 Block=256x256 Type=Byte, ColorInterp=Green
  Overviews: 65536x65536, 32768x32768, 16384x16384, 8192x8192, 4096x4096, 2048x2048, 1024x1024, 512x512, 256x256
  Mask Flags: PER_DATASET ALPHA
Band 3 Block=256x256 Type=Byte, ColorInterp=Blue
  Overviews: 65536x65536, 32768x32768, 16384x16384, 8192x8192, 4096x4096, 2048x2048, 1024x1024, 512x512, 256x256
  Mask Flags: PER_DATASET ALPHA
Band 4 Block=256x256 Type=Byte, ColorInterp=Alpha
  Overviews: 65536x65536, 32768x32768, 16384x16384, 8192x8192, 4096x4096, 2048x2048, 1024x1024, 512x512, 256x256
  Mask Flags: PER_DATASET ALPHA

Any additional help is greatly appreciated.

Thank you.