Open yurigba opened 3 years ago
Curiously, reading the file with the GDAL library in pyspark works:
>>> import gdal
>>> from osgeo import gdal, gdalconst
>>> ds = gdal.Open("/vsihdfs/hdfs://DEVEL/user/884.tif", gdalconst.GA_ReadOnly)
20/11/26 23:49:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/26 23:49:22 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
>>> print(ds.GetMetadata())
{'AREA_OR_POINT': 'Area'}
Looking at the Scala code, there may be something related to this: in spark-shell, GDALRasterSource.hasGDAL returns true in the cluster, yet the function gdalOnly may be returning false when parsing the URI. The function returns true when given a valid java.net.URI with a .jp2 extension, but false when given something else:
scala> RFRasterSource.IsGDAL.gdalOnly(URI.create("a.jp2"))
Boolean = true
scala> RFRasterSource.IsGDAL.gdalOnly(URI.create("a"))
Boolean = false
Considering this, the function does not directly check whether GDAL is present, but whether the path is GDAL-readable. This is consistent with the error received when a JP2 file is passed to spark.read.raster.
My (un)educated guess is that gdalOnly is receiving a URI in a form it does not expect.
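To make the observation concrete, here is a minimal Python sketch of an extension-based check that behaves like the gdalOnly calls shown above. It is not the RasterFrames implementation: the function name looks_gdal_only and the extension set are hypothetical, chosen only to reproduce the two results from the spark-shell session.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension set; the list RasterFrames actually uses may differ.
GDAL_ONLY_EXTENSIONS = {".jp2"}


def looks_gdal_only(uri: str) -> bool:
    """Return True when the URI path ends in an extension assumed to need GDAL.

    Note that this inspects only the path of the URI. It says nothing about
    whether GDAL is installed or whether the file can actually be opened.
    """
    path = urlparse(uri).path
    _, ext = posixpath.splitext(path)
    return ext.lower() in GDAL_ONLY_EXTENSIONS


print(looks_gdal_only("a.jp2"))  # same input as the first spark-shell call
print(looks_gdal_only("a"))      # same input as the second spark-shell call
```

A check of this kind depends entirely on the shape of the path it receives, so a URI that arrives with an unexpected scheme or without its extension would be classified differently from what the caller intended.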
Solved by uploading libhdfs.so.0.0.0 with the parameter --files=/usr/hdp/3.1.0.0-78/usr/lib/libhdfs.so.0.0.0 when launching pyspark.
Since this is a very specific yet very important issue, it would be a good idea to check whether this library is loaded when reading from HDFS with GDAL.
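One way such a check could look from the Python side is sketched below, using only the standard library. The helper name can_load_shared_library is hypothetical, and the sketch only tests whether the dynamic loader can open the library in the current process. It does not prove that GDAL's /vsihdfs/ handler will find it. The relative path in the usage line assumes that files shipped with --files land in the working directory of the process.

```python
import ctypes
import ctypes.util
import os


def can_load_shared_library(name_or_path: str) -> bool:
    """Return True if the dynamic loader can open the given shared library.

    Accepts either a bare library name (e.g. "hdfs", resolved through the
    system search path) or an explicit file path.
    """
    if os.sep in name_or_path:
        # Explicit path: try it as given.
        candidate = name_or_path
    else:
        # Bare name: ask the system where the library lives, falling back
        # to the raw name if nothing is found.
        candidate = ctypes.util.find_library(name_or_path) or name_or_path
    try:
        ctypes.CDLL(candidate)
        return True
    except OSError:
        return False


# Assumed location of the file shipped with --files (working directory).
if not can_load_shared_library("./libhdfs.so.0.0.0"):
    print("libhdfs could not be loaded; GDAL reads from HDFS may fail")
```

Running the same function inside a Spark task would report the state on the executors rather than on the driver, which is where the read actually happens in cluster mode.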
I am using pyrasterframes in cluster mode, invoking the pyspark shell as specified here:
The tiff file used is from GLAD and can be downloaded freely here, but registration is required.
After loading the pyspark shell, we run the following commands:
We see that reading tiff files using HadoopGeoTiffRasterSource works. However, the goal is to read jp2 and zip files, for which GDAL is needed.
To check if GDAL is recognized by pyrasterframes, we run:
Trying to use the GDAL driver (as stated here):
We get this error: