lambgeo / docker-lambda

AWS Lambda friendly GDAL Docker images and AWS Lambda layer
MIT License

Is latest 3.6 compiled with parquet / arrow enabled? #57

Open ncgl-syngenta opened 1 year ago

ncgl-syngenta commented 1 year ago

I'm not proficient in C++, but I'm running into this problem when trying to run ogr2ogr with Parquet output:

import subprocess

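# Popen(...).poll() returns None while the process is still running,
# so use run(), which waits for ogr2ogr to exit before reporting the code.
result = subprocess.run(["ogr2ogr", "-f", "Parquet", "somedestination", "somelocation", "--debug", "ON"], stdout=subprocess.PIPE)
return_code = result.returncode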

This outputs the following:


b'GDAL 3.6.4, released 2023/04/17\n'
--
ERROR 1: Unable to find driver `Parquet'.
[ERROR] FileNotFoundError: somedestination
Traceback (most recent call last):
  File "/var/task/epsagon/wrappers/aws_lambda.py", line 137, in _lambda_wrapper
    result = func(*args, **kwargs)
  File "/var/task/application/v1/controller/console/test_gdal.py", line 26, in test
    df = gpd.read_parquet("somedestination")
  File "/mnt/efs/lib/geopandas/io/arrow.py", line 560, in _read_parquet
    table = parquet.read_table(path, columns=columns, filesystem=filesystem, **kwargs)
  File "/mnt/efs/lib/pyarrow/parquet/core.py", line 2926, in read_table
    dataset = _ParquetDatasetV2(
  File "/mnt/efs/lib/pyarrow/parquet/core.py", line 2477, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
  File "/mnt/efs/lib/pyarrow/dataset.py", line 762, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/mnt/efs/lib/pyarrow/dataset.py", line 445, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/mnt/efs/lib/pyarrow/dataset.py", line 421, in _ensure_single_source
    raise FileNotFoundError(path)

File paths have been replaced.
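
The FileNotFoundError in the traceback looks like a downstream symptom: ogr2ogr could not find the driver, wrote nothing, and geopandas then tried to read a file that was never created. Here is a minimal sketch of failing early instead. `run_checked` is a hypothetical helper name (not part of GDAL or geopandas), and the sketch assumes ogr2ogr exits with a non-zero code when the driver is missing.

```python
import subprocess


def run_checked(cmd):
    """Run a command, wait for it to exit, and raise if it failed.

    `run_checked` is a hypothetical helper name used for illustration.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    if result.returncode != 0:
        # Surface the tool's own error message instead of a later, unrelated one.
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# Example call (requires ogr2ogr on PATH; paths are the placeholders from above):
# run_checked(["ogr2ogr", "-f", "Parquet", "somedestination", "somelocation"])
```

With a check like this, the Lambda would stop at the ogr2ogr step and report the driver error directly, rather than continuing on to `gpd.read_parquet`.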

So, diving into the CMake flags, I see this: https://github.com/OSGeo/gdal/blob/634f60a4181c9db067a64dbfdd9f2872e4992927/ogr/ogrsf_frmts/generic/ogrregisterall.cpp#L251

But I don't see anything specifically disabling it in the build. Can anyone who reads C++ tell me whether outputting to Parquet is possible in the version built for this image?
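
One way to answer this without reading the C++ is to ask the installed binaries which drivers they were built with. The sketch below parses the output of `ogrinfo --formats`. The helper names are hypothetical, and the parser assumes each driver line is indented and has the form `  Parquet -vector- (rw+v): (Geo)Parquet`, which may differ between GDAL versions.

```python
import subprocess


def parse_driver_names(formats_output):
    """Extract driver short names from `--formats` style output.

    Assumes each driver line is indented and looks like
    "  Parquet -vector- (rw+v): (Geo)Parquet"; the layout may vary by GDAL version.
    """
    names = set()
    for line in formats_output.splitlines():
        # Driver lines are indented; the short name ends before the " -kind-" marker.
        if line.startswith(" ") and " -" in line:
            names.add(line.split(" -", 1)[0].strip())
    return names


def has_ogr_driver(name):
    """Return True if `ogrinfo --formats` lists the driver (needs GDAL on PATH)."""
    result = subprocess.run(
        ["ogrinfo", "--formats"], stdout=subprocess.PIPE, text=True
    )
    return name in parse_driver_names(result.stdout)


# Example call inside the Lambda environment:
# has_ogr_driver("Parquet")
```

If `has_ogr_driver("Parquet")` comes back False, the driver was not compiled in (or its plugin was not found), which would match the `Unable to find driver` error above.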

vincentsarago commented 1 year ago

We're not adding libarrow, so I guess this is why the Parquet driver is not available. But it would be a nice addition.

I'm not sure I'll have time right now sadly but I'll be happy to review any PR 🙏

ref: https://gdal.org/development/building_from_source.html#arrow https://gdal.org/development/building_from_source.html#parquet

ncgl-syngenta commented 1 year ago

Ok - I've been able to build my own version with arrow. I removed a lot of other installs since our specific use case only needs parquet support, but the zip size still shoots up, mainly because of an arrow dependency file (40MB on its own)
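
To see which files dominate the package size, a small script can list the largest files under the install prefix. This is a minimal sketch; `largest_files` is a hypothetical helper, and `/opt/lib` in the example call is the library directory used by the Dockerfile in this comment.

```python
import os


def largest_files(root, top=10):
    """Return up to `top` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.islink(path):
                # Skip symlinks so versioned .so aliases are not counted twice.
                continue
            sizes.append((os.path.getsize(path), path))
    return sorted(sizes, reverse=True)[:top]


# Example call inside the build image:
# for size, path in largest_files("/opt/lib"):
#     print(f"{size / 1e6:8.1f} MB  {path}")
```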

I'm going to paste the full Dockerfile that we're using, in case anyone comes across this:

# modified from https://github.com/lambgeo/docker-lambda/blob/master/dockerfiles/Dockerfile.gdal3.6

FROM public.ecr.aws/lambda/provided:al2 as builder

RUN yum makecache fast
RUN yum install -y autoconf libtool flex bison cmake make tar gzip gcc gcc-c++ automake16 nasm readline-devel openssl-devel curl-devel cmake3

ENV PREFIX=/opt
WORKDIR /opt

ENV LD_LIBRARY_PATH $PREFIX/lib:$LD_LIBRARY_PATH

# pkg-config
ENV PKGCONFIG_VERSION=0.29.2
RUN mkdir /tmp/pkg-config \
  && curl -sfL https://pkg-config.freedesktop.org/releases/pkg-config-${PKGCONFIG_VERSION}.tar.gz | tar zxf - -C /tmp/pkg-config --strip-components=1 \
  && cd /tmp/pkg-config \
  && CFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --with-internal-glib \
  && make -j $(nproc) --silent && make install && make clean \
  && rm -rf /tmp/pkg-config

ENV PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig/

# sqlite
RUN mkdir /tmp/sqlite \
  && curl -sfL https://www.sqlite.org/2020/sqlite-autoconf-3330000.tar.gz | tar zxf - -C /tmp/sqlite --strip-components=1 \
  && cd /tmp/sqlite \
  && CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --disable-static \
  && make -j $(nproc) --silent && make install && make clean \
  && rm -rf /tmp/sqlite

ENV \
  SQLITE3_LIBS="-L${PREFIX}/lib -lsqlite3" \
  SQLITE3_INCLUDE_DIR="${PREFIX}/include" \
  SQLITE3_CFLAGS="$CFLAGS -I${PREFIX}/include" \
  PATH=${PREFIX}/bin/:$PATH

# nghttp2
ENV NGHTTP2_VERSION=1.42.0
RUN mkdir /tmp/nghttp2 \
  && curl -sfL https://github.com/nghttp2/nghttp2/releases/download/v${NGHTTP2_VERSION}/nghttp2-${NGHTTP2_VERSION}.tar.gz | tar zxf - -C /tmp/nghttp2 --strip-components=1 \
  && cd /tmp/nghttp2 \
  && ./configure --enable-lib-only --prefix=$PREFIX \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/nghttp2

# libcurl
ENV CURL_VERSION=7.73.0
RUN mkdir /tmp/libcurl \
  && curl -sfL https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz | tar zxf - -C /tmp/libcurl --strip-components=1 \
  && cd /tmp/libcurl \
  && ./configure --disable-manual --disable-cookies --with-nghttp2=$PREFIX --prefix=$PREFIX \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/libcurl

# libtiff
ENV LIBTIFF_VERSION=4.5.0
RUN mkdir /tmp/libtiff \
  && curl -sfL https://download.osgeo.org/libtiff/tiff-${LIBTIFF_VERSION}.tar.gz | tar zxf - -C /tmp/libtiff --strip-components=1 \
  && cd /tmp/libtiff \
  && LDFLAGS="-Wl,-rpath,'\$\$ORIGIN'" CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure \
    --prefix=$PREFIX \
    --disable-static \
    --enable-rpath \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/libtiff

# geos
ENV GEOS_VERSION=3.11.2
RUN mkdir /tmp/geos \
  && curl -sfL https://github.com/libgeos/geos/archive/refs/tags/${GEOS_VERSION}.tar.gz | tar zxf - -C /tmp/geos --strip-components=1 \
  && cd /tmp/geos \
  && mkdir build && cd build \
  && cmake3 .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_TESTING=NO \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/geos

# proj
ENV PROJ_VERSION=9.2.0
RUN mkdir /tmp/proj && mkdir /tmp/proj/data \
  && curl -sfL https://github.com/OSGeo/proj/archive/${PROJ_VERSION}.tar.gz | tar zxf - -C /tmp/proj --strip-components=1 \
  && cd /tmp/proj \
  && mkdir build && cd build \
  && cmake3 .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_INSTALL_INCLUDEDIR:PATH=include \
    -DBUILD_TESTING=OFF \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/proj

# arrow (built with ARROW_PARQUET=ON so GDAL can find the Parquet libraries)
ENV ARROW_VERSION=12.0.0
RUN mkdir /tmp/arrow \
    && curl -sfL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz" | tar zxf - -C /tmp/arrow --strip-components=1 \
    && cd /tmp/arrow/cpp \
    && mkdir build && cd build \
    && cmake3 .. \
    -DCMAKE_INSTALL_PREFIX=$PREFIX \
    -DCMAKE_PREFIX_PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR=lib \
    -Dxsimd_SOURCE=BUNDLED \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_PARQUET=ON \
    && make -j $(nproc) --silent && make install \
    && rm -rf /tmp/arrow

# We use commit sha to make sure we are not using `cache` when building the docker image
# "7ca88116f5a46d429251361634eb24629f315076" is the latest commit on release/3.6 branch

# gdal
RUN mkdir /tmp/gdal \
  && curl -sfL https://github.com/OSGeo/gdal/archive/7ca88116f5a46d429251361634eb24629f315076.tar.gz | tar zxf - -C /tmp/gdal --strip-components=1 \
  && cd /tmp/gdal \
  && mkdir build && cd build \
  && cmake3 .. \
    -DGDAL_USE_EXTERNAL_LIBS=ON \
    -DCMAKE_BUILD_TYPE=MinSizeRel \
    -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR:PATH=lib \
    -DCMAKE_PREFIX_PATH=lib \
    -DGDAL_SET_INSTALL_RELATIVE_RPATH=ON \
    -DBUILD_PYTHON_BINDINGS=OFF \
    -DBUILD_TESTING=OFF \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
    -DGDAL_BUILD_OPTIONAL_DRIVERS=OFF \
    -DOGR_BUILD_OPTIONAL_DRIVERS=OFF \
    -DGDAL_USE_PARQUET=ON \
    -DGDAL_USE_ARROW=ON \
    -DOGR_ENABLE_DRIVER_ARROW=ON \
    -DOGR_ENABLE_DRIVER_ARROW_PLUGIN=ON \
    -DOGR_ENABLE_DRIVER_PARQUET=ON \
    -DOGR_ENABLE_DRIVER_PARQUET_PLUGIN=ON \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/gdal

# from https://github.com/pypa/manylinux/blob/d8ef5d47433ba771fa4403fd48f352c586e06e43/docker/build_scripts/build.sh#L133-L138
# Install patchelf (latest with unreleased bug fixes)
ENV PATCHELF_VERSION=0.10
RUN mkdir /tmp/patchelf \
  && curl -sfL https://github.com/NixOS/patchelf/archive/${PATCHELF_VERSION}.tar.gz | tar zxf - -C /tmp/patchelf --strip-components=1 \
  && cd /tmp/patchelf \
  && ./bootstrap.sh \
  && ./configure \
  && make -j $(nproc) --silent && make install \
  && cd / && rm -rf /tmp/patchelf

# FIX
RUN for i in $PREFIX/bin/*; do patchelf --force-rpath --set-rpath '$ORIGIN/../lib' $i; done

# Build final image
FROM public.ecr.aws/lambda/provided:al2 as runner

ENV PREFIX=/opt

COPY --from=builder $PREFIX/lib/ $PREFIX/lib/
COPY --from=builder $PREFIX/include/ $PREFIX/include/
COPY --from=builder $PREFIX/share/ $PREFIX/share/
COPY --from=builder $PREFIX/bin/ $PREFIX/bin/

RUN export GDAL_VERSION=$(gdal-config --version)

RUN yum install -y zip binutils

# remove any unneeded files
RUN rm -rdf $PREFIX/share/doc \
    && rm -rdf $PREFIX/share/man \
    && rm -rdf $PREFIX/share/cryptopp \
    && rm -rdf $PREFIX/share/hdf*

RUN cd $PREFIX \
    && find lib/ -type f -name \*.so\* -exec strip {} \; \
    && zip -r9q --symlinks /tmp/package.zip lib \
    && zip -r9q --symlinks /tmp/package.zip share \
    && zip -r9q --symlinks /tmp/package.zip bin/gdal* bin/ogr* bin/geos* bin/arrow* bin/proj* \
    && mv /tmp/package.zip /package.zip

FROM scratch AS exporter
COPY --from=runner /package.zip .
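
Once `package.zip` has been exported, it can be worth confirming that the Parquet pieces actually made it into the archive. The sketch below is a hedged check, not part of the original build: `find_entries` is a hypothetical helper, and the expectation that the plugin and Arrow libraries have "parquet" in their file names is an assumption about this particular build.

```python
import zipfile


def find_entries(zip_path, needle):
    """Return names of archive entries whose path contains `needle` (case-insensitive)."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name for name in archive.namelist() if needle.lower() in name.lower()
        ]


# Example call against the exported layer archive:
# print(find_entries("package.zip", "parquet"))
```

An empty list would suggest the driver or its libraries were not installed under `lib`, which is worth checking before deploying the layer.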

vincentsarago commented 1 year ago

It seems a lot of Arrow compilation options are set to ON by default and could be turned off: https://arrow.apache.org/docs/developers/cpp/building.html#optional-components

vincentsarago commented 1 year ago

Some related issues:

https://github.com/apache/arrow/issues/33126
https://github.com/aws/aws-sdk-pandas/pull/1977