Open ncgl-syngenta opened 1 year ago
We're not adding libarrow, so I guess this is why the Parquet
driver is not available. It would be a nice addition, though.
Sadly, I'm not sure I'll have time right now, but I'll be happy to review any PR 🙏
ref: https://gdal.org/development/building_from_source.html#arrow https://gdal.org/development/building_from_source.html#parquet
OK, I've been able to build my own version with Arrow. I removed a lot of other installs since our specific use case only needs Parquet support, but the zip size still shoots up, mainly because of an Arrow dependency file (40MB on its own).
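To see which files dominate the package, something like the following could be run inside the image before zipping. It is only a sketch: `largest_files` is a hypothetical helper name, and the `/opt/lib` path in the usage note is assumed from the `PREFIX=/opt` setting in the Dockerfile.

```shell
# largest_files DIR [N]: print the N largest regular files under DIR as
# "<size in bytes> <path>", biggest first. N defaults to 10.
largest_files() {
    dir=$1
    count=${2:-10}
    find "$dir" -type f -exec sh -c '
        for f in "$@"; do
            # wc -c on redirected input prints only the byte count
            printf "%d %s\n" "$(wc -c < "$f")" "$f"
        done' sh {} + | sort -rn | head -n "$count"
}
```

For example, `largest_files /opt/lib 5` lists the five biggest files. If the top entry turns out to be a static archive (`.a`), it is not needed at runtime and could probably be left out of the zip.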
I'm going to paste the full Dockerfile that we're using, in case anyone comes across this:
# modified from https://github.com/lambgeo/docker-lambda/blob/master/dockerfiles/Dockerfile.gdal3.6
FROM public.ecr.aws/lambda/provided:al2 as builder
RUN yum makecache fast
RUN yum install -y autoconf libtool flex bison cmake make tar gzip gcc gcc-c++ automake16 nasm readline-devel openssl-devel curl-devel cmake3
ENV PREFIX=/opt
WORKDIR /opt
ENV LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
# pkg-config
ENV PKGCONFIG_VERSION=0.29.2
RUN mkdir /tmp/pkg-config \
&& curl -sfL https://pkg-config.freedesktop.org/releases/pkg-config-${PKGCONFIG_VERSION}.tar.gz | tar zxf - -C /tmp/pkg-config --strip-components=1 \
&& cd /tmp/pkg-config \
&& CFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --with-internal-glib \
&& make -j $(nproc) --silent && make install && make clean \
&& rm -rf /tmp/pkg-config
ENV PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig/
# sqlite
RUN mkdir /tmp/sqlite \
&& curl -sfL https://www.sqlite.org/2020/sqlite-autoconf-3330000.tar.gz | tar zxf - -C /tmp/sqlite --strip-components=1 \
&& cd /tmp/sqlite \
&& CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure --prefix=$PREFIX --disable-static \
&& make -j $(nproc) --silent && make install && make clean \
&& rm -rf /tmp/sqlite
ENV \
SQLITE3_LIBS="-L${PREFIX}/lib -lsqlite3" \
SQLITE3_INCLUDE_DIR="${PREFIX}/include" \
SQLITE3_CFLAGS="$CFLAGS -I${PREFIX}/include" \
PATH=${PREFIX}/bin/:$PATH
# nghttp2
ENV NGHTTP2_VERSION=1.42.0
RUN mkdir /tmp/nghttp2 \
&& curl -sfL https://github.com/nghttp2/nghttp2/releases/download/v${NGHTTP2_VERSION}/nghttp2-${NGHTTP2_VERSION}.tar.gz | tar zxf - -C /tmp/nghttp2 --strip-components=1 \
&& cd /tmp/nghttp2 \
&& ./configure --enable-lib-only --prefix=$PREFIX \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/nghttp2
# libcurl
ENV CURL_VERSION=7.73.0
RUN mkdir /tmp/libcurl \
&& curl -sfL https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz | tar zxf - -C /tmp/libcurl --strip-components=1 \
&& cd /tmp/libcurl \
&& ./configure --disable-manual --disable-cookies --with-nghttp2=$PREFIX --prefix=$PREFIX \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/libcurl
# libtiff
ENV LIBTIFF_VERSION=4.5.0
RUN mkdir /tmp/libtiff \
&& curl -sfL https://download.osgeo.org/libtiff/tiff-${LIBTIFF_VERSION}.tar.gz | tar zxf - -C /tmp/libtiff --strip-components=1 \
&& cd /tmp/libtiff \
&& LDFLAGS="-Wl,-rpath,'\$\$ORIGIN'" CFLAGS="-O2 -Wl,-S" CXXFLAGS="-O2 -Wl,-S" ./configure \
--prefix=$PREFIX \
--disable-static \
--enable-rpath \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/libtiff
# geos
ENV GEOS_VERSION=3.11.2
RUN mkdir /tmp/geos \
&& curl -sfL https://github.com/libgeos/geos/archive/refs/tags/${GEOS_VERSION}.tar.gz | tar zxf - -C /tmp/geos --strip-components=1 \
&& cd /tmp/geos \
&& mkdir build && cd build \
&& cmake3 .. \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTING=NO \
-DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
-DCMAKE_INSTALL_LIBDIR:PATH=lib \
-DCMAKE_C_FLAGS="-O2 -Wl,-S" \
-DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/geos
# proj
ENV PROJ_VERSION=9.2.0
RUN mkdir /tmp/proj && mkdir /tmp/proj/data \
&& curl -sfL https://github.com/OSGeo/proj/archive/${PROJ_VERSION}.tar.gz | tar zxf - -C /tmp/proj --strip-components=1 \
&& cd /tmp/proj \
&& mkdir build && cd build \
&& cmake3 .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
-DCMAKE_INSTALL_LIBDIR:PATH=lib \
-DCMAKE_INSTALL_INCLUDEDIR:PATH=include \
-DBUILD_TESTING=OFF \
-DCMAKE_C_FLAGS="-O2 -Wl,-S" \
-DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/proj
# arrow
ENV ARROW_VERSION=12.0.0
RUN mkdir /tmp/arrow \
&& curl -sfL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz" | tar zxf - -C /tmp/arrow --strip-components=1 \
&& cd /tmp/arrow/cpp \
&& mkdir build && cd build \
&& cmake3 .. \
-DCMAKE_INSTALL_PREFIX=$PREFIX \
-DCMAKE_PREFIX_PATH=$PREFIX \
-DCMAKE_INSTALL_LIBDIR=lib \
-Dxsimd_SOURCE=BUNDLED \
-DCMAKE_C_FLAGS="-O2 -Wl,-S" \
-DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
-DARROW_BUILD_TESTS=OFF \
-DARROW_PARQUET=ON \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/arrow
# We use commit sha to make sure we are not using `cache` when building the docker image
# "7ca88116f5a46d429251361634eb24629f315076" is the latest commit on release/3.6 branch
# gdal
RUN mkdir /tmp/gdal \
&& curl -sfL https://github.com/OSGeo/gdal/archive/7ca88116f5a46d429251361634eb24629f315076.tar.gz | tar zxf - -C /tmp/gdal --strip-components=1 \
&& cd /tmp/gdal \
&& mkdir build && cd build \
&& cmake3 .. \
-DGDAL_USE_EXTERNAL_LIBS=ON \
-DCMAKE_BUILD_TYPE=MinSizeRel \
-DCMAKE_INSTALL_PREFIX:PATH=$PREFIX \
-DCMAKE_INSTALL_LIBDIR:PATH=lib \
-DCMAKE_PREFIX_PATH=$PREFIX \
-DGDAL_SET_INSTALL_RELATIVE_RPATH=ON \
-DBUILD_PYTHON_BINDINGS=OFF \
-DBUILD_TESTING=OFF \
-DCMAKE_C_FLAGS="-O2 -Wl,-S" \
-DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
-DGDAL_BUILD_OPTIONAL_DRIVERS=OFF \
-DOGR_BUILD_OPTIONAL_DRIVERS=OFF \
-DGDAL_USE_PARQUET=ON \
-DGDAL_USE_ARROW=ON \
-DOGR_ENABLE_DRIVER_ARROW=ON \
-DOGR_ENABLE_DRIVER_ARROW_PLUGIN=ON \
-DOGR_ENABLE_DRIVER_PARQUET=ON \
-DOGR_ENABLE_DRIVER_PARQUET_PLUGIN=ON \
&& make -j $(nproc) --silent && make install \
&& rm -rf /tmp/gdal
# from https://github.com/pypa/manylinux/blob/d8ef5d47433ba771fa4403fd48f352c586e06e43/docker/build_scripts/build.sh#L133-L138
# Install patchelf (latest with unreleased bug fixes)
ENV PATCHELF_VERSION=0.10
RUN mkdir /tmp/patchelf \
&& curl -sfL https://github.com/NixOS/patchelf/archive/${PATCHELF_VERSION}.tar.gz | tar zxf - -C /tmp/patchelf --strip-components=1 \
&& cd /tmp/patchelf \
&& ./bootstrap.sh \
&& ./configure \
&& make -j $(nproc) --silent && make install \
&& cd / && rm -rf /tmp/patchelf
# FIX
RUN for i in $PREFIX/bin/*; do patchelf --force-rpath --set-rpath '$ORIGIN/../lib' $i; done
# Build final image
FROM public.ecr.aws/lambda/provided:al2 as runner
ENV PREFIX=/opt
COPY --from=builder $PREFIX/lib/ $PREFIX/lib/
COPY --from=builder $PREFIX/include/ $PREFIX/include/
COPY --from=builder $PREFIX/share/ $PREFIX/share/
COPY --from=builder $PREFIX/bin/ $PREFIX/bin/
# Note: a variable exported inside RUN does not persist beyond this layer
RUN export GDAL_VERSION=$(gdal-config --version)
RUN yum install -y zip binutils
# remove any unneeded files
RUN rm -rdf $PREFIX/share/doc \
&& rm -rdf $PREFIX/share/man \
&& rm -rdf $PREFIX/share/cryptopp \
&& rm -rdf $PREFIX/share/hdf*
RUN cd $PREFIX \
&& find lib/ -type f -name \*.so\* -exec strip {} \; \
&& zip -r9q --symlinks /tmp/package.zip lib \
&& zip -r9q --symlinks /tmp/package.zip share \
&& zip -r9q --symlinks /tmp/package.zip bin/gdal* bin/ogr* bin/geos* bin/arrow* bin/proj* \
&& mv /tmp/package.zip /package.zip
FROM scratch AS exporter
COPY --from=runner /package.zip .
It seems there are a lot of Arrow compilation options that are set to ON
by default and could be turned off: https://arrow.apache.org/docs/developers/cpp/building.html#optional-components
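As a starting point, the Arrow step could look something like this. It is an untested sketch: the last four `-D` flags are additions, and while they are options in Arrow's CMake build, whether GDAL still builds against the result and how much space they actually save are assumptions that would need checking.

```dockerfile
# Sketch of a slimmer Arrow build (untested). The last four -D flags are new;
# everything else is copied from the Arrow step in the Dockerfile above.
ENV ARROW_VERSION=12.0.0
RUN mkdir /tmp/arrow \
  && curl -sfL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/arrow-${ARROW_VERSION}/apache-arrow-${ARROW_VERSION}.tar.gz" | tar zxf - -C /tmp/arrow --strip-components=1 \
  && cd /tmp/arrow/cpp \
  && mkdir build && cd build \
  && cmake3 .. \
    -DCMAKE_INSTALL_PREFIX=$PREFIX \
    -DCMAKE_PREFIX_PATH=$PREFIX \
    -DCMAKE_INSTALL_LIBDIR=lib \
    -Dxsimd_SOURCE=BUNDLED \
    -DCMAKE_C_FLAGS="-O2 -Wl,-S" \
    -DCMAKE_CXX_FLAGS="-O2 -Wl,-S" \
    -DARROW_BUILD_TESTS=OFF \
    -DARROW_PARQUET=ON \
    -DARROW_BUILD_STATIC=OFF \
    -DARROW_JEMALLOC=OFF \
    -DARROW_WITH_RE2=OFF \
    -DARROW_WITH_UTF8PROC=OFF \
  && make -j $(nproc) --silent && make install \
  && rm -rf /tmp/arrow
```

`ARROW_BUILD_STATIC=OFF` skips the static archives, which the zip step would otherwise pick up along with everything else in `lib`.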
I'm not proficient in C++, but I'm running into this problem when trying to run ogr2ogr with a Parquet output:
This outputs the following:
(File paths have been replaced.)
So, diving into the CMake flags, I see this: https://github.com/OSGeo/gdal/blob/634f60a4181c9db067a64dbfdd9f2872e4992927/ogr/ogrsf_frmts/generic/ogrregisterall.cpp#L251
but I don't see anything specifically disabling it in the build. Can anyone who reads C++ tell me whether writing to Parquet is possible in the version built for this image?
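One way to check this without reading the C++ might be to ask the built binaries which drivers they registered. A minimal sketch, assuming the `--formats` listing prints one driver per line as `Name -type- (flags): Long name`, and that a `w` in the flags means the driver can write; `driver_flags` and `can_write` are hypothetical helper names.

```shell
# driver_flags NAME: read a `--formats` listing on stdin and print the
# capability flags (for example "rw+v") of the driver called NAME.
# Returns non-zero if the driver is not listed.
driver_flags() {
    awk -v name="$1" '
        $1 == name && match($0, /\([^)]*\):/) {
            # strip the surrounding "(" and "):" from the matched text
            print substr($0, RSTART + 1, RLENGTH - 3)
            found = 1
        }
        END { exit(found ? 0 : 1) }'
}

# can_write NAME: succeed only if the listing on stdin reports write
# support (a "w" in the flags) for the driver called NAME.
can_write() {
    case "$(driver_flags "$1")" in
        *w*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

Usage would be something like `ogrinfo --formats | can_write Parquet && echo writable`. Because the Dockerfile builds Parquet as a plugin, the driver should only show up if GDAL can load it from its plugin directory, as far as I understand.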