Closed nvartolomei closed 3 years ago
It seems that the timeout can be easily increased in lambda settings. https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/
15 minutes ought to be enough for everyone.
Yeah, and I provide a frontend for that using, e.g., llama update-function -timeout 10m gcc. It might be reasonable to catch the error and point at documentation, though.
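As a rough illustration of catching the error and pointing at documentation, here is a minimal Python sketch. It assumes the timeout arrives as a JSON payload with an errorMessage field containing "Task timed out after N seconds", which is the shape of the error reported later in this thread. The helper name timeout_hint is hypothetical and not part of llama.

```python
import json
import re

# Hypothetical helper, not part of llama: recognizes a Lambda timeout payload.
_TIMEOUT_RE = re.compile(r"Task timed out after ([0-9.]+) seconds")


def timeout_hint(payload, function_name):
    """Return a human-readable hint if payload is a Lambda timeout, else None."""
    try:
        message = json.loads(payload).get("errorMessage", "")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: not a payload we recognize.
        return None
    if not isinstance(message, str):
        return None
    match = _TIMEOUT_RE.search(message)
    if match is None:
        return None
    seconds = float(match.group(1))
    return (
        f"{function_name} timed out after {seconds:.0f}s on Lambda; "
        "consider raising the timeout or memory with llama update-function."
    )
```

A caller would print the hint alongside the raw error when it is not None, and pass any other error through unchanged.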
I was wrong about 15 minutes being enough.
FAILED: src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o
/usr/bin/ccache /home/nv/go/bin/llamac++ -DBOOST_ASIO_STANDALONE=1 -DLZ4_DISABLE_DEPRECATE_WARNINGS=1 -DPOCO_ENABLE_CPP11 -DPOCO_HAVE_FD_EPOLL -DPOCO_OS_FAMILY_UNIX -DUNALIGNED_OK -DUSE_JEMALLOC=0 -DUSE_REPLXX=0 -DUSE_XXHASH=1 -DWITH_COVERAGE=0 -DWITH_GZFILEOP -DX86_64 -DZLIB_COMPAT -I../contrib/sentry-native/include -Iincludes/configs -I../src -Isrc -Isrc/Core/include -I../base/glibc-compatibility/memcpy -I../base/common/.. -Ibase/common/.. -I../contrib/cityhash102/include -I../contrib/cctz/include -Icontrib/zlib-ng -I../contrib/zlib-ng -I../base/pcg-random/. -I../contrib/lz4/lib -isystem ../contrib/sparsehash-c11 -isystem ../contrib/miniselect/include -isystem ../contrib/pdqsort -isystem ../contrib/llvm/llvm/include -isystem contrib/llvm/llvm/include -isystem ../contrib/libcxx/include -isystem ../contrib/libcxxabi/include -isystem ../contrib/antlr4-runtime -isystem ../contrib/fast_float/include -isystem ../contrib/xz/src/liblzma/api -isystem ../contrib/zstd/lib -isystem ../contrib/re2 -isystem ../contrib/boost -isystem ../contrib/poco/Net/include -isystem ../contrib/poco/Foundation/include -isystem ../contrib/poco/Util/include -isystem ../contrib/poco/JSON/include -isystem ../contrib/poco/XML/include -isystem ../contrib/fmtlib-cmake/../fmtlib/include -isystem ../contrib/double-conversion -isystem ../contrib/dragonbox/include -isystem contrib/re2_st -isystem ../contrib/croaring/cpp -isystem ../contrib/croaring/include -isystem ../contrib/libdivide/. 
-isystem ../contrib/poco/MongoDB/include -isystem ../contrib/libc-headers/x86_64-linux-gnu -isystem ../contrib/libc-headers -fdiagnostics-color=always -fsized-deallocation -gdwarf-aranges -msse4.1 -msse4.2 -mpopcnt -fasynchronous-unwind-tables -falign-functions=32 -Wall -Wno-unused-command-line-argument -fdiagnostics-absolute-paths -fexperimental-new-pass-manager -Werror -Wextra -Wframe-larger-than=65536 -Wpedantic -Wno-vla-extension -Wno-zero-length-array -Wno-c11-extensions -Wcomma -Wconditional-uninitialized -Wcovered-switch-default -Wdeprecated -Wembedded-directive -Wempty-init-stmt -Wextra-semi-stmt -Wextra-semi -Wgnu-case-range -Winconsistent-missing-destructor-override -Wnewline-eof -Wold-style-cast -Wrange-loop-analysis -Wredundant-parens -Wreserved-id-macro -Wshadow-field -Wshadow-uncaptured-local -Wshadow -Wstring-plus-int -Wundef -Wunreachable-code-return -Wunreachable-code -Wunused-exception-parameter -Wunused-macros -Wunused-member-function -Wzero-as-null-pointer-constant -Weverything -Wno-c++98-compat-pedantic -Wno-c++98-compat -Wno-c99-extensions -Wno-conversion -Wno-ctad-maybe-unsupported -Wno-deprecated-dynamic-exception-spec -Wno-disabled-macro-expansion -Wno-documentation-unknown-command -Wno-double-promotion -Wno-exit-time-destructors -Wno-float-equal -Wno-global-constructors -Wno-missing-prototypes -Wno-missing-variable-declarations -Wno-nested-anon-types -Wno-packed -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shift-sign-overflow -Wno-sign-conversion -Wno-switch-enum -Wno-undefined-func-template -Wno-unused-template -Wno-vla -Wno-weak-template-vtables -Wno-weak-vtables -O2 -g -DNDEBUG -O3 -fno-pie -D OS_LINUX -nostdinc++ -pthread -std=gnu++20 -MD -MT src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o -MF src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o.d -o src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o -c ../src/Interpreters/ActionsDAG.cpp
Running gcc: Function returned error: "{\"errorMessage\":\"2021-05-23T19:11:02.568Z fadbe765-092c-4293-befb-e3d0cfdf1336 Task timed out after 900.02 seconds\"}"
ninja: build stopped: subcommand failed.
🤔
It takes ~100 seconds on my laptop. Any tips on how to quickly debug why this is taking so long on Lambda?
Try giving the Lambda function more memory. I've seen a few cases where the llama default of 1769 MB is enough for compilation not to OOM, but low enough that it starts paging like the dickens and is super slow. I'd be pretty surprised if it actually needs 15m to compile a single file...
There's also the tail chance of a weird deadlock or something in the runtime, I suppose...
I was able to reproduce this failure locally; giving the image 3GB of RAM via
llama update-function --memory=3096 clang-13
was indeed enough to fix it.
Note that increasing memory scales the cost linearly under the Lambda pricing model, so it would be nice to have better tools for identifying these cases. One option is to keep two versions of the function, one with a "normal" amount of memory and one with a large amount, or to otherwise manage these situations somehow…
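To make the linear scaling concrete, here is a small Python sketch of the compute portion of a Lambda bill, which is charged per GB-second. The price is passed in as a parameter because it varies by region and architecture, and the value used in the example is an assumption, not a quoted rate. Per-request fees are ignored.

```python
def lambda_compute_cost(memory_mb, duration_s, price_per_gb_second):
    """Compute-only cost of one invocation: GB of memory x seconds x unit price."""
    return (memory_mb / 1024.0) * duration_s * price_per_gb_second


def cost_ratio(new_memory_mb, old_memory_mb, new_duration_s, old_duration_s):
    """How much more (or less) an invocation costs after a memory change.

    The unit price cancels out, so only memory and duration matter.
    """
    return (new_memory_mb * new_duration_s) / (old_memory_mb * old_duration_s)


# Assumed unit price, for illustration only; check current AWS pricing.
ASSUMED_PRICE = 0.0000166667

# Same duration, memory raised from the 1769 MB default to 3096 MB:
same_duration = cost_ratio(3096, 1769, 100, 100)  # roughly 1.75x the cost
```

If the extra memory also shortens the run, for instance by avoiding paging, the ratio drops accordingly, so a larger function is not always more expensive per compilation.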
Currently compilation aborts with errors like these:
Maybe it is worth allowing fallback to local compilation?
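One way such a fallback could look is sketched below in Python. This is not how llama works today; compile_with_fallback is a hypothetical wrapper, and the two commands are whatever the caller supplies (for example, the remote compiler driver and the local compiler with the same arguments).

```python
import subprocess
import sys


def compile_with_fallback(remote_cmd, local_cmd):
    """Run remote_cmd; if it fails, retry with local_cmd.

    Returns a (where, returncode) tuple, where `where` is "remote" or "local".
    """
    try:
        remote_rc = subprocess.run(remote_cmd).returncode
    except OSError:
        # The remote driver could not be started at all; treat as a failure.
        remote_rc = 1
    if remote_rc == 0:
        return "remote", 0
    print("remote compilation failed; falling back to local", file=sys.stderr)
    local_rc = subprocess.run(local_cmd).returncode
    return "local", local_rc
```

This sketch falls back on any non-zero exit, including genuine compiler errors, which would then be reported twice. A real implementation would probably fall back only on infrastructure failures such as timeouts.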