nelhage / llama

Apache License 2.0

llamacc: Handle lambda invocation timeouts #22

Closed nvartolomei closed 3 years ago

nvartolomei commented 3 years ago

Currently compilation aborts with errors like these:

Running gcc: Function returned error: "{\"errorMessage\":\"2021-05-23T17:37:04.460Z 55fe68b3-ef15-417e-94db-d2e9f6351f60 Task timed out after 60.07 seconds\"}"

Maybe it is worth allowing fallback to local compilation?
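A minimal sketch of what such a fallback could look like, assuming the remote and local compile steps are each wrapped in a function. The names isLambdaTimeout and compileWithFallback are hypothetical and not part of llama, and matching on the message text is an assumption based on the error shown above:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Sample errors used for illustration only.
var (
	sampleTimeout = errors.New(`Function returned error: "{\"errorMessage\":\"2021-05-23T17:37:04.460Z 55fe68b3-ef15-417e-94db-d2e9f6351f60 Task timed out after 60.07 seconds\"}"`)
	sampleCompileError = errors.New("error: expected ';' after expression")
)

// isLambdaTimeout reports whether err looks like the Lambda timeout above.
// Matching on the message text is an assumption; a real client might
// expose a typed error instead.
func isLambdaTimeout(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Task timed out after")
}

// compileWithFallback runs remote first and calls local only when the
// remote attempt failed with a timeout. Any other result (success or a
// genuine compile error) is returned unchanged.
func compileWithFallback(remote, local func() error) error {
	err := remote()
	if !isLambdaTimeout(err) {
		return err
	}
	fmt.Fprintln(os.Stderr, "remote compile timed out; retrying locally")
	return local()
}

func main() {
	remote := func() error { return sampleTimeout }
	local := func() error { return nil } // stand-in for running the local compiler
	fmt.Println("result:", compileWithFallback(remote, local))
}
```

Falling back only on timeouts keeps real compiler diagnostics from being masked by a second local run.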

nvartolomei commented 3 years ago

It seems that the timeout can be easily increased in the Lambda settings. https://aws.amazon.com/about-aws/whats-new/2018/10/aws-lambda-supports-functions-that-can-run-up-to-15-minutes/

15 minutes ought to be enough for everyone.

nelhage commented 3 years ago

Yeah, and I provide a frontend for that using (e.g.) llama update-function -timeout 10m gcc. It might be reasonable to catch the error and point at documentation, though.
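As a rough sketch of catching the error and pointing at the fix: the code below parses the JSON payload from the error earlier in this thread and, if it is a timeout, builds a hint that mentions the update-function command. The type lambdaError and the function timeoutHint are hypothetical names, not llama's actual code, and the hint wording is made up:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// lambdaError mirrors the JSON payload seen in the error message.
type lambdaError struct {
	ErrorMessage string `json:"errorMessage"`
}

// timeoutRE captures the reported duration from a Lambda timeout message.
var timeoutRE = regexp.MustCompile(`Task timed out after ([0-9.]+) seconds`)

// timeoutHint returns a user-facing hint and true if payload describes a
// Lambda timeout; otherwise it returns "" and false.
func timeoutHint(payload, function string) (string, bool) {
	var e lambdaError
	if err := json.Unmarshal([]byte(payload), &e); err != nil {
		return "", false
	}
	m := timeoutRE.FindStringSubmatch(e.ErrorMessage)
	if m == nil {
		return "", false
	}
	return fmt.Sprintf(
		"%s timed out after %s seconds; consider raising the limit, e.g. llama update-function -timeout 10m %s",
		function, m[1], function), true
}

func main() {
	payload := `{"errorMessage":"2021-05-23T17:37:04.460Z 55fe68b3-ef15-417e-94db-d2e9f6351f60 Task timed out after 60.07 seconds"}`
	if hint, ok := timeoutHint(payload, "gcc"); ok {
		fmt.Println(hint)
	}
}
```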

nvartolomei commented 3 years ago

I was wrong about 15 minutes being enough.

FAILED: src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o
/usr/bin/ccache /home/nv/go/bin/llamac++ -DBOOST_ASIO_STANDALONE=1 -DLZ4_DISABLE_DEPRECATE_WARNINGS=1 -DPOCO_ENABLE_CPP11 -DPOCO_HAVE_FD_EPOLL -DPOCO_OS_FAMILY_UNIX -DUNALIGNED_OK -DUSE_JEMALLOC=0 -DUSE_REPLXX=0 -DUSE_XXHASH=1 -DWITH_COVERAGE=0 -DWITH_GZFILEOP -DX86_64 -DZLIB_COMPAT -I../contrib/sentry-native/include -Iincludes/configs -I../src -Isrc -Isrc/Core/include -I../base/glibc-compatibility/memcpy -I../base/common/.. -Ibase/common/.. -I../contrib/cityhash102/include -I../contrib/cctz/include -Icontrib/zlib-ng -I../contrib/zlib-ng -I../base/pcg-random/. -I../contrib/lz4/lib -isystem ../contrib/sparsehash-c11 -isystem ../contrib/miniselect/include -isystem ../contrib/pdqsort -isystem ../contrib/llvm/llvm/include -isystem contrib/llvm/llvm/include -isystem ../contrib/libcxx/include -isystem ../contrib/libcxxabi/include -isystem ../contrib/antlr4-runtime -isystem ../contrib/fast_float/include -isystem ../contrib/xz/src/liblzma/api -isystem ../contrib/zstd/lib -isystem ../contrib/re2 -isystem ../contrib/boost -isystem ../contrib/poco/Net/include -isystem ../contrib/poco/Foundation/include -isystem ../contrib/poco/Util/include -isystem ../contrib/poco/JSON/include -isystem ../contrib/poco/XML/include -isystem ../contrib/fmtlib-cmake/../fmtlib/include -isystem ../contrib/double-conversion -isystem ../contrib/dragonbox/include -isystem contrib/re2_st -isystem ../contrib/croaring/cpp -isystem ../contrib/croaring/include -isystem ../contrib/libdivide/. 
-isystem ../contrib/poco/MongoDB/include -isystem ../contrib/libc-headers/x86_64-linux-gnu -isystem ../contrib/libc-headers -fdiagnostics-color=always -fsized-deallocation  -gdwarf-aranges -msse4.1 -msse4.2 -mpopcnt -fasynchronous-unwind-tables -falign-functions=32   -Wall -Wno-unused-command-line-argument  -fdiagnostics-absolute-paths -fexperimental-new-pass-manager -Werror -Wextra -Wframe-larger-than=65536 -Wpedantic -Wno-vla-extension -Wno-zero-length-array -Wno-c11-extensions -Wcomma -Wconditional-uninitialized -Wcovered-switch-default -Wdeprecated -Wembedded-directive -Wempty-init-stmt -Wextra-semi-stmt -Wextra-semi -Wgnu-case-range -Winconsistent-missing-destructor-override -Wnewline-eof -Wold-style-cast -Wrange-loop-analysis -Wredundant-parens -Wreserved-id-macro -Wshadow-field -Wshadow-uncaptured-local -Wshadow -Wstring-plus-int -Wundef -Wunreachable-code-return -Wunreachable-code -Wunused-exception-parameter -Wunused-macros -Wunused-member-function -Wzero-as-null-pointer-constant -Weverything -Wno-c++98-compat-pedantic -Wno-c++98-compat -Wno-c99-extensions -Wno-conversion -Wno-ctad-maybe-unsupported -Wno-deprecated-dynamic-exception-spec -Wno-disabled-macro-expansion -Wno-documentation-unknown-command -Wno-double-promotion -Wno-exit-time-destructors -Wno-float-equal -Wno-global-constructors -Wno-missing-prototypes -Wno-missing-variable-declarations -Wno-nested-anon-types -Wno-packed -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shift-sign-overflow -Wno-sign-conversion -Wno-switch-enum -Wno-undefined-func-template -Wno-unused-template -Wno-vla -Wno-weak-template-vtables -Wno-weak-vtables -O2 -g -DNDEBUG -O3  -fno-pie   -D OS_LINUX -nostdinc++ -pthread -std=gnu++20 -MD -MT src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o -MF src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o.d -o src/CMakeFiles/dbms.dir/Interpreters/ActionsDAG.cpp.o -c ../src/Interpreters/ActionsDAG.cpp
Running gcc: Function returned error: "{\"errorMessage\":\"2021-05-23T19:11:02.568Z fadbe765-092c-4293-befb-e3d0cfdf1336 Task timed out after 900.02 seconds\"}"
ninja: build stopped: subcommand failed.

🤔

It takes ~100 seconds on my laptop. Any tips on how to quickly debug why this is taking so long on Lambda?

nelhage commented 3 years ago

Try giving the Lambda function more memory. I've seen a few cases where the llama default of 1769 MB is enough for compilation not to OOM, but low enough that it starts paging like the dickens and is super slow. I'd be pretty surprised if it actually needs 15m to compile a single file...

There's also the tail chance of a weird deadlock or something in the runtime, I suppose...

nelhage commented 3 years ago

I was able to reproduce this failure locally; giving the image 3GB of RAM via

llama update-function --memory=3096 clang-13

was indeed enough to fix it.

Note that increasing memory scales the cost linearly under the Lambda pricing model, so it would be nice to have better tools for identifying these cases. One option would be to keep two versions of the function, one with a "normal" amount of memory and one with a large amount, or otherwise somehow manage these situations…
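To make the linear scaling concrete, here is a small sketch of the arithmetic, assuming the compute charge is memory (in GB) times duration (in seconds) times a per-GB-second rate. The function name invocationCost is hypothetical, the rate in main is an assumed figure for illustration only, and per-request charges are ignored:

```go
package main

import "fmt"

// invocationCost returns the compute charge for a single invocation.
// pricePerGBSecond is a parameter because the rate varies by region and
// architecture; check current pricing rather than trusting a constant.
func invocationCost(memoryMB int, seconds, pricePerGBSecond float64) float64 {
	return float64(memoryMB) / 1024 * seconds * pricePerGBSecond
}

func main() {
	const price = 0.0000166667 // assumed USD per GB-second, illustration only
	// Compare the two memory settings mentioned in this thread for the
	// same (hypothetical) 100-second compile.
	for _, mb := range []int{1769, 3096} {
		fmt.Printf("%4d MB, 100 s: $%.6f\n", mb, invocationCost(mb, 100, price))
	}
}
```

For equal durations the larger setting costs 3096/1769, or about 1.75 times, as much. In a case like the one above, though, the extra memory also shortens the run, so the total charge for that file can still go down.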