pytorch / torchchat

Run PyTorch LLMs locally on servers, desktop and mobile

AOTI/DSO model does not run in Linux #996

Open · lhl opened this issue 1 month ago

lhl commented 1 month ago

🐛 Describe the bug

I am running an Arch Linux system with a 4090/3090 and an up-to-date CUDA 12.5 (Build cuda_12.5.r12.5/compiler.34385749_0).

I created a new mamba env for torchchat and ran the install. Regular inferencing (e.g., with generate) works fine.

I compiled an AOTI model per the README:

❯ time python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torchao/ops.py:12: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  return torch.library.impl_abstract(f"{name}")(func)
Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.4.0 available.
Using device=cuda
Loading model...
Time to load model: 2.54 seconds
-----------------------------------------------------------
Exporting model using AOT Inductor to /home/local/torchchat/exportedModels/llama3.1.so
W0802 22:25:40.607000 126075654027072 torch/fx/experimental/symbolic_shapes.py:4449] xindex is not in var_ranges, defaulting to unknown range.
In file included from /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/IListRef.h:631,
                 from /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/DeviceGuard.h:3,
                 from /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/ATen.h:9,
                 from /home/local/torchchat/exportedModels/ca5ydbysfhhoy7a5vyb5c26c642lglqngoqmpxtzrmq77e6kbqqx.cpp:443:
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/IListRef_inl.h: In static member function ‘static c10::detail::IListRefConstRef<at::OptionalTensorRef> c10::detail::IListRefTagImpl<c10::IListRefTag::Boxed, at::OptionalTensorRef>::iterator_get(const c10::List<std::optional<at::Tensor> >::const_iterator&)’:
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/IListRef_inl.h:171:17: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  171 |     const auto& ivalue = (*it).get();
      |                 ^~~~~~
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/IListRef_inl.h:171:35: note: the temporary was destroyed at the end of the full expression ‘(& it)->c10::impl::ListIterator<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<std::optional<at::Tensor>, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::get()’
  171 |     const auto& ivalue = (*it).get();
      |                          ~~~~~~~~~^~
In file included from /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/dispatch/OperatorEntry.h:12,
                 from /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/dispatch/Dispatcher.h:6,
                 from /home/local/torchchat/exportedModels/ca5ydbysfhhoy7a5vyb5c26c642lglqngoqmpxtzrmq77e6kbqqx.cpp:444:
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/dispatch/DispatchKeyExtractor.h: In lambda function:
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/dispatch/DispatchKeyExtractor.h:154:32: warning: possibly dangling reference to a temporary [-Wdangling-reference]
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                                ^~~~~~
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/include/ATen/core/dispatch/DispatchKeyExtractor.h:154:61: note: the temporary was destroyed at the end of the full expression ‘__for_begin .c10::impl::ListIterator<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator*().c10::impl::ListElementReference<at::Tensor, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue> > >::operator std::conditional_t<true, const at::Tensor&, at::Tensor>()’
  154 |         for (const at::Tensor& tensor : ivalue.toTensorList()) {
      |                                                             ^
The generated DSO model can be found at: /home/local/torchchat/exportedModels/llama3.1.so

real    2m2.058s
user    1m24.277s
sys     0m39.165s

When I try to run with the exported DSO model it gives an error:

 python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --prompt "Hello my name is"
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torchao/ops.py:12: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  return torch.library.impl_abstract(f"{name}")(func)
Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.4.0 available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cuda NVIDIA GeForce RTX 4090
Loading model...
Time to load model: 2.65 seconds
Error: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/local/torchchat/build/builder.py", line 468, in _initialize_model
    model.forward = torch._export.aot_load(
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/_export/__init__.py", line 425, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: create_func_( &container_handle_, num_models, device_str.c_str(), cubin_dir.empty() ? nullptr : cubin_dir.c_str()) API call failed at ../torch/csrc/inductor/aoti_runner/model_container_runner.cpp, line 49

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/local/torchchat/torchchat.py", line 88, in <module>
    generate_main(args)
  File "/home/local/torchchat/generate.py", line 838, in main
    gen = Generator(
          ^^^^^^^^^^
  File "/home/local/torchchat/generate.py", line 205, in __init__
    self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/local/torchchat/build/builder.py", line 472, in _initialize_model
    raise RuntimeError(f"Failed to load AOTI compiled {builder_args.dso_path}")
RuntimeError: Failed to load AOTI compiled exportedModels/llama3.1.so
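For what it's worth, the load step can presumably be isolated from torchchat in a couple of lines, since builder.py is just calling torch._export.aot_load (a minimal sketch; path and device are assumed from the commands above):

# Minimal sketch (assumed path/device): load the exported DSO directly to
# confirm the failure sits in the AOTI runner rather than in torchchat itself.
import torch
so_path = "exportedModels/llama3.1.so"             # DSO produced by the export step
forward = torch._export.aot_load(so_path, "cuda")  # same call builder.py makes; fails with the CUDA OOM here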

I tried the C++ runner as well but it fails to build:

❯ scripts/build_native.sh aoti
+ '[' 1 -eq 0 ']'
+ ((  1  ))
+ case "$1" in
+ echo 'Building aoti native runner...'
Building aoti native runner...
+ TARGET=aoti
+ shift
+ ((  0  ))
+ '[' -z '' ']'
+++ dirname scripts/build_native.sh
++ cd scripts
++ pwd -P
+ SCRIPT_PATH=/home/local/torchchat/scripts
++ dirname /home/local/torchchat/scripts
+ TORCHCHAT_ROOT=/home/local/torchchat
+ '[' -z '' ']'
+ ET_BUILD_DIR=et-build
+ source /home/local/torchchat/scripts/install_utils.sh
++ set -ex pipefail
++ COMMON_CMAKE_ARGS='    -DCMAKE_BUILD_TYPE=Release     -DEXECUTORCH_ENABLE_LOGGING=ON     -DEXECUTORCH_LOG_LEVEL=Info     -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON     -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON     -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON     -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON     -DEXECUTORCH_BUILD_XNNPACK=ON'
+ pushd /home/local/torchchat
~/torchchat ~/torchchat
+ git submodule update --init
Submodule 'tokenizer/third-party/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'tokenizer/third-party/abseil-cpp'
Submodule 'tokenizer/third-party/re2' (https://github.com/google/re2.git) registered for path 'tokenizer/third-party/re2'
Submodule 'tokenizer/third-party/sentencepiece' (https://github.com/google/sentencepiece.git) registered for path 'tokenizer/third-party/sentencepiece'
Cloning into '/home/local/torchchat/tokenizer/third-party/abseil-cpp'...
Cloning into '/home/local/torchchat/tokenizer/third-party/re2'...
Cloning into '/home/local/torchchat/tokenizer/third-party/sentencepiece'...
Submodule path 'tokenizer/third-party/abseil-cpp': checked out '854193071498f330b71083d7e06a7cd18e02a4cc'
Submodule path 'tokenizer/third-party/re2': checked out 'ac82d4f628a2045d89964ae11c48403d3b091af1'
Submodule path 'tokenizer/third-party/sentencepiece': checked out '7dcb541451b1862d73f473b3804ccf8f2a9e10f6'
+ git submodule sync
Synchronizing submodule url for 'tokenizer/third-party/abseil-cpp'
Synchronizing submodule url for 'tokenizer/third-party/re2'
Synchronizing submodule url for 'tokenizer/third-party/sentencepiece'
+ [[ aoti == \e\t ]]
+ popd
~/torchchat
+ [[ aoti == \e\t ]]
++ python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'
+ cmake -S . -B ./cmake-out -DCMAKE_PREFIX_PATH=/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/share/cmake -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 -G Ninja
-- The C compiler identification is GNU 14.1.1
-- The CXX compiler identification is GNU 14.1.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test ABSL_INTERNAL_AT_LEAST_CXX17
-- Performing Test ABSL_INTERNAL_AT_LEAST_CXX17 - Success
-- Performing Test ABSL_INTERNAL_AT_LEAST_CXX20
-- Performing Test ABSL_INTERNAL_AT_LEAST_CXX20 - Failed
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Warning at tokenizer/third-party/abseil-cpp/CMakeLists.txt:193 (message):
    The default and system-level install directories are unsupported except in LTS   releases of Abseil.  Please set CMAKE_INSTALL_PREFIX to install Abseil in your   source or build tree directly.

CMake Deprecation Warning at tokenizer/third-party/sentencepiece/CMakeLists.txt:15 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- VERSION: 0.2.1
-- Found TCMalloc: /usr/lib/libtcmalloc_minimal.so
-- Using ET BUILD DIR: --[et-build]--
-- TORCHCHAT_ROOT="/home/local/torchchat"
-- Looking for excutorch in /home/local/torchchat/et-build/install
-- Could NOT find executorch (missing: executorch_DIR)
CMake Warning at runner/et.cmake:130 (MESSAGE):
  ExecuTorch package not found
Call Stack (most recent call first):
  CMakeLists.txt:15 (include)

CMake Warning (dev) at runner/aoti.cmake:16 (find_package):
  Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
  --help-policy CMP0146" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

Call Stack (most recent call first):
  CMakeLists.txt:21 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /opt/cuda (found version "12.5")
-- Found CUDA: /opt/cuda (found version "12.5")
-- The CUDA compiler identification is NVIDIA 12.5.82
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /opt/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /opt/cuda/include (found version "12.5.82")
-- Caffe2: CUDA detected: 12.5
-- Caffe2: CUDA nvcc is: /opt/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /opt/cuda
-- Caffe2: Header version is: 12.5
-- /opt/cuda/lib/libnvrtc.so shorthash is a50b0e02
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- Autodetected CUDA architecture(s):  8.9 8.6
-- Added CUDA NVCC flags for: -gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_86,code=sm_86
CMake Warning at /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:120 (append_torchlib_if_found)
  runner/aoti.cmake:18 (find_package)
  CMakeLists.txt:21 (include)

-- Found Torch: /home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torch/lib/libtorch.so (Required is at least version "2.4.0")
-- Configuring done (4.2s)
-- Generating done (0.1s)
-- Build files have been written to: /home/local/torchchat/cmake-out
+ cmake --build ./cmake-out --target aoti_run
[63/222] Building CXX object tokenizer/CMakeFiles/tokenizer.dir/tiktoken.cpp.o
FAILED: tokenizer/CMakeFiles/tokenizer.dir/tiktoken.cpp.o
/usr/bin/c++  -I/home/local/torchchat/tokenizer -I/home/local/torchchat/tokenizer/third-party/sentencepiece/src -I/home/local/torchchat/tokenizer/third-party/re2 -I/home/local/torchchat/tokenizer/third-party/abseil-cpp -D_GLIBCXX_USE_CXX11_ABI=0 -MD -MT tokenizer/CMakeFiles/tokenizer.dir/tiktoken.cpp.o -MF tokenizer/CMakeFiles/tokenizer.dir/tiktoken.cpp.o.d -o tokenizer/CMakeFiles/tokenizer.dir/tiktoken.cpp.o -c /home/local/torchchat/tokenizer/tiktoken.cpp
In file included from /home/local/torchchat/tokenizer/tiktoken.cpp:18:
/home/local/torchchat/tokenizer/base64.h:37:11: error: ‘uint32_t’ does not name a type
   37 | constexpr uint32_t DECODE_TABLE[] = {
      |           ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:29:1: note: ‘uint32_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
   28 | #include <string>
  +++ |+#include <cstdint>
   29 | #include <string_view>
/home/local/torchchat/tokenizer/base64.h:57:13: error: variable or field ‘validate’ declared void
   57 | inline void validate(uint32_t v) {
      |             ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:57:22: error: ‘uint32_t’ was not declared in this scope
   57 | inline void validate(uint32_t v) {
      |                      ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:57:22: note: ‘uint32_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h: In function ‘void base64::detail::decode(const std::string_view&, std::string&)’:
/home/local/torchchat/tokenizer/base64.h:70:3: error: ‘uint32_t’ was not declared in this scope
   70 |   uint32_t val = 0;
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:70:3: note: ‘uint32_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:72:3: error: ‘uint8_t’ was not declared in this scope
   72 |   uint8_t c = input[0];
      |   ^~~~~~~
/home/local/torchchat/tokenizer/base64.h:72:3: note: ‘uint8_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:73:12: error: ‘DECODE_TABLE’ was not declared in this scope
   73 |   auto v = DECODE_TABLE[c];
      |            ^~~~~~~~~~~~
/home/local/torchchat/tokenizer/base64.h:73:25: error: ‘c’ was not declared in this scope
   73 |   auto v = DECODE_TABLE[c];
      |                         ^
/home/local/torchchat/tokenizer/base64.h:74:3: error: ‘validate’ was not declared in this scope
   74 |   validate(v);
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:75:3: error: ‘val’ was not declared in this scope
   75 |   val = v;
      |   ^~~
/home/local/torchchat/tokenizer/base64.h: In function ‘void base64::detail::decode_1_padding(const std::string_view&, std::string&)’:
/home/local/torchchat/tokenizer/base64.h:105:3: error: ‘uint32_t’ was not declared in this scope
  105 |   uint32_t val = 0;
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:105:3: note: ‘uint32_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:107:3: error: ‘uint8_t’ was not declared in this scope
  107 |   uint8_t c = input[0];
      |   ^~~~~~~
/home/local/torchchat/tokenizer/base64.h:107:3: note: ‘uint8_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:108:12: error: ‘DECODE_TABLE’ was not declared in this scope
  108 |   auto v = DECODE_TABLE[c];
      |            ^~~~~~~~~~~~
/home/local/torchchat/tokenizer/base64.h:108:25: error: ‘c’ was not declared in this scope
  108 |   auto v = DECODE_TABLE[c];
      |                         ^
/home/local/torchchat/tokenizer/base64.h:109:3: error: ‘validate’ was not declared in this scope
  109 |   validate(v);
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:110:3: error: ‘val’ was not declared in this scope
  110 |   val = v;
      |   ^~~
/home/local/torchchat/tokenizer/base64.h: In function ‘void base64::detail::decode_2_padding(const std::string_view&, std::string&)’:
/home/local/torchchat/tokenizer/base64.h:131:3: error: ‘uint32_t’ was not declared in this scope
  131 |   uint32_t val = 0;
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:131:3: note: ‘uint32_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:133:3: error: ‘uint8_t’ was not declared in this scope
  133 |   uint8_t c = input[0];
      |   ^~~~~~~
/home/local/torchchat/tokenizer/base64.h:133:3: note: ‘uint8_t’ is defined in header ‘<cstdint>’; this is probably fixable by adding ‘#include <cstdint>’
/home/local/torchchat/tokenizer/base64.h:134:12: error: ‘DECODE_TABLE’ was not declared in this scope
  134 |   auto v = DECODE_TABLE[c];
      |            ^~~~~~~~~~~~
/home/local/torchchat/tokenizer/base64.h:134:25: error: ‘c’ was not declared in this scope
  134 |   auto v = DECODE_TABLE[c];
      |                         ^
/home/local/torchchat/tokenizer/base64.h:135:3: error: ‘validate’ was not declared in this scope
  135 |   validate(v);
      |   ^~~~~~~~
/home/local/torchchat/tokenizer/base64.h:136:3: error: ‘val’ was not declared in this scope
  136 |   val = v;
      |   ^~~
[96/222] Building CXX object CMakeFiles/aoti_run.dir/runner/run.cpp.o
ninja: build stopped: subcommand failed.
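The tokenizer failure above looks like the usual GCC 13+/14 breakage where libstdc++ headers no longer pull in <cstdint> transitively; as the compiler notes themselves suggest, adding #include <cstdint> near the top of tokenizer/base64.h should presumably clear the uint32_t/uint8_t errors. That build problem seems separate from the DSO load failure.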

Versions

Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 14.1.1 20240720
Clang version: 18.1.8
CMake version: version 3.30.1
Libc version: glibc-2.40

Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.10.0-arch1-2-x86_64-with-glibc2.40
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 555.58.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.9.2.1
/usr/lib/libcudnn_adv.so.9.2.1
/usr/lib/libcudnn_cnn.so.9.2.1
/usr/lib/libcudnn_engines_precompiled.so.9.2.1
/usr/lib/libcudnn_engines_runtime_compiled.so.9.2.1
/usr/lib/libcudnn_graph.so.9.2.1
/usr/lib/libcudnn_heuristic.so.9.2.1
/usr/lib/libcudnn_ops.so.9.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 9 5950X 16-Core Processor
CPU family:                           25
Model:                                33
Thread(s) per core:                   2
Core(s) per socket:                   16
Socket(s):                            1
Stepping:                             0
Frequency boost:                      enabled
CPU(s) scaling MHz:                   69%
CPU max MHz:                          5083.3979
CPU min MHz:                          2200.0000
BogoMIPS:                             6802.30
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
L1d cache:                            512 KiB (16 instances)
L1i cache:                            512 KiB (16 instances)
L2 cache:                             8 MiB (16 instances)
L3 cache:                             64 MiB (2 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.4.0
[pip3] torchao==0.3.1
[pip3] torchaudio==2.4.0
[pip3] torchvideo==0.0.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-triton            3.0.0+dedb7bdf33          pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchao                   0.3.1                    pypi_0    pypi
[conda] torchaudio                2.4.0                    pypi_0    pypi
[conda] torchvideo                0.0.0                    pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
Jack-Khuu commented 1 month ago

Thanks for testing out the repo @lhl!

Looks like we're hitting an "Error: CUDA error: out of memory" here.

Can you check exporting/generating with the stories15M model to verify that the behavior itself is working?

lhl commented 1 month ago

Looks like stories15M works:

❯  python3 torchchat.py generate stories15M --dso-path exportedModels/stories15M.so --prompt "Hello my name is"
/home/local/.conda/envs/torchchat/lib/python3.11/site-packages/torchao/ops.py:12: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  return torch.library.impl_abstract(f"{name}")(func)
Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
NumExpr defaulting to 16 threads.
PyTorch version 2.4.0 available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cuda NVIDIA GeForce RTX 4090
Loading model...
Time to load model: 0.17 seconds
-----------------------------------------------------------
Hello my name is Billy. He is three years old and very curious. He likes to explore new places.
One day, he was walking in the forest when he saw a big, scary bear. He was so scared he wanted to run away, but he couldn't move. Suddenly, he remembered his grandmother's advice, "If you ever be scared, just blink your eyes in life."
Billy blinked his eyes again and the bear blinked back in surprise. The bear started to walk away, but Billy was still scared.
Suddenly, he remembered what his grandmother had said: "If you blink a little bit, the bear won't be mean, but the most important thing is to keep exploring."
Billy knew he had to be brave, so he blinked his eyes. To his surprise, the bear was just a big, friendly bear! It had been
Time for inference 1: 0.72 sec total, time to first token 0.15 sec with sequential prefill, 199 tokens, 278.05 tokens/sec, 3.60 ms/token
Bandwidth achieved: 13.57 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

Average tokens/sec: 278.05
Memory used: 0.10 GB

(scripts/build_native.sh aoti still fails, but that looks like a different bug.)

The 4090 has the full 24 GB of VRAM, so it should have no problem fitting an 8B model. It just occurred to me that the issue might be down to Llama 3.1: when compiled, it may allocate a KV cache for the full 128K context (see the rough sizing sketch below). Limiting output with --max-new-tokens 2048 still results in the CUDA OOM, so maybe there needs to be an option for specifying a context/token limit for the compiled model.
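Rough sizing sketch (my own estimate, assuming Llama 3.1 8B's published config: 32 layers, 8 KV heads, head_dim 128, 131072 max context, bf16 cache):

# Editorial back-of-the-envelope estimate, not torchchat code.
n_layers, n_kv_heads, head_dim, seq_len, bytes_per = 32, 8, 128, 131072, 2
kv_cache_gib = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1024**3  # K and V planes
weights_gib = 8e9 * 2 / 1024**3                                                      # ~8B params in bf16
print(f"KV cache ~{kv_cache_gib:.1f} GiB + weights ~{weights_gib:.1f} GiB")          # ~16.0 + ~14.9 GiB > 24 GiB

If that math is roughly right, a full-context bf16 cache plus the bf16 weights would not fit in 24 GB, and since the cache is presumably sized at export time, capping --max-new-tokens at generation time would not shrink it.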

BTW, speaking of --compile: I get the following when I run in torch compile mode and try to generate:

W0804 22:57:20.187000 140149025421120 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] xindex is not in var_ranges, defaulting to unknown range.
(stalls after generating some tokens)
W0804 22:58:16.894000 140149025421120 torch/fx/experimental/symbolic_shapes.py:4449] [0/1] xindex is not in var_ranges, defaulting to unknown range.

--compile-prefill does not produce these warnings (but is no faster than not compiling at all):

Time for inference 1: 4.95 sec total, time to first token 0.26 sec with parallel prefill, 199 tokens, 40.17 tokens/sec, 24.89 ms/token
Bandwidth achieved: 645.19 GB/s
...
Average tokens/sec: 40.17
Memory used: 16.30 GB

I have work/deadlines/travel, so I won't be able to follow up much further. I'm assuming anyone doing basic testing will probably run into similar issues; my config (a clean mamba env on a 4090) seems about as vanilla a setup as possible.

sunshinesfbay commented 1 month ago

I had the same C++ runner issue building the runner for ET/PTE models in #985.