guillaume-be / rust-bert

Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...)
https://docs.rs/crate/rust-bert
Apache License 2.0

Failure to load sentence transformer model inside a docker container (M2 Mac works as expected) #386

Closed: ramarnat closed this issue 1 year ago

ramarnat commented 1 year ago

For some reason I am unable to load a sentence transformer model in Docker. It works fine running directly on my M2 Mac, but fails inside a Docker container. I initially thought it was the libtorch installation, but the same issue persists even when libtorch is built from scratch. The issue also happens on an Intel Mac running Docker.

Hopefully I have provided enough here to spot or recreate the issue. Maybe it's something obvious; if not, I can create a repo. Any ideas or thoughts would be appreciated.

On the Mac it works as expected:

RUST_LOG=debug cargo run -- --watch
   Compiling rust_bert_vector_api v0.1.0 
    Finished dev [unoptimized + debuginfo] target(s) in 1.86s
     Running `target/debug/rust_bert_vector_api --watch`
[2023-06-01T03:07:34Z INFO  rust_bert_vector_api] creating model
[2023-06-01T03:07:34Z INFO  rust_bert_vector_api] Model Loaded
[2023-06-01T03:07:34Z INFO  rust_bert_vector_api] finished creating model

main.rs

use anyhow::{Error, Result};
use log::{error, info};
use rust_bert::pipelines::sentence_embeddings::SentenceEmbeddingsBuilder;

fn main() -> Result<(), Error> {
    env_logger::init();
    info!("creating model");
    // Point the builder at the locally converted model; use CUDA when available,
    // otherwise fall back to CPU.
    let builder = SentenceEmbeddingsBuilder::local("resources/multi-qa-MiniLM-L6-cos-v1")
        .with_device(tch::Device::cuda_if_available());
    info!("Model Loaded"); // note: this logs before create_model() actually runs
    let model_result = builder.create_model();

    let _model = match model_result {
        Ok(model) => {
            info!("finished creating model");
            model
        }
        Err(rust_bert_error) => {
            error!("Error creating model: {:?}", rust_bert_error);
            return Ok(());
        }
    };

    Ok(())
}

Cargo.toml

[package]
name = "rust_bert_vector_api"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rust-bert = { git = "https://github.com/guillaume-be/rust-bert" }
tch = "0.13.0"
log = "0.4"
env_logger = "0.8"
anyhow = "1"

[profile.release]
panic = 'abort'
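
Note that tch 0.13 is built against libtorch 2.0, so it lines up with the v2.0.0 PyTorch checkout in the Dockerfile below.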

Dockerfile

FROM rust:latest as build

WORKDIR /build
RUN apt-get update && apt-get install -y \
        git \
        git-lfs \
        caffe \
        build-essential \
        gcc \
        cmake  \
        python \
        pip && \
    mkdir resources

RUN git clone --depth 1 -b v2.0.0 --recurse-submodules https://github.com/pytorch/pytorch.git

RUN pip3 install --no-cache-dir -r pytorch/requirements.txt && \
    mkdir pytorch-build && \
    cd pytorch-build && \
    cmake -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_BUILD_TYPE:STRING=Release -DPYTHON_EXECUTABLE:PATH=`which python3` -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch && \
    cmake --build . --target install
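
The BUILD_SHARED_LIBS:BOOL=ON flag is what produces the libtorch .so files that the tch crate links against, and that LD_LIBRARY_PATH points at in the next stage.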

FROM rust:latest as rust-build

COPY --from=build /build/pytorch-install /opt/pytorch
ENV LIBTORCH=/opt/pytorch
ENV LD_LIBRARY_PATH=/opt/pytorch/lib:/usr/lib/aarch64-linux-gnu/:$LD_LIBRARY_PATH
RUN USER=root cargo new --bin app
WORKDIR /app

RUN apt-get update && apt-get install -y \
        git \
        git-lfs \
        caffe \
        python \
        pip && \
    mkdir resources

# This is the same utils directory from the rust-bert repo
COPY utils utils
# copy over the manifest
COPY ./Cargo.toml ./Cargo.toml

RUN git -C resources clone https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1 && \
    pip3 install --no-cache-dir -r utils/requirements.txt
RUN python3 ./utils/convert_model.py resources/multi-qa-MiniLM-L6-cos-v1/pytorch_model.bin

ENV RUST_LOG=info
ENV RUST_BACKTRACE=full
COPY ./src ./src
RUN cargo build --release && cp ./target/release/rust_bert* .
CMD ["./rust_bert_vector_api"]

The failure

[2023-06-01T02:38:11Z INFO  rust_bert_vector_api] creating model
[2023-06-01T02:38:11Z INFO  rust_bert_vector_api] Model Loaded
[2023-06-01T02:38:11Z ERROR rust_bert_vector_api] Error creating model: <snip; error reformatted below>
Internal torch error: open file failed because of errno 2 on fopen: , file path: resources/multi-qa-MiniLM-L6-cos-v1/rust_model.ot
Exception raised from RAIIFile at /build/pytorch/caffe2/serialize/file_adapter.cc:21 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x84 (0xffffb8b83e44 in /opt/pytorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xffffb8b4e528 in /opt/pytorch/lib/libc10.so)
frame #2: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x128 (0xffffbb5e40b8 in /opt/pytorch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x34 (0xffffbb5e4134 in /opt/pytorch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x84 (0xffffbb5e2fd4 in /opt/pytorch/lib/libtorch_cpu.so)
frame #5: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) + 0x20c (0xffffbc4c57ac in /opt/pytorch/lib/libtorch_cpu.so)
frame #6: torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) + 0x6c (0xffffbc4c5adc in /opt/pytorch/lib/libtorch_cpu.so)
frame #7: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) + 0xbc (0xffffbc4c5c00 in /opt/pytorch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x382b28 (0xaaaad4d02b28 in ./rust_bert_vector_api)
frame #9: <unknown function> + 0x1994a0 (0xaaaad4b194a0 in ./rust_bert_vector_api)
frame #10: <unknown function> + 0x19d304 (0xaaaad4b1d304 in ./rust_bert_vector_api)
frame #11: <unknown function> + 0x192c84 (0xaaaad4b12c84 in ./rust_bert_vector_api)
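
For reference, errno 2 is ENOENT: the fopen failed because resources/multi-qa-MiniLM-L6-cos-v1/rust_model.ot does not exist inside the image.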
ramarnat commented 1 year ago

OK, I didn't check the obvious thing: whether the rust_model.ot file was actually created. It was not, yet the convert_model step did not fail. One of the failures during testing yesterday was that the conversion failed because libtorch_cpu.so was missing ABI symbols, which is why I built libtorch from scratch. That did not show up as an error after the build, so I didn't think to check the .ot file.

Anyway I am rechecking that step now to see what might be causing that behavior and will report back shortly.
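
A cheap guard in the Rust code makes this class of failure obvious: verify that the converted weights exist before handing the directory to the builder. A minimal sketch (the helper name is illustrative; the path is the same one main.rs uses):

use std::path::Path;

use anyhow::bail;

// Illustrative guard: check for the converted weights up front so a missing
// rust_model.ot fails with a clear message instead of libtorch's "errno 2 on fopen".
fn ensure_converted_weights(model_dir: &str) -> anyhow::Result<()> {
    let weights = Path::new(model_dir).join("rust_model.ot");
    if !weights.is_file() {
        bail!("converted weights not found: {}", weights.display());
    }
    Ok(())
}

Calling ensure_converted_weights("resources/multi-qa-MiniLM-L6-cos-v1")? before building the model would have pointed straight at the missing file.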

ramarnat commented 1 year ago

This now works. My updated Docker section:

ENV CARGO_HOME=/opt/cargo
ENV PATH=/opt/cargo/bin:$PATH
RUN git -C resources clone https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1 && \
    rm -rf resources/multi-qa-MiniLM-L6-cos-v1/.git
RUN git clone https://github.com/guillaume-be/rust-bert.git && \
    cd rust-bert && \
    pip3 install --no-cache-dir -r requirements.txt

# 1. Convert the model to Rust
# 2. Make sure the Rust model was created
# 3. Remove the Python models to save space
RUN python3 rust-bert/utils/convert_model.py resources/multi-qa-MiniLM-L6-cos-v1/pytorch_model.bin && \
     [ ! -f resources/multi-qa-MiniLM-L6-cos-v1/rust_model.ot ] && exit 1 || \
    rm resources/multi-qa-MiniLM-L6-cos-v1/model.npz && \
    rm resources/multi-qa-MiniLM-L6-cos-v1/pytorch_model.bin && \
    rm resources/multi-qa-MiniLM-L6-cos-v1/tf_model.h5
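
The [ ! -f ... ] && exit 1 || ... guard is the key change: if convert_model.py exits cleanly but never writes rust_model.ot, the image build now fails immediately instead of producing a container that dies at runtime.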

I guess my only request would be for convert_model.py to fail out if convert_tensor fails.

linkedlist771 commented 5 months ago

This is what I have been looking for, thank you!