mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

repeated build failure #77

Open juanmf opened 10 months ago

juanmf commented 10 months ago

The build has failed several times.

1st (solved?): https://github.com/mistralai/mistral-src/issues/76

2nd (seems to be a network issue):

 => [4/8] RUN pip3 install "torch>=2.0.0"                                                                                                                                                            190.5s
 => [5/8] RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d'  2139.1s 
 => => # [2/14] c++ -MMD -MF /workspace/apex/build/temp.linux-x86_64-3.10/csrc/amp_C_frontend.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Wer 
 => => # ror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/u 
 => => # sr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC  
 => => # -I/usr/local/cuda/include -I/usr/include/python3.10 -c -c /workspace/apex/csrc/amp_C_frontend.cpp -o /workspace/apex/build/temp.linux-x86_64-3.10/csrc/amp_C_frontend.o -O3 -DVERSION_GE_1_1 -DVER 
 => => # SION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C  
 => => # -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17                                                                                                                                                             
ERROR: failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF

3rd (Error compiling objects):

 => CACHED [4/8] RUN pip3 install "torch>=2.0.0"                                                                                                                                                       0.0s
 => [5/8] RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d'  6578.5s
 => => #     _write_ninja_file_and_compile_objects(                                                                                                                                                        
 => => #   File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects                                                                
 => => #     _run_ninja_build(                                                                                                                                                                             
 => => #   File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build                                                                                     
 => => #     raise RuntimeError(message) from e                                                                                                                                                            
 => => # RuntimeError: Error compiling objects for extension  

4th (Killed signal terminated program cc1plus):

 => CACHED [4/8] RUN pip3 install "torch>=2.0.0"                                                                                                                                                       0.0s
 => [5/8] RUN git clone https://github.com/NVIDIA/apex &&     cd apex && git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82 &&     sed -i '/check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)/d  10182.9s
 => => # l/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/
 => => # local/cuda/include -I/usr/include/python3.10 -c -c /workspace/apex/csrc/amp_C_frontend.cpp -o /workspace/apex/build/temp.linux-x86_64-3.10/csrc/amp_C_frontend.o -O3 -DVERSION_GE_1_1 -DVERSION_GE
 => => # _1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=amp_C -D_GLIB
 => => # CXX_USE_CXX11_ABI=0 -std=c++17                                                                                                                                                                    
 => => # c++: fatal error: Killed signal terminated program cc1plus                                                                                                                                        
 => => # compilation terminated.                                                                                                                                                                           
ERROR: failed to solve: Canceled: context canceled

It fails at random, at different stages, after a long time compiling. Note that I pressed Ctrl+C to return to the prompt after reading the messages above.

Any help would be appreciated. Thanks.