Open rickycorte opened 3 years ago
Could you verify this is not a TensorFlow issue? It's extremely unlikely for TFRS to cause this as it has no compiled code.
I've tested my TensorFlow installation by running other models with no issues. Investigating further, I've discovered that the crash probably originates from the scann library, which is loaded on import. I installed scann because I was following this guide to play around and test things: https://www.tensorflow.org/recommenders/examples/basic_retrieval
By setting PYTHONFAULTHANDLER="1" I got this stack trace: crash_log. I believe that at some point while importing scann, a load fails due to a missing low-level library or maybe a missing symbol. I've also tried to compile scann manually, without success. Uninstalling scann solves the crashes. Maybe in the next few days I'll try to compile TensorFlow and then scann and see if it still crashes.
Thanks!
I'm guessing this is because the ScaNN wheels aren't built for Python 3.8. @sammymax does that ring a bell?
I've also tried with Anaconda and Python 3.7, but I still get the same result: scann_crash.txt
Hey thanks for reporting this bug! Can you provide the CPU model and operating system you're using? This kind of crash generally comes when the CPU tries to execute some vectorized instruction (AVX2, AVX, etc.) that the CPU in fact doesn't support. I'll be able to reproduce the issue a lot more easily once I know your OS and CPU details.
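For anyone who wants to check their own machine before replying with CPU details, here is a minimal sketch that lists which vector extensions the CPU advertises. It assumes x86 Linux, where /proc/cpuinfo has a "flags" line; the function names parse_cpu_flags and report_simd_support are made up for this example and are not part of ScaNN or TensorFlow.

```python
from pathlib import Path
from typing import Dict, Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example on a non-x86 kernel).
    return set()


def report_simd_support(cpuinfo_path: str = "/proc/cpuinfo") -> Dict[str, bool]:
    """Report whether common SIMD extensions are advertised by the CPU (x86 Linux only)."""
    flags = parse_cpu_flags(Path(cpuinfo_path).read_text())
    return {name: name in flags for name in ("sse4_2", "avx", "avx2", "fma", "avx512f")}


# Example (on x86 Linux):
#     for name, supported in report_simd_support().items():
#         print(name, "yes" if supported else "no")
```

If "avx2" shows up as unsupported, a binary compiled with AVX2 enabled can die with SIGILL as soon as it executes such an instruction.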
I'm using an i7-3770K, which should support AVX according to the Intel page. I'm running Ubuntu 20.04.2 LTS.
I've been trying to reproduce this issue but I haven't been able to. I'm also using Ubuntu 20.04.2 LTS with an Ivy Bridge-era CPU (AVX but not AVX2 support). Are you using the system Python 3.8 or one from somewhere else (like pyenv)?
I'm using the system Python for version 3.8. I've also tried Python 3.7 with conda and still have the same kind of issue.
I installed Conda with Python 3.7 and I also couldn't reproduce. I installed Anaconda 2021.05 from here and then did
conda create -n py37 python=3.7 anaconda
conda activate py37
pip install scann
and the import worked fine, and ScaNN managed to train and search ok too. This was all done on an Ivy Bridge CPU that should be very similar to your i7-3770K.
I got the issue by running this to create the environment:
conda create -n py37 python=3.7 anaconda
conda activate py37
pip install tensorflow
pip install tensorflow-recommenders
pip install scann
I run export PYTHONFAULTHANDLER="1" to see the crash stack trace.
Now run python and type import tensorflow_recommenders. This way I get a crash whose stack trace is similar to the ones I've posted before.
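The same check can be scripted so that the crash does not take down the shell session. This is a rough sketch, assuming a POSIX system: it imports the module in a child interpreter with the fault handler enabled (the -X faulthandler option is the command-line equivalent of PYTHONFAULTHANDLER) and reports how the child exited. The helper name try_import is made up for this example.

```python
import signal
import subprocess
import sys


def try_import(module_name: str) -> str:
    """Import a module in a child interpreter and describe how the child exited."""
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX a negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-result.returncode).name
    # A positive code is an ordinary failure, such as ModuleNotFoundError.
    return f"failed with exit code {result.returncode}"


# Example:
#     print(try_import("scann"))
#     print(try_import("tensorflow_recommenders"))
```

On an affected machine one would expect the result to name SIGILL for the module that loads the offending shared library; the fault handler traceback itself ends up in the child's stderr.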
If I try to import scann directly:
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scann
2021-07-15 23:42:32.395084: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Fatal Python error: Illegal instruction
Current thread 0x00007fa464554740 (most recent call first):
File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 58 in load_op_library
File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/scann_ops/py/scann_ops.py", line 26 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1035 in _handle_fromlist
File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/__init__.py", line 2 in <module>
File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 728 in exec_module
File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 983 in _find_and_load
File "<stdin>", line 1 in <module>
Illegal instruction (core dumped)
If I run pip uninstall scann and retry importing tensorflow_recommenders, everything works fine.
Found existing installation: scann 1.2.2
Uninstalling scann-1.2.2:
Would remove:
/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann-1.2.2.dist-info/*
/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/*
Proceed (y/n)? y
Successfully uninstalled scann-1.2.2
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_recommenders
2021-07-15 23:44:38.629808: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>>
Edit: I tried installing only scann as you did and I got the same error. I'm starting to think that maybe there is something wrong with my CUDA installation outside of conda. I'll try it on a virtual machine with a clean Ubuntu installation without any NVIDIA libraries.
Edit 2: Tried on a freshly created VM using native Python 3.8 and no NVIDIA CUDA, but the issue persists.
Edit 3: Tried on a new VM running on top of Windows 10 and still the same (no CUDA on host or guest). At this point I guess it's some kind of issue with my machine that is not easily reproducible.
I think I'm seeing the crash in scann:
Thread 1 "python" received signal SIGILL, Illegal instruction.
0x00007fff183f910a in google::protobuf::FieldDescriptorProto::FieldDescriptorProto() () from .../python3.10/site-packages/scann/scann_ops/cc/_scann_ops.so
0x00007fff183f9105 <+53>: vmovq %rax,%xmm0
=> 0x00007fff183f910a <+58>: vpbroadcastq %xmm0,%ymm0
0x00007fff183f910f <+63>: vmovdqu %ymm0,0x18(%rbx)
I'm pretty sure VPBROADCAST from xmm to ymm is an AVX512 instruction, which my CPU (Sandybridge) doesn't have.
Thanks for debugging. I think the issue is that vpbroadcastq is an AVX2 instruction, which Sandy Bridge doesn't support. We will look into compiling the ScaNN wheels for the next release without the -mavx2 flag so that this issue is resolved. You can try compiling ScaNN yourself without that flag in the meantime to see if that fixes the problem.
Nice, thanks! I was able to build scann without AVX2 using an older version of bazel :)
@emikulic could you post the steps to build scann without AVX2 using an older version of bazel? We are having trouble installing old versions of bazel.
Thanks
@sammymax a docker file for the build environment would be nice to have
Here's a related Dockerfile that might help; it compiles a version of TensorFlow Serving linked against ScaNN: https://github.com/google-research/google-research/blob/master/scann/tf_serving/Dockerfile.devel
What problems have you encountered with old versions of Bazel?
This worked for me:
git clone git@github.com:google-research/google-research.git --depth=1
cd google-research/scann/
python configure.py
# get https://github.com/bazelbuild/bazelisk/releases/download/v1.12.0/bazelisk-linux-amd64
# install as "bazel"
echo 3.7.2 > .bazelversion
# note: -march=native instead of -mavx2:
CC=clang bazel build -c opt --features=thin_lto --copt=-march=native --cxxopt="-std=c++17" --copt=-fsized-deallocation --copt=-w :build_pip_pkg
./bazel-bin/build_pip_pkg
# produces scann-1.2.7-cp310-cp310-linux_x86_64.whl which you can "pip install"
@emikulic @sammymax Thank you, we have compiled scann-1.2.7 successfully.
However, export of a trained TF Lite model failed (TF 2.9.1 and scann-1.2.7). Export worked on Colab (TF 2.8.2 and scann-1.2.6)
@sammymax Is it possible to check out scann-1.2.6 branch from the repo?
As for Bazel, the sysadmin installed a 5.x version but couldn't downgrade it for some reason.
Were you able to use bazelisk to get an older version of bazel?
Yes, the sysadmin managed to install an older version of bazel.
ScaNN 1.2.8 was recently released and doesn't assume AVX2 support; we now compile with -mavx rather than -mavx2, and do runtime dispatch to AVX2, when supported, for the important routines. Hopefully this helps.
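For readers unfamiliar with the term, runtime dispatch means choosing an implementation when the program runs, based on what the CPU reports, instead of fixing it at compile time. The sketch below only illustrates the idea in Python; it is not ScaNN's code, and dot_baseline, dot_avx2 and select_kernel are hypothetical names. In a native library the two functions would be separate compiled variants of the same routine.

```python
from typing import Callable, Sequence, Set

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def dot_baseline(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for a routine built with only the baseline instruction set."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for the same routine built with AVX2 enabled."""
    return sum(x * y for x, y in zip(a, b))


def select_kernel(cpu_flags: Set[str]) -> Kernel:
    """Pick the AVX2 variant only when the CPU advertises AVX2 support."""
    return dot_avx2 if "avx2" in cpu_flags else dot_baseline


# Example: an AVX-only CPU gets the baseline variant and never runs AVX2 code.
#     kernel = select_kernel({"sse4_2", "avx"})
#     kernel([1.0, 2.0], [3.0, 4.0])
```

The point of the pattern is that the AVX2 code path still exists in the binary, but it is only reached after the check, so older CPUs never execute it.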
This is a bit more of a help for others that encounter my same error. Running tensorflow_recommenders with TensorFlow 2.5 on Python 3.8 kills the interpreter with Illegal instruction (core dumped) when running import tensorflow_recommenders. I wasted a bit of time and came up with a solution: match your environment exactly with the one running on Colab. In this particular case, using Python 3.7 fixes the issue.
I'd suggest clearly stating the supported versions in the guides and the README of this repository (and maybe also in the setup.py, which stops at 3.6).
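To compare a local environment against Colab, it may help to print the same version report in both places and diff the output. A minimal sketch, assuming Python 3.8+ for importlib.metadata (on 3.7 it falls back to the importlib_metadata backport, which must be installed separately); the helper name installed_versions is made up for this example.

```python
import platform
from typing import Dict, Iterable, Optional

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport package for Python 3.7


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map 'python' and each package name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example:
#     print(installed_versions(["tensorflow", "tensorflow-recommenders", "scann"]))
```

This reads package metadata only, so it does not import scann or TensorFlow and cannot trigger the crash itself.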