tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Python 3.8 illegal instruction #328

Open rickycorte opened 3 years ago

rickycorte commented 3 years ago

This is more of a help for others who hit the same error. Running tensorflow_recommenders with TensorFlow 2.5 on Python 3.8 kills the interpreter with an Illegal instruction (core dumped) error when running import tensorflow_recommenders.

I wasted a bit of time and came up with a solution: match your environment exactly to the one running on Colab. In this particular case, using Python 3.7 fixes the issue.

I'd suggest stating the supported versions clearly in the guides and the README of this repository (and maybe also in the setup.py, which stops at 3.6).

maciejkula commented 3 years ago

Could you verify this is not a TensorFlow issue? It's extremely unlikely for TFRS to cause this as it has no compiled code.

rickycorte commented 3 years ago

I've tested my TensorFlow installation by running other models with no issues. Investigating further, I found that the crash probably originates in the scann library, which is loaded on import. I installed scann because I was following this guide to play around and test things: https://www.tensorflow.org/recommenders/examples/basic_retrieval

By setting PYTHONFAULTHANDLER="1" I got this stack trace: crash_log. I believe that at some point while scann is being imported, a load fails due to a missing low-level library or maybe a missing symbol. I've also tried compiling scann manually, without success. Uninstalling scann stops the crashes. Maybe in the next few days I'll try to compile TensorFlow and then scann and see if it still crashes.
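
For anyone who wants to narrow down which import triggers the crash without losing their own session, here is a minimal sketch that runs each import in a child interpreter with the fault handler enabled (`-X faulthandler` has the same effect as setting PYTHONFAULTHANDLER). The helper names `probe_import` and `describe_exit` are made up for this example and are not part of any library.

```python
import signal
import subprocess
import sys


def probe_import(module_name, timeout=120):
    """Import a module in a child interpreter and report how the child exited."""
    result = subprocess.run(
        # -X faulthandler dumps the Python traceback to stderr on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


def describe_exit(returncode):
    """Translate a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

Calling `probe_import("scann")` and passing the return code to `describe_exit` should separate a hard crash (killed by a signal such as SIGILL) from an ordinary import error (exit status 1), and the captured stderr holds the fault handler's traceback.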

maciejkula commented 3 years ago

Thanks!

I'm guessing this is because the ScaNN wheels aren't built for Python 3.8. @sammymax does that ring a bell?

rickycorte commented 3 years ago

I've also tried with Anaconda and Python 3.7 but I still get the same result: scann_crash.txt

sammymax commented 3 years ago

Hey thanks for reporting this bug! Can you provide the CPU model and operating system you're using? This kind of crash generally comes when the CPU tries to execute some vectorized instruction (AVX2, AVX, etc.) that the CPU in fact doesn't support. I'll be able to reproduce the issue a lot more easily once I know your OS and CPU details.
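
As a quick way to gather the CPU details, the following sketch lists which vector instruction sets the CPU advertises. It assumes Linux on x86, where /proc/cpuinfo has a `flags` line; the function names and the default list of flags to check are illustrative choices, not something ScaNN provides.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report_simd_support(flags, wanted=("sse4_2", "avx", "avx2", "fma", "avx512f")):
    """Map each instruction-set flag of interest to whether the CPU advertises it."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    # /proc/cpuinfo is Linux-specific; other platforms need a different source.
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    for name, present in report_simd_support(parse_cpu_flags(text)).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```

If a flag such as `avx2` is reported as missing, a binary compiled to assume that instruction set can be expected to die with an illegal instruction on this machine.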

rickycorte commented 3 years ago

I'm using an i7-3770K, which should support AVX according to the Intel page. I'm running Ubuntu 20.04.2 LTS.

sammymax commented 3 years ago

I've been trying to reproduce this issue but I haven't been able to. I'm also using Ubuntu 20.04.2 LTS with an Ivy Bridge-era CPU (AVX but not AVX2 support). Are you using the system Python 3.8 or one from somewhere else (like pyenv)?

rickycorte commented 3 years ago

I'm using the system Python for version 3.8. I've also tried Python 3.7 with conda and still have the same kind of issue.

sammymax commented 3 years ago

I installed Conda with Python 3.7 and I also couldn't reproduce. I installed Anaconda 2021.05 from here and then did

conda create -n py37 python=3.7 anaconda
conda activate py37
pip install scann

and the import worked fine, and ScaNN managed to train and search ok too. This was all done on an Ivy Bridge CPU that should be very similar to your i7-3770K.

rickycorte commented 3 years ago

I got the issue by running this to create the environment:

conda create -n py37 python=3.7 anaconda
conda activate py37
pip install tensorflow
pip install tensorflow-recommenders
pip install scann

I run export PYTHONFAULTHANDLER="1" to see the crash stack trace. Then I run python and type import tensorflow_recommenders. This way I get a crash with a stack trace similar to the ones I've posted before.

If I try to import scann directly:

[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scann
2021-07-15 23:42:32.395084: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Fatal Python error: Illegal instruction

Current thread 0x00007fa464554740 (most recent call first):
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 58 in load_op_library
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/scann_ops/py/scann_ops.py", line 26 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1035 in _handle_fromlist
  File "/home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/__init__.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "<stdin>", line 1 in <module>
Illegal instruction (core dumped)

If I run pip uninstall scann and retry importing tensorflow_recommenders, everything works fine.

Found existing installation: scann 1.2.2
Uninstalling scann-1.2.2:
  Would remove:
    /home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann-1.2.2.dist-info/*
    /home/rickycorte/anaconda3/envs/py37/lib/python3.7/site-packages/scann/*
Proceed (y/n)? y
  Successfully uninstalled scann-1.2.2
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_recommenders
2021-07-15 23:44:38.629808: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> 

Edit: I tried installing only scann as you did and I got the same error. I'm starting to think that maybe there is something wrong with my CUDA installation outside of conda. I'll try on a virtual machine with a clean Ubuntu installation without any NVIDIA libraries.

Edit 2: Tried on a freshly created VM using native Python 3.8 and no NVIDIA CUDA, but the issue persists.

Edit 3: Tried on a new VM running on top of Windows 10 and still the same (no CUDA on host or guest). At this point I guess it's some kind of issue with my machine that is not easily reproducible.

emikulic commented 2 years ago

I think I'm seeing the crash in scann:

Thread 1 "python" received signal SIGILL, Illegal instruction.                    
0x00007fff183f910a in google::protobuf::FieldDescriptorProto::FieldDescriptorProto() () from .../python3.10/site-packages/scann/scann_ops/cc/_scann_ops.so

   0x00007fff183f9105 <+53>:  vmovq  %rax,%xmm0
=> 0x00007fff183f910a <+58>:  vpbroadcastq %xmm0,%ymm0
   0x00007fff183f910f <+63>:  vmovdqu %ymm0,0x18(%rbx)

I'm pretty sure VPBROADCAST from xmm to ymm is an AVX512 instruction, which my CPU (Sandybridge) doesn't have.

sammymax commented 2 years ago

Thanks for debugging--I think the issue is that vpbroadcastq is an AVX2 instruction, which Sandy Bridge doesn't support. We will look into compiling the ScaNN wheels for the next release without the -mavx2 flag so that this issue is resolved. You can try compiling ScaNN yourself without that flag in the meantime to see if that fixes the problem.

emikulic commented 2 years ago

Nice, thanks! I was able to build scann without AVX2 using an older version of bazel :)

avber commented 2 years ago

@emikulic could you post the steps to build scann without AVX2 using an older version of bazel? We are having trouble installing old versions of bazel.

Thanks

avber commented 2 years ago

@sammymax a docker file for the build environment would be nice to have

sammymax commented 2 years ago

Here's a related Dockerfile that might help; it compiles a version of TensorFlow Serving linked against ScaNN: https://github.com/google-research/google-research/blob/master/scann/tf_serving/Dockerfile.devel

What problems have you encountered with old versions of Bazel?

emikulic commented 2 years ago

This worked for me:

git clone git@github.com:google-research/google-research.git --depth=1
cd google-research/scann/
python configure.py
# get https://github.com/bazelbuild/bazelisk/releases/download/v1.12.0/bazelisk-linux-amd64
# install as "bazel"
echo 3.7.2 > .bazelversion
# note -march=native instead of -mavx2:
CC=clang bazel build -c opt --features=thin_lto --copt=-march=native --cxxopt="-std=c++17" --copt=-fsized-deallocation --copt=-w :build_pip_pkg
./bazel-bin/build_pip_pkg
# produces scann-1.2.7-cp310-cp310-linux_x86_64.whl which you can "pip install"

avber commented 2 years ago

@emikulic @sammymax Thank you, we have compiled scann-1.2.7 successfully.

However, export of a trained TF Lite model failed (TF 2.9.1 and scann-1.2.7). Export worked on Colab (TF 2.8.2 and scann-1.2.6)

@sammymax Is it possible to check out scann-1.2.6 branch from the repo?

As for Bazel, the sysadmin installed a 5.x version but couldn't downgrade it for some reason.

emikulic commented 2 years ago

Were you able to use bazelisk to get an older version of bazel?

avber commented 2 years ago

Yes, the sysadmin managed to install an older version of bazel.

sammymax commented 2 years ago

ScaNN 1.2.8 was recently released and doesn't assume AVX2 support; we now compile with -mavx rather than -mavx2, and do runtime dispatch to AVX2, when supported, for the important routines. Hopefully this helps.
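
To illustrate what runtime dispatch means here, a toy sketch: check once which features the CPU advertises, then route calls to the fastest implementation that is safe to run. This is only the general pattern written in Python. ScaNN's real dispatch is done in its C++ code, and the function names below are hypothetical stand-ins.

```python
def sum_of_squares_baseline(values):
    """Portable fallback; stands in for the code path that needs only AVX."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_wide(values):
    """Stand-in for an AVX2-optimized path; must return the same result."""
    return float(sum(v * v for v in values))


def select_kernel(cpu_flags):
    """Pick an implementation once, based on what the CPU advertises."""
    if "avx2" in cpu_flags:
        return sum_of_squares_wide
    return sum_of_squares_baseline


# An Ivy Bridge-like CPU (AVX, no AVX2) gets the baseline path.
kernel = select_kernel({"sse4_2", "avx"})
print(kernel.__name__, kernel([1, 2, 3]))
```

The point of the pattern is that the AVX2 path is never reached on a CPU that lacks it, so the binary as a whole only requires the baseline instruction set.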