scanner-research / scanner

Efficient video analysis at scale
https://scanner-research.github.io/
Apache License 2.0

Failed to create posix Database #209

Closed ahirner closed 5 years ago

ahirner commented 6 years ago

Creating a Database aborts with terminate called after throwing an instance of 'Xbyak::Error' on the latest scannerresearch/scanner:cpu image, running on a Docker 18.03.1-ce host on Ubuntu 16.04.

The local database seems to get initialized.

$ docker run -it --name scanner scannerresearch/scanner:cpu /bin/bash
root@6ae5d2e588cf:/opt/scanner# python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from scannerpy import Database
>>> db = Database()
Your Scanner configuration file (/root/.scanner/config.toml) does not exist. Create one? [Y/n] Y
Wrote Scanner configuration to /root/.scanner/config.toml
terminate called after throwing an instance of 'Xbyak::Error'
  what():  internal error
Aborted (core dumped)
root@6ae5d2e588cf:/opt/scanner# cat /root/.scanner/config.toml
[storage]
db_path = "/root/.scanner/db"
type = "posix"
[network]
master = "localhost"
master_port = "5001"
worker_port = "5002"
root@6ae5d2e588cf:/opt/scanner# ls -lh /root/.scanner/db
total 0
-rw-r--r-- 1 root root 0 Jun  9 13:50 db_metadata.bin

Similar issues crop up with TensorFlow. More specifically, it seems some gRPC call goes wrong. This is the stack trace:

Fatal Python error: Cannot recover from stack overflow.

Current thread 0x00007f49c1b05700 (most recent call first):
  File "/root/.local/lib/python3.5/site-packages/grpc/_common.py", line 87 in _transform
  File "/root/.local/lib/python3.5/site-packages/grpc/_common.py", line 94 in serialize
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 416 in _start_unary_request
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 467 in _prepare
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 484 in _blocking
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 499 in __call__
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 780 in <lambda>
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 355 in _try_rpc
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 779 in stop_cluster
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 347 in _handle_signal
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 494 in _blocking
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 499 in __call__
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 780 in <lambda>
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 355 in _try_rpc
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 779 in stop_cluster
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 347 in _handle_signal
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 494 in _blocking
  File "/root/.local/lib/python3.5/site-packages/grpc/_channel.py", line 499 in __call__
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 780 in <lambda>
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 355 in _try_rpc
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 779 in stop_cluster
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 347 in _handle_signal
  File "/usr/lib/python3.5/logging/__init__.py", line 133 in getLevelName
  File "/usr/lib/python3.5/logging/__init__.py", line 273 in __init__
  File "/usr/lib/python3.5/logging/__init__.py", line 1384 in makeRecord
  File "/usr/lib/python3.5/logging/__init__.py", line 1414 in _log
  File "/usr/lib/python3.5/logging/__init__.py", line 1308 in error
  File "/usr/lib/python3.5/logging/__init__.py", line 1805 in error
  File "/usr/lib/python3.5/logging/__init__.py", line 1813 in exception
  ...

Where does the JIT assembler come in, and what else could cause such an error?
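
For what it's worth, the repeating _handle_signal -> stop_cluster -> _try_rpc frames in the trace look like a signal handler being re-entered while its own RPC is still blocking, until the interpreter runs out of stack. A minimal sketch of a re-entrancy guard is below. ShutdownGuard and its cleanup callback are hypothetical names made up for illustration; this is not scannerpy's actual code, and it assumes the cleanup only needs to run once.

```python
import signal


class ShutdownGuard:
    """Runs a cleanup callback at most once, even if signals keep arriving."""

    def __init__(self, cleanup):
        # `cleanup` is a hypothetical stand-in for something like stop_cluster.
        self._cleanup = cleanup
        self._in_progress = False
        self.calls = 0  # how many times cleanup was actually started

    def handle(self, signum, frame):
        # A second signal that arrives while cleanup is running is ignored
        # instead of starting another (nested) cleanup.
        if self._in_progress:
            return
        self._in_progress = True
        self.calls += 1
        self._cleanup()


def install(guard, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register the guarded handler; must be called from the main thread."""
    for signum in signums:
        signal.signal(signum, guard.handle)
```

With a guard like this, a cleanup that is interrupted by another signal returns immediately from the nested handler call rather than recursing.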

ahirner commented 6 years ago

Update: Once I rebuilt the image without CAFFE_OPS it worked. I'll continue with Caffe2 since we need that anyhow.

fpoms commented 6 years ago

Hi @ahirner,

Both of the issues disappeared when you built with CAFFE_OPS=OFF?

The second issue seems like it might be related to a wrong GRPC or Protobuf version. Can you show the output of pip3 show protobuf and pip3 show grpcio?

ahirner commented 6 years ago

Hi @apoms, thx for the response.

The trace dumps right after a docker stop scanner. SIGKILL isn't picked up after the first error which is why I think the trace is related to db = Database(). Sry, I should have made that clear.

I was using the latest image on dockerhub, so:

$ docker run --name scanner -it scannerresearch/scanner:cpu pip3 show protobuf grpcio
---
Metadata-Version: 2.0
Name: protobuf
Version: 3.5.1
Summary: Protocol Buffers
Home-page: https://developers.google.com/protocol-buffers/
Author: protobuf@googlegroups.com
Author-email: protobuf@googlegroups.com
Installer: pip
License: 3-Clause BSD License
Location: /root/.local/lib/python3.5/site-packages
Requires: setuptools, six
Classifiers:
  Programming Language :: Python
  Programming Language :: Python :: 2
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3
  Programming Language :: Python :: 3.3
  Programming Language :: Python :: 3.4
---
Metadata-Version: 2.0
Name: grpcio
Version: 1.12.0
Summary: HTTP/2-based RPC framework
Home-page: https://grpc.io
Author: The gRPC Authors
Author-email: grpc-io@googlegroups.com
Installer: pip
License: Apache License 2.0
Location: /root/.local/lib/python3.5/site-packages
Requires: six
Classifiers:
  Development Status :: 5 - Production/Stable
  Programming Language :: Python
  Programming Language :: Python :: 2
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3
  Programming Language :: Python :: 3.4
  Programming Language :: Python :: 3.5
  Programming Language :: Python :: 3.6
  License :: OSI Approved :: Apache Software License
You are using pip version 8.1.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

fpoms commented 6 years ago

I just ran the same sequence of commands that you did using the latest docker image, but can't seem to reproduce the first error. Are you sure you pulled the latest image?

ahirner commented 6 years ago

scannerresearch/scanner   cpu   7b465c166f58   47 hours ago   4.55GB

It's definitely weird. I just tried to reproduce it on my MacBook with the same Docker version as on our workstation, and the error doesn't come up. My last guess is that it must be the workstation (AMD Ryzen). Edge case!
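
Xbyak is an x86 JIT assembler, so if the workstation CPU really is the difference, comparing the CPU vendor and instruction-set flags on the two hosts might help narrow it down. The sketch below is only a diagnostic aid under that assumption; nothing in this thread confirms the cause. parse_cpuinfo and report are hypothetical helper names, and the code assumes a Linux host (or container) where /proc/cpuinfo exists.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, set of flags) for the first CPU listed in /proc/cpuinfo text."""
    vendor = None
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "flags" and not flags:
            flags = set(value.split())
        if vendor is not None and flags:
            break  # the first processor entry is enough
    return vendor, flags


def report(path="/proc/cpuinfo"):
    """Print the vendor and a few SIMD-related flags (Linux only)."""
    with open(path) as f:
        vendor, flags = parse_cpuinfo(f.read())
    interesting = {"sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f"}
    print(vendor, sorted(flags & interesting))
```

Running report() inside the container on both the workstation and the MacBook host would show whether the two environments expose different vendors or flag sets.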

willcrichton commented 5 years ago

Closing for now until we can reproduce.