scanner-research / scanner

Efficient video analysis at scale
https://scanner-research.github.io/
Apache License 2.0

Failed to start worker. #232

Closed: sth1997 closed this issue 5 years ago

sth1997 commented 5 years ago

I built Scanner from source, and I can run the quickstart with CPU successfully. When I tried to use my cluster instead of only one node, I specified db = Database(master='localhost:5001', workers=['172.23.33.36:5002'], start_cluster=True). Then I got this error:

Traceback (most recent call last):
  File "main.py", line 27, in <module>
    output_tables = db.run(output=output_frame, jobs=[job], force=True)
  File "/home/sth/.local/lib/python3.6/site-packages/scannerpy/database.py", line 1581, in run
    self.start_workers(self._worker_paths)
  File "/home/sth/.local/lib/python3.6/site-packages/scannerpy/database.py", line 791, in start_workers
    'Timed out waiting for workers to connect to master')
scannerpy.common.ScannerException: Timed out waiting for workers to connect to master

sth@gorgon5:~/video/scanner/scanner/examples/apps/quickstart$ Traceback (most recent call last):
  File "", line 70, in <module>
  File "/home/sth/.local/lib/python3.6/site-packages/scannerpy/database.py", line 1777, in start_worker
    watchdog, db
  File "/home/sth/.local/lib/python3.6/site-packages/scannerpy/database.py", line 1692, in worker_process
    result.msg()))
scannerpy.common.ScannerException: Failed to start worker:

What could be the cause of this error? Has anyone experienced something similar? Thanks. @apoms @willcrichton

sth1997 commented 5 years ago

Here is the log info in scanner_worker.INFO:

Log file created at: 2018/11/13 23:51:27
Running on machine: gorgon4
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1113 23:51:27.198902 119295 worker.cpp:480] Creating worker
I1113 23:51:27.199143 119295 worker.cpp:497] Create master stub
I1113 23:51:27.199800 119295 worker.cpp:500] Finish master stub
I1113 23:51:27.199872 119295 worker.cpp:507] Worker created.
I1113 23:51:27.200111 119295 worker.cpp:666] Worker try to register with master
W1113 23:51:27.201092 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 1.84019 seconds.
W1113 23:51:29.842720 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 2.39438 seconds.
W1113 23:51:32.354835 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 4.7831 seconds.
W1113 23:51:42.626754 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 8.79844 seconds.
W1113 23:51:53.493829 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 16.9116 seconds.
W1113 23:52:12.656826 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 32.1976 seconds.
W1113 23:53:33.756726 119295 worker.cpp:681] GRPC_BACKOFF: reached max backoff.

sth1997 commented 5 years ago

Here is the log info in scanner_worker.WARNING:

Log file created at: 2018/11/13 23:51:27
Running on machine: gorgon4
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W1113 23:51:27.201092 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 1.84019 seconds.
W1113 23:51:29.842720 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 2.39438 seconds.
W1113 23:51:32.354835 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 4.7831 seconds.
W1113 23:51:42.626754 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 8.79844 seconds.
W1113 23:51:53.493829 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 16.9116 seconds.
W1113 23:52:12.656826 119295 worker.cpp:681] GRPC_BACKOFF: transient failure, sleeping for 32.1976 seconds.
W1113 23:53:33.756726 119295 worker.cpp:681] GRPC_BACKOFF: reached max backoff.
W1113 23:53:33.756789 119295 worker.cpp:685] Worker could not contact master server at localhost:5001 (14): Connect Failed
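The last line shows the worker on gorgon4 repeatedly failing to reach the master at localhost:5001. As a quick sanity check that is independent of Scanner, one can probe that address from the worker machine with the standard library; the host and port below are just the values from this issue:

# Minimal TCP reachability probe (stdlib only); run this on the worker host.
# From gorgon4, 'localhost' resolves to gorgon4 itself, not the master machine,
# so this probe would be expected to fail with the configuration in this issue.
import socket

host, port = 'localhost', 5001
try:
    with socket.create_connection((host, port), timeout=5):
        print('TCP connection to %s:%d succeeded' % (host, port))
except OSError as e:
    print('Cannot reach %s:%d: %s' % (host, port, e))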

sth1997 commented 5 years ago

Oh, I think this problem can be solved by specifying master = 'xxx.xx.xx.xx:5001' (the real IP address of my master) instead of master = 'localhost:5001'.

fpoms commented 5 years ago

I believe the issue here was that the worker was trying to reach the master at localhost:5001, which is not the correct address if the worker is not on the same machine as the master.
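For anyone hitting this later, here is a minimal sketch of the corrected setup. The keyword arguments are the ones quoted in this issue; the master IP is a placeholder, and the script is assumed to run on the master machine:

from scannerpy import Database

# Placeholder addresses: use the master machine's real, externally reachable IP
# (not 'localhost', which each remote worker would resolve to itself), plus the
# address of every worker machine.
MASTER_ADDR = '172.23.33.35:5001'      # hypothetical master IP:port
WORKER_ADDRS = ['172.23.33.36:5002']   # worker address from the issue

db = Database(master=MASTER_ADDR,
              workers=WORKER_ADDRS,
              start_cluster=True)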