scanner-research / scanner

Efficient video analysis at scale
https://scanner-research.github.io/
Apache License 2.0

Kubernetes AWS -- face detection example fails #214

Closed jblakley closed 6 years ago

jblakley commented 6 years ago

I'm porting the face_detection example to Kubernetes and get the following error:

Detecting faces in movie star_wars_heros.mp4
Ingesting video into Scanner ...
Detecting faces...
0%| | 0/49 [03:01<?, ?it/s, workers=4, tasks=49, jobs=1]
Traceback (most recent call last):
  File "k8face1.py", line 45, in
    movie_name + '_bboxes')
  File "/root/.local/lib/python3.5/site-packages/scannerpy/stdlib/pipelines.py", line 134, in detect_faces
    pipeline_instances_per_node=pipeline_instances)
  File "/root/.local/lib/python3.5/site-packages/scannerpy/database.py", line 1542, in run
    raise ScannerException(job_status.result.msg)
scannerpy.common.ScannerException: No workers but have unfinished work after 180 seconds

I can successfully run the example app, shot_detection, and optical_flow on the cluster, and face detection runs in a local container. It fails only when running on the cluster. "k8face1.py" is my port of face_detection/main.py; it changes only the database setup for a cluster:

db = Database(master=master, start_cluster=False, config_path='./config.toml')

The offending code seems to be:

print('Detecting faces...')
[bboxes_table] = pipelines.detect_faces(
    db, [input_table.column('frame')], sampler, sampler_args,
    movie_name + '_bboxes')

Let me know if you need more info.

fpoms commented 6 years ago

Are there any notifications about failed workers? Can you get the logs from the master node? You can get the name of the master pod by running kubectl get pods -l 'app=scanner-master' and then get the logs from that node by running kubectl logs <name-of-master>.

jblakley commented 6 years ago

The main indication of an error is in the worker logs. The operative line seems to be:

E0710 17:51:46.664628 827 caffe_kernel.cpp:240] Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.

In context:

I0710 17:51:18.827316 44 worker.cpp:1873] All threads are finished
I0710 17:51:18.827327 44 worker.cpp:1893] Max memory allocated: 7983 MBs
I0710 17:51:18.827329 44 worker.cpp:1895] Current memory allocated: 0 MBs
I0710 17:51:18.827334 44 worker.cpp:1901] Leaked allocations:
I0710 17:51:18.913199 44 worker.cpp:1980] Worker 2 finished job
I0710 17:51:46.501016 45 worker.cpp:546] Worker 2 received NewJob
I0710 17:51:46.501039 45 metadata.cpp:452] Setting DB path to scanner_db
I0710 17:51:46.501583 44 worker.cpp:726] Worker 2 loading Op library: /root/.local/lib/python3.5/site-packages/scannerpy/lib/libscanner_stdlib.so
I0710 17:51:46.502040 44 metadata.cpp:452] Setting DB path to scanner_db
I0710 17:51:46.554877 44 storehouse.h:45] Reading scanner_db/table_megafile.bin (size 8, pos 0)
I0710 17:51:46.583654 44 storehouse.h:45] Reading scanner_db/table_megafile.bin (size 1052, pos 8)
I0710 17:51:46.605309 44 storehouse.h:45] Reading scanner_db/table_megafile.bin (size 2104, pos 1060)
I0710 17:51:46.637289 44 storehouse.h:45] Reading scanner_db/table_megafile.bin (size 89911, pos 3164)
I0710 17:51:46.663859 44 worker.cpp:1226] Initial pipeline instances per node: 1
I0710 17:51:46.663878 44 worker.cpp:1266] Pipeline instances per node: 1
I0710 17:51:46.664127 818 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664217 819 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664361 821 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664490 820 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664572 822 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664579 827 evaluate_worker.cpp:339] Kernel finished validation 1
E0710 17:51:46.664628 827 caffe_kernel.cpp:240] Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.
I0710 17:51:46.664640 827 evaluate_worker.cpp:339] Kernel finished validation 0
E0710 17:51:46.664645 827 evaluate_worker.cpp:341] Kernel validate failed: Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.
I0710 17:51:46.664666 823 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664808 824 load_worker.cpp:52] Source finished validation 1
I0710 17:51:46.664932 825 load_worker.cpp:52] Source finished validation 1
W0710 17:51:46.665154 44 worker.cpp:1751] (N/KI/KG: 2/0/0) returned error result: Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.
I0710 17:51:46.665160 818 worker.cpp:91] Load (N/PU: 2/0): processing job task (0, 2)
I0710 17:51:46.665195 821 worker.cpp:122] Load (N/PU: 2/3): thread finished
I0710 17:51:46.665196 819 worker.cpp:122] Load (N/PU: 2/1): thread finished
I0710 17:51:46.665303 820 worker.cpp:122] Load (N/PU: 2/2): thread finished
I0710 17:51:46.665307 822 worker.cpp:122] Load (N/PU: 2/4): thread finished
I0710 17:51:46.665333 824 worker.cpp:122] Load (N/PU: 2/6): thread finished
I0710 17:51:46.665323 823 worker.cpp:122] Load (N/PU: 2/5): thread finished
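
Editor's note: the worker logs use the glog line format, where the first character is the severity (I, W, E, or F). A minimal sketch of pulling out only the warning and error lines from a saved log might look like the following. The function name `problems` and the regular expression are illustrative assumptions, not part of Scanner; the sample lines are copied from the log above.

```python
import re

# Assumed glog prefix: severity letter, MMDD, HH:MM:SS.uuuuuu, thread id, file:line]
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]\s]+:\d+)\] (?P<msg>.*)$"
)


def problems(log_text, severities="WEF"):
    """Return (severity, source, message) for each log line at the given severities."""
    found = []
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line.strip())
        if match and match.group("sev") in severities:
            found.append((match.group("sev"), match.group("src"), match.group("msg")))
    return found


# Three lines copied from the worker log in this thread.
sample = """\
I0710 17:51:46.664579 827 evaluate_worker.cpp:339] Kernel finished validation 1
E0710 17:51:46.664628 827 caffe_kernel.cpp:240] Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.
W0710 17:51:46.665154 44 worker.cpp:1751] (N/KI/KG: 2/0/0) returned error result: Model path /root/.scanner/resources/facenet_deploy.prototxt does not exist.
"""

for sev, src, msg in problems(sample):
    print(sev, src, msg)
```

Filtering this way skips the informational lines and leaves the two that name the missing model path.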

fpoms commented 6 years ago

The issue is that the Docker image you built doesn't contain the facenet_deploy.prototxt file, so the worker can't read it. The fix is to modify the Dockerfile to include a line like:

COPY location/of/facenet_deploy.prototxt /root/.scanner/resources/facenet_deploy.prototxt

jblakley commented 6 years ago

Got it working with that fix but needed a couple of other files as well: facenet_deploy.caffemodel and facenet_templates.bin. Thanks!
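
Editor's note: since three model files turned out to be required, a small preflight check run inside the worker image could report all missing files at once rather than one failed job at a time. This is a sketch under assumptions: the helper `missing_resources` is hypothetical and not part of scannerpy, the file names are the three mentioned in this thread, and the default directory is the one shown in the worker log.

```python
from pathlib import Path

# File names mentioned in this thread; other pipelines may need different ones.
REQUIRED = (
    "facenet_deploy.prototxt",
    "facenet_deploy.caffemodel",
    "facenet_templates.bin",
)


def missing_resources(resource_dir="/root/.scanner/resources", names=REQUIRED):
    """Return the names from `names` that are not regular files in `resource_dir`."""
    base = Path(resource_dir)
    return [name for name in names if not (base / name).is_file()]


# Hypothetical usage: run inside the worker container before submitting a job.
missing = missing_resources()
print("missing model files:", ", ".join(missing) if missing else "none")
```

Running this in the image after each Dockerfile change would show whether every COPY landed where the worker expects it.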