Face Detection terminates with SIGILL

ITblacksheep commented 3 years ago

    2021-08-17 13:03:14,119 INFO Starting 1 classify.face workers
    2021-08-17 13:03:14,141 INFO Running task: classify.face - 0df026d9-ba63-4459-874b-11799ebd2d19
    2021-08-17 09:03:21,150 INFO success: classification_face_detection_processor entered RUNNING state, process has stayed up for > than 21 seconds (startsecs)
    2021-08-17 09:03:23,923 INFO exited: classification_face_detection_processor (terminated by SIGILL; not expected)
    2021-08-17 09:03:24,926 INFO spawned: 'classification_face_detection_processor' with pid 7137

damianmoore commented 3 years ago

Hi @ITblacksheep. Thanks for reporting the issue and sorry you had trouble. Could you provide some more info about the machine (memory etc.) and whether this occurred when starting/doing a big import etc.? I'm wondering if it could be an out-of-memory error or if it could be a particular image that it's failing to handle. If you can find a particular test image that always causes the error that would be very helpful. Thanks.

ITblacksheep commented 3 years ago

My server has 64 GB of RAM. This is a large import: I have 10k images so far.

ITblacksheep commented 3 years ago

When running it manually this is what I get.

    sleep 11 && nice -n 19 python /srv/photonix/manage.py classification_face_processor
    2021-08-30 12:10:27,046 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
    2021-08-30 12:10:27,064 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
    2021-08-30 12:10:27,065 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
    2021-08-30 12:10:27,093 WARNING Limited tf.summary API due to missing TensorBoard installation.
    2021-08-30 12:10:27,373 INFO Starting 1 classify.face workers
    2021-08-30 12:13:06,546 INFO Running task: classify.face - 2de36454-1105-4791-898a-383a2f16cc9f
    Illegal instruction

ITblacksheep commented 3 years ago

Dump from strace

    getpid() = 13165
    stat("/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_retrained_version.txt", {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
    openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_retrained_version.txt", O_RDONLY|O_CLOEXEC) = 10
    fstat(10, {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
    ioctl(10, TCGETS, 0x7fff8d4153c0) = -1 ENOTTY (Inappropriate ioctl for device)
    lseek(10, 0, SEEK_CUR) = 0
    ioctl(10, TCGETS, 0x7fff8d4152e0) = -1 ENOTTY (Inappropriate ioctl for device)
    lseek(10, 0, SEEK_CUR) = 0
    fstat(10, {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
    read(10, "20210825141008", 15) = 14
    read(10, "", 1) = 0
    close(10) = 0
    openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_faces.ann", O_RDONLY) = 10
    lseek(10, 0, SEEK_END) = 21648
    mmap(NULL, 21648, PROT_READ, MAP_SHARED, 10, 0) = 0x14e20e8d8000
    openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_faces_tag_ids.json", O_RDONLY|O_CLOEXEC) = 11
    fstat(11, {st_mode=S_IFREG|0644, st_size=1400, ...}) = 0
    ioctl(11, TCGETS, 0x7fff8d415510) = -1 ENOTTY (Inappropriate ioctl for device)
    lseek(11, 0, SEEK_CUR) = 0
    ioctl(11, TCGETS, 0x7fff8d415430) = -1 ENOTTY (Inappropriate ioctl for device)
    lseek(11, 0, SEEK_CUR) = 0
    fstat(11, {st_mode=S_IFREG|0644, st_size=1400, ...}) = 0
    read(11, "[\"a4ccb352-e763-45d4-b2cb-539023"..., 1401) = 1400
    read(11, "", 1) = 0
    close(11) = 0
    getpid() = 13165
    ioctl(3, FIONBIO, [1]) = 0
    recvfrom(3, 0x55575b5ddd80, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    ioctl(3, FIONBIO, [0]) = 0
    sendto(3, "*7\r\n$7\r\nEVALSHA\r\n$40\r\nae7cc25ea7"..., 179, 0, NULL, 0) = 179
    recvfrom(3, ":0\r\n", 65536, 0, NULL, NULL) = 4
    getpid() = 13165
    --- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14e25e3e5056} ---
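The last entry of the trace is the informative one: it names the signal, the reason code, and the faulting address. As a minimal sketch (the helper name `parse_signal_line` is made up for illustration and is not part of strace or Photonix), the fields can be pulled out of such a line like this:

```python
import re

# Matches strace signal-delivery lines such as:
#   --- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x...} ---
SIGNAL_LINE = re.compile(r"^--- (?P<signal>SIG[A-Z0-9]+) \{(?P<fields>[^}]*)\} ---$")


def parse_signal_line(line):
    """Return the siginfo fields of a strace signal line, or None for other lines."""
    match = SIGNAL_LINE.match(line.strip())
    if match is None:
        return None
    # Fields are "key=value" pairs separated by ", ".
    fields = dict(item.split("=", 1) for item in match.group("fields").split(", "))
    return {"signal": match.group("signal"), **fields}


# The signal line copied from the dump above.
info = parse_signal_line(
    "--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14e25e3e5056} ---"
)
print(info["si_code"], info["si_addr"])
```

`ILL_ILLOPN` is the POSIX reason code for an illegal operand, and `si_addr` is the address of the faulting instruction. Comparing that address against `/proc/<pid>/maps` of the running worker should show which shared library the instruction lives in.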

Vlad1mir-D commented 3 years ago

Cause of this issue: https://github.com/tensorflow/tensorflow/issues/18275 TL;DR: TensorFlow requires AVX, but your processor (and mine too) doesn't support this instruction set.

ITblacksheep commented 3 years ago

@Vlad1mir-D I checked cpuinfo, and avx is listed:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

Vlad1mir-D commented 3 years ago

@ITblacksheep But you don't have avx2 or avx512. I'm not sure which optimization options were used to build the generic tensorflow package on PyPI, but you could always use intel-tensorflow, whose optimization options are well documented here: https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html#Additional%20Info
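To make the flag comparison less error-prone than reading the list by eye, here is a small sketch that reports which SIMD extensions a cpuinfo dump advertises. The function names and the shortened sample line are illustrative only; the sample is an excerpt of the flags posted above, not the full list.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags, wanted=("avx", "avx2", "avx512f", "fma")):
    """Map each extension name to whether it appears in the flag set."""
    return {name: name in flags for name in wanted}


# Excerpt of the flags line posted in this thread (shortened for the example).
sample = "flags : fpu vme sse sse2 ssse3 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm"
report = simd_support(cpu_flags(sample))
print(report)
```

On a live Linux machine the same functions can be fed the real file, for example `simd_support(cpu_flags(open("/proc/cpuinfo").read()))`. Which of these extensions a particular TensorFlow wheel actually needs depends on how that wheel was built, so the list of names to check is an assumption.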

BTW, there is a chance I'll publish Docker images of Photonix with arch-optimized builds of hot binaries such as tensorflow, exiftools, libheif and so on, but that chance is relatively small: it would require a lot of compile time, and just building an image with an optimized tensorflow for my dedicated Avoton C2750 took a lot of my spare time :(

ITblacksheep commented 3 years ago

@Vlad1mir-D Face detection works on some faces, but it fails sporadically. I also don't see where TensorFlow requires anything more than plain AVX. Is there anywhere I can crank up the logging to see if I can find the issue?
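One way to get more information at the moment of the crash is Python's standard `faulthandler` module, which installs handlers for fatal signals, SIGILL included, and dumps the Python traceback when one arrives. A minimal sketch follows; the helper name, the log file name, and where this would be called inside Photonix are all assumptions.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks for fatal signals (SIGILL among them) to log_path."""
    # The file must stay open for as long as the handler is installed.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.gettempdir(), "face_worker_crash.log")
crash_log = enable_crash_tracebacks(log_path)
```

The same handler can be switched on without code changes by starting the worker with `python -X faulthandler /srv/photonix/manage.py classification_face_processor`, in which case the traceback goes to stderr. The traceback shows which Python call was executing when the signal arrived; it does not identify the offending machine instruction.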

Vlad1mir-D commented 3 years ago

@ITblacksheep https://drive.google.com/drive/folders/1W3dG4YdNw772zfUEf1KazqYIbKtXVsZ9?usp=sharing - these are packages highly optimized for the Intel Silvermont architecture (specifically the Atom C2758 CPU), which doesn't support AVX. The binaries were built with the following additions to CFLAGS, CPPFLAGS, CXXFLAGS and FORTRANFLAGS:

-O3 -pipe -fexceptions -grecord-gcc-switches -march=silvermont -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-sgx -mno-bmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mprfchw -mno-adx -mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote -mno-ptwrite -mno-avx512bf16 -mno-enqcmd -mno-avx512vp2intersect --param l1-cache-size=24 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=silvermont

To include these packages into the Photonix Docker image, follow these steps:

Comment out the following lines in docker/install_and_upload_python_packages.py:

    #if dependency.startswith('tensorflow') and os.uname().machine == 'x86_64':
    #    tf_version = re.search('\d+.\d+.\d+', dependency).group(0)
    #    dependency = f'https://pypi.epixstudios.co.uk/packages/tensorflow-{tf_version}-cp38-cp38-linux_x86_64.whl'

Create a directory named dist and put the packages you want to use into it. To fix the SIGILL, you need to add at least the TensorFlow package.
Replace the package versions in requirements.txt. I'm using the following versions, which work perfectly fine:
```
numpy==1.19.2
scipy==1.7.1
matplotlib==3.4.3
tensorflow==2.4.3
opencv-python==4.5.3.56
annoy==1.17.0
```

Add the following lines:

COPY dist /tmp/dist
RUN bash -c 'if [[ -d /tmp/dist ]]; then pip install /tmp/dist/*; fi'

to docker/Dockerfile.prd, after the following lines:

COPY requirements.txt /srv/requirements.txt
COPY docker/install_and_upload_python_packages.py /root/install_and_upload_python_packages.py

Rebuild the Photonix Docker image.

That's it! No more SIGILLs, as TensorFlow won't attempt to use any of the AVX instruction sets.
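To check the result after a rebuild, a command can be run in a child process and its exit status inspected: on POSIX systems, `subprocess` reports a child killed by a signal as a negative return code. This is only a sketch; the helper name is made up, and the demo child that sends itself SIGILL merely stands in for the crashing worker.

```python
import signal
import subprocess
import sys


def died_with_sigill(command):
    """Run command and report whether it was killed by SIGILL (POSIX only)."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # A child terminated by signal N has returncode == -N.
    return completed.returncode == -signal.SIGILL


# Stand-in for the crashing worker: a child that sends SIGILL to itself.
crash_cmd = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGILL)"]
# Stand-in for a healthy process.
ok_cmd = [sys.executable, "-c", "pass"]

print(died_with_sigill(crash_cmd), died_with_sigill(ok_cmd))
```

Inside the rebuilt container one could try `died_with_sigill([sys.executable, "-c", "import tensorflow"])`. Note that in this thread the crash occurred while a classify.face task was running, so a clean import alone does not prove the problem is gone; running a real face detection task is the more reliable check.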

photonixapp / photonix

Face Detection terminates with SIGILL #324