Open ITblacksheep opened 3 years ago
Hi @ITblacksheep. Thanks for reporting the issue and sorry you had trouble. Could you provide some more info about the machine (memory etc.) and whether this occurred when starting/doing a big import etc.? I'm wondering if it could be an out-of-memory error or if it could be a particular image that it's failing to handle. If you can find a particular test image that always causes the error that would be very helpful. Thanks.
My server has 64 GB of RAM. This is a large import, as I have 10k images so far.
When running it manually this is what I get.
sleep 11 && nice -n 19 python /srv/photonix/manage.py classification_face_processor
2021-08-30 12:10:27,046 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2021-08-30 12:10:27,064 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2021-08-30 12:10:27,065 WARNING Limited tf.compat.v2.summary API due to missing TensorBoard installation.
2021-08-30 12:10:27,093 WARNING Limited tf.summary API due to missing TensorBoard installation.
2021-08-30 12:10:27,373 INFO Starting 1 classify.face workers
2021-08-30 12:13:06,546 INFO Running task: classify.face - 2de36454-1105-4791-898a-383a2f16cc9f
Illegal instruction
Dump from strace
getpid() = 13165
stat("/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_retrained_version.txt", {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_retrained_version.txt", O_RDONLY|O_CLOEXEC) = 10
fstat(10, {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
ioctl(10, TCGETS, 0x7fff8d4153c0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(10, 0, SEEK_CUR) = 0
ioctl(10, TCGETS, 0x7fff8d4152e0) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(10, 0, SEEK_CUR) = 0
fstat(10, {st_mode=S_IFREG|0644, st_size=14, ...}) = 0
read(10, "20210825141008", 15) = 14
read(10, "", 1) = 0
close(10) = 0
openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_faces.ann", O_RDONLY) = 10
lseek(10, 0, SEEK_END) = 21648
mmap(NULL, 21648, PROT_READ, MAP_SHARED, 10, 0) = 0x14e20e8d8000
openat(AT_FDCWD, "/data/models/face/6b98410c-bd60-49a7-9f27-6dc6b7fa4108_faces_tag_ids.json", O_RDONLY|O_CLOEXEC) = 11
fstat(11, {st_mode=S_IFREG|0644, st_size=1400, ...}) = 0
ioctl(11, TCGETS, 0x7fff8d415510) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(11, 0, SEEK_CUR) = 0
ioctl(11, TCGETS, 0x7fff8d415430) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(11, 0, SEEK_CUR) = 0
fstat(11, {st_mode=S_IFREG|0644, st_size=1400, ...}) = 0
read(11, "[\"a4ccb352-e763-45d4-b2cb-539023"..., 1401) = 1400
read(11, "", 1) = 0
close(11) = 0
getpid() = 13165
ioctl(3, FIONBIO, [1]) = 0
recvfrom(3, 0x55575b5ddd80, 65536, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
ioctl(3, FIONBIO, [0]) = 0
sendto(3, "*7\r\n$7\r\nEVALSHA\r\n$40\r\nae7cc25ea7"..., 179, 0, NULL, 0) = 179
recvfrom(3, ":0\r\n", 65536, 0, NULL, NULL) = 4
getpid() = 13165
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x14e25e3e5056} ---
Cause of this issue: https://github.com/tensorflow/tensorflow/issues/18275 TL;DR: TensorFlow requires AVX, but your processor (and mine too) doesn't support this instruction set.
@Vlad1mir-D I checked cpuinfo, and avx is listed:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
@ITblacksheep But you don't have avx2 or avx512. I'm not exactly sure what optimization options were used to build the generic tensorflow package on PyPI, but you could always use intel-tensorflow, whose optimization options are well documented here: https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html#Additional%20Info
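To see at a glance which AVX variants a CPU advertises, the flags line from /proc/cpuinfo can be parsed directly. This is only a minimal sketch: the helper names avx_support and read_cpu_flags are hypothetical, and the result only reflects what the kernel reports, not what a particular TensorFlow wheel was compiled to use.

```python
def avx_support(flags_line):
    """Report which AVX variants appear in a /proc/cpuinfo 'flags' line."""
    # Drop the optional "flags :" prefix, then split into individual flag names.
    _, _, rest = flags_line.rpartition(":")
    flags = set(rest.split())
    return {name: name in flags for name in ("avx", "avx2", "avx512f")}


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line from cpuinfo (Linux only), or '' if absent."""
    with open(path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return line
    return ""
```

Calling avx_support(read_cpu_flags()) on the machine in question shows the situation directly. For the flags posted above, avx is present while avx2 and avx512f are not.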
BTW, there is a chance I'm going to publish Docker images of Photonix with arch-optimized builds of hot binaries like tensorflow, exiftools, libheif and so on, but that chance is relatively small: it would require a lot of time spent compiling, and just building an image with an optimized tensorflow for my dedicated Avoton C2750 took a lot of my spare time :(
@Vlad1mir-D Face detection is working on some faces but it is failing sporadically. I also don't see where TensorFlow requires anything more than plain AVX. Is there anywhere I can crank up the logging to see if I can find the issue?
@ITblacksheep
https://drive.google.com/drive/folders/1W3dG4YdNw772zfUEf1KazqYIbKtXVsZ9?usp=sharing - these are packages highly optimized for the Intel Silvermont architecture (specifically the Atom C2758 CPU), where AVX isn't supported.
Binaries were built with the following additions to CFLAGS, CPPFLAGS, CXXFLAGS and FORTRANFLAGS:
-O3 -pipe -fexceptions -grecord-gcc-switches -march=silvermont -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-sgx -mno-bmi2 -mno-pconfig -mno-wbnoinvd -mno-tbm -mno-avx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mprfchw -mno-adx -mfxsr -mno-xsave -mno-xsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid -mno-gfni -mno-shstk -mno-avx512vbmi2 -mno-avx512vnni -mno-vaes -mno-vpclmulqdq -mno-avx512bitalg -mno-avx512vpopcntdq -mno-movdiri -mno-movdir64b -mno-waitpkg -mno-cldemote -mno-ptwrite -mno-avx512bf16 -mno-enqcmd -mno-avx512vp2intersect --param l1-cache-size=24 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=silvermont
To include these packages in the Photonix Docker image, follow these steps:
Comment out the following lines in docker/install_and_upload_python_packages.py:
#if dependency.startswith('tensorflow') and os.uname().machine == 'x86_64':
# tf_version = re.search('\d+.\d+.\d+', dependency).group(0)
# dependency = f'https://pypi.epixstudios.co.uk/packages/tensorflow-{tf_version}-cp38-cp38-linux_x86_64.whl'
Create a directory named dist and put the packages you want to use into it.
To fix the SIGILL you should add at least the TensorFlow package.
Update the package versions in requirements.txt. I'm using the following versions, which work perfectly fine:
numpy==1.19.2
scipy==1.7.1
matplotlib==3.4.3
tensorflow==2.4.3
opencv-python==4.5.3.56
annoy==1.17.0
Add the lines
COPY dist /tmp/dist
RUN bash -c 'if [[ -d /tmp/dist ]]; then pip install /tmp/dist/*; fi'
to docker/Dockerfile.prd after the following lines:
COPY requirements.txt /srv/requirements.txt
COPY docker/install_and_upload_python_packages.py /root/install_and_upload_python_packages.py
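Since the wheels in dist and the versions in requirements.txt are presumably meant to agree, a quick consistency check before building the image may save a rebuild. The sketch below is not part of Photonix: the helper names are hypothetical, and it assumes plain name==version lines and standard wheel file names.

```python
import re


def normalize(name):
    """Normalize a distribution name so opencv_python matches opencv-python."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirements_text):
    """Map normalized package name to pinned version for 'name==version' lines."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if "==" in line:
            name, version = line.split("==", 1)
            pins[normalize(name.strip())] = version.strip()
    return pins


def mismatched_wheels(wheel_filenames, requirements_text):
    """Return wheels whose version differs from, or is missing in, the pins."""
    pins = parse_pins(requirements_text)
    problems = []
    for filename in wheel_filenames:
        # Wheel names look like {name}-{version}-{python}-{abi}-{platform}.whl
        name, version = filename.split("-")[:2]
        if pins.get(normalize(name)) != version:
            problems.append(filename)
    return problems
```

It could be called as mismatched_wheels(os.listdir("dist"), open("requirements.txt").read()); an empty list means every wheel matches a pinned version.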
That's it! No more SIGILLs, as TensorFlow won't attempt to use any of the AVX instruction sets.
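One way to check the result, and to get more detail than a bare "Illegal instruction", is to run a small snippet in a child interpreter with Python's faulthandler enabled and inspect how the child exited. This is a sketch under assumptions: killed_by_sigill is a hypothetical helper, it relies on POSIX return codes, and the snippet passed in has to exercise the code path that actually crashes. In the logs in this thread the crash happens only after a task starts, so an import-only check may pass.

```python
import signal
import subprocess
import sys


def killed_by_sigill(code, python=sys.executable):
    """Run `code` in a child interpreter; return (was_sigill, stderr_text)."""
    result = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on fatal signals.
        [python, "-X", "faulthandler", "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    # On POSIX, a return code of -N means the child was terminated by signal N.
    return result.returncode == -signal.SIGILL, result.stderr
```

For example, killed_by_sigill("import tensorflow as tf; tf.reduce_sum(tf.random.normal([64, 64]))") runs a tiny computation. If the first element of the result is True, the second contains the traceback faulthandler printed, which shows the Python frame that was active when the illegal instruction was hit.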
2021-08-17 13:03:14,119 INFO Starting 1 classify.face workers
2021-08-17 13:03:14,141 INFO Running task: classify.face - 0df026d9-ba63-4459-874b-11799ebd2d19
2021-08-17 13:03:14,119 INFO Starting 1 classify.face workers
2021-08-17 13:03:14,141 INFO Running task: classify.face - 0df026d9-ba63-4459-874b-11799ebd2d19
2021-08-17 09:03:21,150 INFO success: classification_face_detection_processor entered RUNNING state, process has stayed up for > than 21 seconds (startsecs)
2021-08-17 09:03:21,150 INFO success: classification_face_detection_processor entered RUNNING state, process has stayed up for > than 21 seconds (startsecs)
2021-08-17 09:03:23,923 INFO exited: classification_face_detection_processor (terminated by SIGILL; not expected)
2021-08-17 09:03:24,926 INFO spawned: 'classification_face_detection_processor' with pid 7137