Closed: shivabohemian closed this issue 1 month ago.
Could you provide your full logs so we can investigate?
https://github.com/milvus-io/milvus/tree/master/deployments/export-log
From your logs, you need to check your disk:
{"level":"warn","ts":"2024-08-30T20:05:20.732+0800","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"1.495579724s
etcd takes 1.5s to apply a request, which is too long. Did you make sure to deploy Milvus on an SSD?
/assign @shivabohemian /unassign
I put etcd on eMMC. This is the Milvus log from startup to the query crash: [output.log]. On August 31 at 11:22:21 I ran a vector query and it crashed. The SDK reported the same error as mentioned in the issue; the Milvus log contained no panic message, but the service did crash and restart.
By the way, I packaged the Docker image v2.4.10 into a deb package. It seems the libraries in the image's lib folder have no symbolic links, so some libraries are duplicated, which inflates the image size (see the sketch after the listing below).
48M libblob-chunk-manager.so
80K libdouble-conversion.so
80K libdouble-conversion.so.3
80K libdouble-conversion.so.3.2.0
256K libevent_core-2.1.so
256K libevent_core-2.1.so.7
256K libevent_core-2.1.so.7.0.1
256K libevent_core.so
172K libevent_extra-2.1.so
172K libevent_extra-2.1.so.7
172K libevent_extra-2.1.so.7.0.1
172K libevent_extra.so
6.5M libevent_openssl-2.1.so
6.5M libevent_openssl-2.1.so.7
6.5M libevent_openssl-2.1.so.7.0.1
6.5M libevent_openssl.so
20K libevent_pthreads-2.1.so
20K libevent_pthreads-2.1.so.7
20K libevent_pthreads-2.1.so.7.0.1
20K libevent_pthreads.so
113M libfolly.so
113M libfolly.so.0.58.0-dev
724K libfolly_exception_counter.so
724K libfolly_exception_counter.so.0.58.0-dev
600K libfolly_exception_tracer.so
600K libfolly_exception_tracer.so.0.58.0-dev
6.8M libfolly_exception_tracer_base.so
6.8M libfolly_exception_tracer_base.so.0.58.0-dev
8.4M libfolly_test_util.so
8.4M libfolly_test_util.so.0.58.0-dev
11M libfollybenchmark.so
11M libfollybenchmark.so.0.58.0-dev
204K libgflags_nothreads.so
204K libgflags_nothreads.so.2.2
204K libgflags_nothreads.so.2.2.2
252K libglog.so
252K libglog.so.0.6.0
252K libglog.so.1
384K libhwloc.so
384K libhwloc.so.15
384K libhwloc.so.15.6.4
4.3M libjemalloc.so
4.3M libjemalloc.so.2
278M libknowhere.so
434M libmilvus_core.so
236K librdkafka++.so
236K librdkafka++.so.1
9.9M librdkafka.so
9.9M librdkafka.so.1
13M librocksdb.so
13M librocksdb.so.6
13M librocksdb.so.6.29.5
336K libtbb.so
336K libtbb.so.12
336K libtbb.so.12.9
28K libtbbbind_2_5.so
28K libtbbbind_2_5.so.3
28K libtbbbind_2_5.so.3.9
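To confirm these are byte-identical copies rather than distinct builds, here is a small hedged sketch; the lib path is hypothetical, so point it at the image's actual lib folder:

```python
# Find byte-identical .so files that could be replaced by symlinks.
# The directory below is a hypothetical path; adjust for your image layout.
import hashlib
import os
from collections import defaultdict

lib_dir = "/milvus/lib"  # hypothetical location of the image's lib folder
by_hash = defaultdict(list)
for name in sorted(os.listdir(lib_dir)):
    path = os.path.join(lib_dir, name)
    # Only hash regular files; real symlinks are already deduplicated.
    if os.path.isfile(path) and not os.path.islink(path) and ".so" in name:
        with open(path, "rb") as f:
            by_hash[hashlib.sha256(f.read()).hexdigest()].append(name)

for names in by_hash.values():
    if len(names) > 1:
        print(f"identical copies: {', '.join(names)}")
```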
You need to check the speed of your eMMC.
Milvus usually panics when etcd is too slow.
Setting common.session.ttl to a longer value might help a little, but an SSD is really needed here (a quick latency probe follows below).
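To put a number on "too slow", here is a rough sketch that measures fsync latency on the disk holding etcd's data; the probe path is an assumption, so point it at your actual etcd data directory. etcd's guidance is that p99 WAL fsync latency should stay well below 10 ms:

```python
# Rough fsync-latency probe for the etcd data disk.
# etcd's WAL does many small synchronous writes, so fsync latency is the
# metric that matters; the path below is hypothetical.
import os
import time

probe = "/var/lib/etcd/fsync_probe.tmp"  # hypothetical: must live on the etcd disk
fd = os.open(probe, os.O_WRONLY | os.O_CREAT, 0o600)
latencies = []
for _ in range(100):
    os.write(fd, b"x" * 512)              # small WAL-like write
    start = time.perf_counter()
    os.fsync(fd)                          # the call etcd is sensitive to
    latencies.append(time.perf_counter() - start)
os.close(fd)
os.remove(probe)

latencies.sort()
print(f"p99 fsync latency: {latencies[98] * 1000:.2f} ms")  # aim well below 10 ms
```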
Did something change here? It worked fine until v2.4.6.
I started the binary package directly and saw an "Illegal instruction" error.
[2024/09/01 17:58:26.629 +08:00] [INFO] [proxy/meta_cache.go:493] ["meta update success"] [database=default] [collectionName=img_feature] [collectionID=452217503082086433]
[2024/09/01 17:58:26.629 +08:00] [INFO] [querycoordv2/services.go:136] ["show partitions request received"] [traceID=a52b57943dd2a8a2b69be49b28ff9be3] [collectionID=452217503082086433] [partitions="[452217503082088626]"]
[2024/09/01 17:58:26.630 +08:00] [INFO] [rootcoord/root_coord.go:2811] ["received request to describe database "] [traceID=715635b06f7b2b6091a5e736caf34cc8] [dbName=default]
[2024/09/01 17:58:26.630 +08:00] [INFO] [rootcoord/root_coord.go:2835] ["done to describe database"] [traceID=715635b06f7b2b6091a5e736caf34cc8] [dbName=default] [ts=452246819731668997]
[2024/09/01 17:58:26.631 +08:00] [INFO] [proxy/meta_cache.go:1047] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=715635b06f7b2b6091a5e736caf34cc8] [collectionName=img_feature] [collectionID=452217503082086433]
Illegal instruction
Could you run lscpu and check which architecture you're running on?
We've been working with x86 (preferably a recent generation such as Ice Lake or later), Mac M1/M2/M3, and AWS Graviton, and all work fine.
We can try to reproduce this if we can get the same machine as yours.
Your code showing how you use Milvus would also be really helpful, so we know which part of the code could cause the problem.
The lscpu information is as follows; it's an x86 architecture.
I simply created an IVF_SQ8 index and inserted the data, and then the search hit this problem. In my tests it has behaved like this since v2.4.8 (a minimal repro sketch follows the lscpu output below).
I build and install the deb package directly from the compiled files in the Docker image.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Celeron(R) N5105 @ 2.00GHz
BIOS Model name: Intel(R) Celeron(R) N5105 @ 2.00GHz To Be Filled By O.E.M. CPU @ 2.8GHz
BIOS CPU family: 15
CPU family: 6
Model: 156
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 0
CPU(s) scaling MHz: 28%
CPU max MHz: 2900.0000
CPU min MHz: 800.0000
BogoMIPS: 3993.60
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms rdt_a rdseed smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req umip waitpkg gfni rdpid movdiri movdir64b md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1.5 MiB (1 instance)
L3: 4 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled
Reg file data sampling: Mitigation; Clear Register File
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Srbds: Vulnerable: No microcode
Tsx async abort: Not affected
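For reference, a minimal pymilvus sketch of the repro path described above (IVF_SQ8 index, insert, then search). The collection name, field names, and dimension are hypothetical, and pymilvus 2.4.x with a local Milvus on the default port is assumed:

```python
# Minimal repro sketch: build an IVF_SQ8 index and run a search.
# All names and sizes below are hypothetical placeholders.
import numpy as np
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections
)

connections.connect(host="127.0.0.1", port="19530")

dim = 128  # hypothetical vector dimension
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=dim),
])
coll = Collection("repro_ivf_sq8", schema)

# Insert random vectors and build the IVF_SQ8 index mentioned above.
vectors = np.random.random((1000, dim)).tolist()
coll.insert([vectors])
coll.flush()
coll.create_index("vec", {"index_type": "IVF_SQ8", "metric_type": "L2",
                          "params": {"nlist": 64}})
coll.load()

# On the machine in this report, the crash surfaced during search.
res = coll.search(
    data=np.random.random((1, dim)).tolist(),
    anns_field="vec",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
)
print(res)
```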
I think the machine is old and doesn't support AVX2 or AVX-512; that might be the reason.
We'll fix that soon.
In the meantime, you'd better use a machine with AVX support so Milvus will be much faster.
Ok, thank you for your advice.
Everything is fine on v2.4.6; you could compare against it to pin down the problem.
Looking into this @PwzXxm @chasingegg
The machine is missing the f16c flag. GCC checks that this flag is available and inlines some F16C instructions into the SSE code paths. This problem has been fixed in Knowhere PR #814; you can update Milvus to 2.4.11.
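If you want to verify what your CPU exposes before upgrading, here is a small sketch that reads the flags on Linux (assumes /proc/cpuinfo is available):

```python
# Check whether the CPU advertises the instruction-set flags Milvus may use.
# Linux-only: reads /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for flag in ("f16c", "avx", "avx2", "avx512f"):
    print(f"{flag}: {'present' if flag in flags else 'MISSING'}")
```

On the Celeron N5105 above, f16c, avx, and avx2 are all absent from the flags list, which matches the Illegal instruction crash.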
v2.4.11 is releasing soon. Thanks for opening the issue and letting us know.
Thank you for the efficient fix. I will try it after the new version is released.
I'd close this issue per the comments above. Please feel free to file a new one if it reproduces on the new version.
Is there an existing issue for this?
Environment
Current Behavior
Search error. It looks like the service has exited unexpectedly and restarted.
failed to search: loaded collection do not found any channel in target, may be in recovery: collection on recovering[collection=452203396426891450]
Expected Behavior
Search works normally.
Steps To Reproduce
Milvus Log
Anything else?
No response