JamesDeAntonis opened this issue 1 year ago
quick questions:
I didn't really conclude anything from this exercise, but after comparing logs from good times to logs from bad times, I saw these messages
in good not bad {'\"proxy search post execute reduce\"', '\"handle assign segment request\"', '\"Proxy Insert Execute done\"', 'reduceSearchResultData', '\"DescribeIndex enqueued\"', '\"Search Execute done\"', '\"Proxy Reduce Search Result\"', '\"Proxy segIDAssigner reduceSegReqs\"', '\"ScoreBasedBalancer has balanced collection, skip balancing in this round\"', '\"DescribeCollection enqueued\"', '\"updating segment memorySize\"', '\"Search post execute done\"', '\"DDNode sent delta timeTick\"', '\"assign segmentID for insert data success\"', '\"DescribeCollection done\"', '\"insertNode operator\"', '\"no shard cache for collection, try to get shard leaders from QueryCoord\"', '\"query msg can do\"', 'tr/waitTSafeTimeRecorder', 'tr/decodeSearchResults', '\"start do search\"', '\"Proxy dmTaskQueue popPChanStats\"', '\"tr/searchTask PostExecute\"', '\"Finish processing insert request in Proxy\"', '\"Search received\"', '\"tr/proxy execute search 441836845650673666\"', '\"received request to describe collection\"', '\"tr/proxy execute search 441836846358462466\"', '\"start do subsearch\"', 'tr/SubSearch', '\"Search enqueued\"', '\"Enqueue insert request in Proxy\"', '\"segments distribution\"', '\"translate output fields\"', '\"Proxy Insert PreExecute done\"', 'subSearchResultData', '\"Do insert done\"', '\"meta update: add allocation\"', '\"read target partitions on stream replica\"', '\"Filter invalid message in QueryNode\"', '\"Search done\"', '\"skip duplicated search result\"', '\"DescribeCollection received\"', '\"Proxy segIDAssigner reduceSegReqs after reduce\"', '\"Received SearchRequest\"', '\"send insert request to virtual channels\"', '\"search PreExecute done.\"', '\"Detail of insert request in Proxy\"', '\"all searches are finished or canceled\"', '\"DDNode receive insert messages\"', '\"Proxy segIDAssigner pickCanDoFunc\"', 'tr/reduceSearchResultData', 'tr/Search', '\"read target partitions on historical replica\"', '\"DescribeIndex received\"', '\"proxy receives one search result\"', '\"get shard leaders request received\"', '\"done to describe collection\"', '\"cluster segment distribution\"', '\"Start processing insert request in Proxy\"', '\"create query plan\"', '\"tr/proxy execute search 441836846030782465\"', 'tr/SearchShard', '\"success to assign segments\"', '\"do search on segment done\"', '\"meta update: add allocation - complete\"', '\"start do search on segment\"', '\"Updating segment number of rows\"', '\"search growing/sealed segments without indexes\"', 'Proxy::searchTask::PreExecute', '\"tr/proxy execute search 441836845283672066\"', '\"updating segment num row\"', '\"segment buffer status\"', '\"DescribeIndex done\"', '\"filter invalid insert message, collection mis-match\"'}
in bad not good {'\"node channel count is not much larger than average, skip reallocating\"', '\"Channel Balancer got new reAllocations:\"', '\"channel manager bg check balance\"'}
in both {'\"Get QueryNode component state done\"', 'tr/getEtcdMeta', 'Proxy.GetProxyMetrics', '\"IBN timetick log\"', '\"GC Tune done\"', '\"received get flushed segments request\"', 'tr/executeTxn', '\"IndexCoord describe index success\"', '\"rocksmq consume too slowly\"', '\"IndexCoord DescribeIndex\"', '\"current rates in proxy\"', 'tr/putEtcdMeta', '\"UpdateChannelCheckpoint success\"', '\"IndexCoord completeIndexInfo\"', '\"rocksmq produce too slowly\"', '\"getChannelCheckpoint for segment\"', '\"UpdateChannelCheckpoint done\"', '\"IndexCoord completeIndexInfo success\"', '\"IndexCoord get index states success\"', '\"tr/send msg to stream\"'}
The thing I'm most stuck on is why it has tended to happen at the same time every day. Is there any automatic job that may happen once a day?
One impactful design aspect to note is that we have a lot of collections (2k), each with only a few entries (most under 1024). They are all loaded into memory simultaneously, and any of them could be inserted into, queried, or deleted from at any time in prod. We did this so that each user has their own collection to query/insert into, which we thought would make the actual search faster, and we noticed Milvus supports an unlimited number of collections. Is this bad design in terms of Milvus? Should we have one collection instead, with each user identified by a key that we include on insert and filter on before querying? Is either approach more efficient?
This will also potentially scale to many more user collections, each with between 50 and several thousand entries.
Another thing to note is that I tried moving our deployment from milvus-standalone via Docker Compose on a local server to an EKS cluster with replicas on Kubernetes. I followed the guide on your page using the Helm chart and ran with a couple of replicas of each node and each coordinator.
I got the instance up and successfully migrated our quickly growing prod database into it, which is at around 150k entries across 2k collections. The data migration went through; however, after we tried to create a new collection, the dataNodes got stuck in a crash loop. Because there were 5 replicas, the deployment was still partially available, but nowhere near suitable for production use.
The issue seems a lot like https://github.com/milvus-io/milvus/issues/16114 and https://github.com/milvus-io/milvus/issues/12243, where I saw: "ProducerBlockedQuotaExceededError: Cannot create producer on topic with backlog quota exceeded"
I deleted the EKS cluster as I didn't want to spend money on it while it was down/not restarting properly, but I will try to spin it up and migrate the data again, probably tomorrow, and will share results. Are there any logs I should look out for, or a way to set up metrics?
One impactful design aspect to note is that we have a lot of collections (2k), each with only a few entries (most under 1024). They are all loaded into memory simultaneously, and any of them could be inserted into, queried, or deleted from at any time in prod. We did this so that each user has their own collection to query/insert into, which we thought would make the actual search faster, and we noticed Milvus supports an unlimited number of collections. Is this bad design in terms of Milvus? Should we have one collection instead, with each user identified by a key that we include on insert and filter on before querying? Is either approach more efficient?
This design itself is not bad, but Milvus has a limit on the number of collections, and too many collections can have many side effects.
In Milvus 2.2.9 we introduced the partition key: all users can share the same collection, and you search with a partition key. It can support up to millions of partitions, which would be good for your use case.
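For illustration, here is a minimal sketch of that approach with pymilvus, assuming a 2.2.9+ client with partition-key support; the field names, dimension, index parameters, and num_partitions below are illustrative and not taken from this thread:

```python
# Minimal sketch, assuming pymilvus >= 2.2.9 with partition-key support.
# Field names, dim, index params, and num_partitions are illustrative only.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # One shared collection for all users; the per-user key is the partition key.
    FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="shared collection keyed by user")
collection = Collection(name="chat_memory", schema=schema, num_partitions=64)

collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 128}},
)
collection.load()

# A search scoped to one user: filtering on the partition key limits the
# search to that user's partition rather than the whole collection.
results = collection.search(
    data=[[0.0] * 768],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    expr='user_id == "user_123"',
)
```

With this layout, each insert simply carries a user_id value instead of targeting a per-user collection, so the number of collections stays constant as users grow.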
Another thing to note is that I tried moving our deployment from milvus-standalone via Docker Compose on a local server to an EKS cluster with replicas on Kubernetes. I followed the guide on your page using the Helm chart and ran with a couple of replicas of each node and each coordinator.
I got the instance up and successfully migrated our quickly growing prod database into it, which is at around 150k entries across 2k collections. The data migration went through; however, after we tried to create a new collection, the dataNodes got stuck in a crash loop. Because there were 5 replicas, the deployment was still partially available, but nowhere near suitable for production use.
The issue seems a lot like milvus-io/milvus#16114 and milvus-io/milvus#12243, where I saw: "ProducerBlockedQuotaExceededError: Cannot create producer on topic with backlog quota exceeded"
I deleted the EKS cluster as I didn't want to spend money on it while it was down/not restarting properly, but I will try to spin it up and migrate the data again, probably tomorrow, and will share results. Are there any logs I should look out for, or a way to set up metrics?
Is this a Pulsar cluster? From the very top I thought it was a RocksMQ cluster. Are we on the same page?
It seems the MQ insert and consume are very slow. We have two concerns:
- Is this cluster running on SSD storage? What are the IOPS around 1 p.m.? It could be affected by other workloads on the same machine, or by compaction of RocksDB inside Milvus, but we need to figure out the load.
- Is there anything specific about how or when you do insertions, or are they distributed evenly?
I'll have to get back to you tomorrow about storage. What we have observed through htop is that Milvus starts taking up a lot of threads and CPU usage spikes heavily when the crash happens.
Insertion is evenly distributed across collections, both in terms of when we do it and which collection it goes to.
Version 2.2.9 has not been released yet, has it? Could you please point me toward the docs for how to use the partition key?
Is this a Pulsar cluster? From the very top I thought it was a RocksMQ cluster. Are we on the same page?
Sorry if I wasn't clear. So far I have tried storing the same set of vector entries three different ways to try to remedy/debug the problem:
- A server running milvus-standalone (eventually hits the RocksMQ error every day).
- Another identical server running milvus-standalone (also eventually hits the RocksMQ error every day, but at a different time). This shows it's not specific to any background task running on one particular server.
- Milvus running on an EKS cluster. This has the error I described in my latest message, where it's now a Pulsar error.
The fact that it was still failing on remote hardware makes me think it's an issue with our workload. Could it be that too many collections are being loaded into memory?
Oh, and to give some context on reads/writes: the DB is being used for a chatbot, so inserts/queries to the same user's collection would happen at most every couple of seconds, whenever the user writes a message.
Ideally, your disk should reach over 500 IOPS and below 10ms for the 99th percentile fsync latency. Read the etcd Docs for more detailed requirements.
```
→ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=2200m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=4296KiB/s][w=1912 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=2996438: Fri Jun 2 12:04:01 2023
  write: IOPS=1832, BW=4116KiB/s (4215kB/s)(2200MiB/547283msec); 0 zone resets
    clat (nsec): min=860, max=34410k, avg=116677.13, stdev=469182.86
     lat (nsec): min=910, max=34410k, avg=116767.54, stdev=469182.66
    clat percentiles (nsec):
     |  1.00th=[     1464],  5.00th=[     1736], 10.00th=[     1768],
     | 20.00th=[     1848], 30.00th=[     1944], 40.00th=[     3664],
     | 50.00th=[   105984], 60.00th=[   134144], 70.00th=[   136192],
     | 80.00th=[   193536], 90.00th=[   288768], 95.00th=[   292864],
     | 99.00th=[   296960], 99.50th=[   301056], 99.90th=[  4177920],
     | 99.95th=[  4358144], 99.99th=[ 21626880]
   bw (  KiB/s): min= 2066, max= 4622, per=99.98%, avg=4115.29, stdev=532.45, samples=1094
   iops        : min=  920, max= 2058, avg=1832.39, stdev=237.06, samples=1094
  lat (nsec)   : 1000=0.05%
  lat (usec)   : 2=31.79%, 4=10.74%, 10=1.25%, 20=0.02%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=42.43%, 500=13.50%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.09%, 10=0.08%, 20=0.02%, 50=0.02%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=371, max=81595, avg=427.83, stdev=456.73
    sync percentiles (usec):
     |  1.00th=[  379],  5.00th=[  383], 10.00th=[  383], 20.00th=[  383],
     | 30.00th=[  383], 40.00th=[  388], 50.00th=[  388], 60.00th=[  388],
     | 70.00th=[  388], 80.00th=[  392], 90.00th=[  400], 95.00th=[  709],
     | 99.00th=[  717], 99.50th=[  725], 99.90th=[ 4424], 99.95th=[ 4752],
     | 99.99th=[22676]
  cpu          : usr=0.39%, sys=3.02%, ctx=3572066, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1002985,0,0 short=1002985,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4116KiB/s (4215kB/s), 4116KiB/s-4116KiB/s (4215kB/s-4215kB/s), io=2200MiB (2307MB), run=547283-547283msec

Disk stats (read/write):
  nvme1n1: ios=563002/2006929, merge=0/889, ticks=111993/423137, in_queue=22116, util=100.00%
```
2. Is there anything we can do about that in the short term? Any settings we can change?
3. We will change our design, though, to use partition keys. It appears you cannot add them dynamically, so we would, say, set 100k partition keys when we create a collection, start using them for new users, and create a new collection after 100k users (roughly as sketched below). Does this make sense/sound efficient?
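For what it's worth, a rough sketch of the rotation scheme described in point 3; the collection names, the 100k threshold, and the helper functions are all hypothetical, not an established Milvus pattern:

```python
# Hypothetical sketch of the proposed rotation scheme: bucket users into a new
# collection every USERS_PER_COLLECTION users, and use the user id as the
# partition key within that collection. Names and thresholds are illustrative.
USERS_PER_COLLECTION = 100_000


def collection_for_user(user_index: int) -> str:
    """Map a monotonically assigned user index to its bucket collection."""
    bucket = user_index // USERS_PER_COLLECTION
    return f"chat_memory_{bucket}"


def search_expr_for_user(user_id: str) -> str:
    """Filter expression that scopes a search to one user's partition key."""
    return f'user_id == "{user_id}"'
```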
Is there any limit on the maximum number of partition keys?
Is there any limit on the maximum number of partition keys?
No, it can support up to millions of partition keys.
Is there an existing issue for this?
What is your question?
Milvus is working just fine for our team, with the one exception being that around 1 p.m. every day, queries get insanely slow (like 15 minutes per query), and that slowness persists for about an hour before returning to normal. Why does this happen, and is there some config we can set to fix this? Thanks!
Anything else?
FYI we've been running Milvus as a standalone service