JamesDeAntonis opened this issue 1 year ago
quick questions:
I didn't really conclude anything from this exercise, but after comparing logs from good times to logs from bad times, I saw these messages
in good not bad {'\"proxy search post execute reduce\"', '\"handle assign segment request\"', '\"Proxy Insert Execute done\"', 'reduceSearchResultData', '\"DescribeIndex enqueued\"', '\"Search Execute done\"', '\"Proxy Reduce Search Result\"', '\"Proxy segIDAssigner reduceSegReqs\"', '\"ScoreBasedBalancer has balanced collection, skip balancing in this round\"', '\"DescribeCollection enqueued\"', '\"updating segment memorySize\"', '\"Search post execute done\"', '\"DDNode sent delta timeTick\"', '\"assign segmentID for insert data success\"', '\"DescribeCollection done\"', '\"insertNode operator\"', '\"no shard cache for collection, try to get shard leaders from QueryCoord\"', '\"query msg can do\"', 'tr/waitTSafeTimeRecorder', 'tr/decodeSearchResults', '\"start do search\"', '\"Proxy dmTaskQueue popPChanStats\"', '\"tr/searchTask PostExecute\"', '\"Finish processing insert request in Proxy\"', '\"Search received\"', '\"tr/proxy execute search 441836845650673666\"', '\"received request to describe collection\"', '\"tr/proxy execute search 441836846358462466\"', '\"start do subsearch\"', 'tr/SubSearch', '\"Search enqueued\"', '\"Enqueue insert request in Proxy\"', '\"segments distribution\"', '\"translate output fields\"', '\"Proxy Insert PreExecute done\"', 'subSearchResultData', '\"Do insert done\"', '\"meta update: add allocation\"', '\"read target partitions on stream replica\"', '\"Filter invalid message in QueryNode\"', '\"Search done\"', '\"skip duplicated search result\"', '\"DescribeCollection received\"', '\"Proxy segIDAssigner reduceSegReqs after reduce\"', '\"Received SearchRequest\"', '\"send insert request to virtual channels\"', '\"search PreExecute done.\"', '\"Detail of insert request in Proxy\"', '\"all searches are finished or canceled\"', '\"DDNode receive insert messages\"', '\"Proxy segIDAssigner pickCanDoFunc\"', 'tr/reduceSearchResultData', 'tr/Search', '\"read target partitions on historical replica\"', '\"DescribeIndex received\"', '\"proxy receives one search result\"', '\"get shard leaders request received\"', '\"done to describe collection\"', '\"cluster segment distribution\"', '\"Start processing insert request in Proxy\"', '\"create query plan\"', '\"tr/proxy execute search 441836846030782465\"', 'tr/SearchShard', '\"success to assign segments\"', '\"do search on segment done\"', '\"meta update: add allocation - complete\"', '\"start do search on segment\"', '\"Updating segment number of rows\"', '\"search growing/sealed segments without indexes\"', 'Proxy::searchTask::PreExecute', '\"tr/proxy execute search 441836845283672066\"', '\"updating segment num row\"', '\"segment buffer status\"', '\"DescribeIndex done\"', '\"filter invalid insert message, collection mis-match\"'}
in bad not good {'\"node channel count is not much larger than average, skip reallocating\"', '\"Channel Balancer got new reAllocations:\"', '\"channel manager bg check balance\"'}
in both {'\"Get QueryNode component state done\"', 'tr/getEtcdMeta', 'Proxy.GetProxyMetrics', '\"IBN timetick log\"', '\"GC Tune done\"', '\"received get flushed segments request\"', 'tr/executeTxn', '\"IndexCoord describe index success\"', '\"rocksmq consume too slowly\"', '\"IndexCoord DescribeIndex\"', '\"current rates in proxy\"', 'tr/putEtcdMeta', '\"UpdateChannelCheckpoint success\"', '\"IndexCoord completeIndexInfo\"', '\"rocksmq produce too slowly\"', '\"getChannelCheckpoint for segment\"', '\"UpdateChannelCheckpoint done\"', '\"IndexCoord completeIndexInfo success\"', '\"IndexCoord get index states success\"', '\"tr/send msg to stream\"'}
The thing I'm most stuck on is why it has tended to happen at the same time every day. Is there any automatic job that may happen once a day?
One impactful design aspect to note is that we have a lot of collections (2k), each with only a few entries (most under 1024). They are all loaded into memory simultaneously, and any of them could be inserted into, queried, or deleted from at any time in prod. We did this so that each user has their own collection to query/insert into, which we thought would make the actual search faster, and we noticed Milvus supports an unlimited number of collections. Is this bad design in terms of Milvus? Should we have one collection instead, with each user identified by a key that we include on insert and filter on before querying? Is either approach more efficient?
This will also potentially scale to many more user collections, each with between 50 and several thousand entries.
Another thing to note is that I tried moving our deployment from milvus-standalone via Docker Compose on a local server to an EKS cluster with replicas on Kubernetes. I followed the guide on your page using the Helm chart and ran with a couple of replicas of each node and each coordinator.
I got the instance up and successfully migrated our quickly growing prod database into it, which is at around 150k entries across 2k collections. The data migration went through; however, after we tried to create a new collection, the dataNodes got stuck in a crash loop. Because there were 5 replicas, the deployment was still partially available, but nowhere near suitable for production use.
The issue seems a lot like https://github.com/milvus-io/milvus/issues/16114 and https://github.com/milvus-io/milvus/issues/12243, where I saw: "ProducerBlockedQuotaExceededError: Cannot create producer on topic with backlog quota exceeded"
I deleted the EKS cluster as I didn't want to spend money on it while it was down/not restarting properly, but I will try to spin it up and migrate the data again, probably tomorrow, and will share results. Are there any logs I should look out for, or a way to set up metrics?
One impactful design aspect to note is that we have a lot of collections (2k), each with only a few entries (most under 1024). They are all loaded into memory simultaneously, and any of them could be inserted into, queried, or deleted from at any time in prod. We did this so that each user has their own collection to query/insert into, which we thought would make the actual search faster, and we noticed Milvus supports an unlimited number of collections. Is this bad design in terms of Milvus? Should we have one collection instead, with each user identified by a key that we include on insert and filter on before querying? Is either approach more efficient?
This design itself is not bad, but Milvus has a limit on the number of collections, and too many collections can have many side effects.
In Milvus 2.2.9 we introduced the partition key: all users can share the same collection, and you search with a partition key. It can support up to millions of partitions, which would be good for your use case.
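For illustration, here is a minimal sketch of that approach with pymilvus, assuming a 2.2.9+ client with partition-key support; the field names, dimension, index parameters, and num_partitions below are illustrative and not taken from this thread:

```python
# Minimal sketch, assuming pymilvus >= 2.2.9 with partition-key support.
# Field names, dim, index params, and num_partitions are illustrative only.
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # One shared collection for all users; the per-user key is the partition key.
    FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="shared collection keyed by user")
collection = Collection(name="chat_memory", schema=schema, num_partitions=64)

collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 128}},
)
collection.load()

# A search scoped to one user: filtering on the partition key limits the
# search to that user's partition rather than the whole collection.
results = collection.search(
    data=[[0.0] * 768],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    expr='user_id == "user_123"',
)
```

With this layout, each insert simply carries a user_id value instead of targeting a per-user collection, so the number of collections stays constant as users grow.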
Another thing to note is that I tried moving our deployment from milvus-standalone via Docker Compose on a local server to an EKS cluster with replicas on Kubernetes. I followed the guide on your page using the Helm chart and ran with a couple of replicas of each node and each coordinator.
I got the instance up and successfully migrated our quickly growing prod database into it, which is at around 150k entries across 2k collections. The data migration went through; however, after we tried to create a new collection, the dataNodes got stuck in a crash loop. Because there were 5 replicas, the deployment was still partially available, but nowhere near suitable for production use.
The issue seems a lot like milvus-io/milvus#16114 and milvus-io/milvus#12243, where I saw: "ProducerBlockedQuotaExceededError: Cannot create producer on topic with backlog quota exceeded"
I deleted the EKS cluster as I didn't want to spend money on it while it was down/not restarting properly, but I will try to spin it up and migrate the data again, probably tomorrow, and will share results. Are there any logs I should look out for, or a way to set up metrics?
Is this a Pulsar cluster? From the very top I thought it was a RocksMQ cluster. Are we on the same page?
It seems the MQ insert and consume are very slow. We have two concerns:
- Is this cluster running on SSD storage? What are the IOPS around 1 p.m.? It could be affected by other workloads on the same machine, or by compaction of RocksDB inside Milvus, but we need to figure out the load.
- Is there anything specific about how or when you do insertions, or are they distributed evenly?
I'll have to get back to you tomorrow about storage. What we have observed through htop is that Milvus starts taking up a lot of threads and CPU usage spikes heavily when the crash happens.
Insertion is evenly distributed across collections, both in terms of when we do it and which collection it goes to.
Version 2.2.9 has not been released yet, has it? Could you please point me toward the docs for how to use the partition key?
Is this a Pulsar cluster? From the very top I thought it was a RocksMQ cluster. Are we on the same page?
Sorry if I wasn't clear. So far I have tried storing the same set of vector entries three different ways to try to remedy/debug the problem:
- A server running milvus-standalone (eventually hits the RocksMQ error every day).
- Another identical server running milvus-standalone (also eventually hits the RocksMQ error every day, but at a different time). This shows it's not specific to any background task running on one particular server.
- Milvus running on an EKS cluster. This has the error I described in my latest message, where it's now a Pulsar error.
The fact that it was still failing on remote hardware makes me think it's an issue with our workload. Could it be that too many collections are being loaded into memory?
Oh, and to give some context on reads/writes: the DB is being used for a chatbot, so inserts/queries to the same user's collection would happen at most every couple of seconds, whenever the user writes a message.
Ideally, your disk should reach over 500 IOPS and below 10ms for the 99th percentile fsync latency. Read the etcd Docs for more detailed requirements.
```
→ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=2200m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=4296KiB/s][w=1912 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=2996438: Fri Jun 2 12:04:01 2023
  write: IOPS=1832, BW=4116KiB/s (4215kB/s)(2200MiB/547283msec); 0 zone resets
    clat (nsec): min=860, max=34410k, avg=116677.13, stdev=469182.86
     lat (nsec): min=910, max=34410k, avg=116767.54, stdev=469182.66
    clat percentiles (nsec):
     |  1.00th=[     1464],  5.00th=[     1736], 10.00th=[     1768],
     | 20.00th=[     1848], 30.00th=[     1944], 40.00th=[     3664],
     | 50.00th=[   105984], 60.00th=[   134144], 70.00th=[   136192],
     | 80.00th=[   193536], 90.00th=[   288768], 95.00th=[   292864],
     | 99.00th=[   296960], 99.50th=[   301056], 99.90th=[  4177920],
     | 99.95th=[  4358144], 99.99th=[ 21626880]
   bw (  KiB/s): min= 2066, max= 4622, per=99.98%, avg=4115.29, stdev=532.45, samples=1094
   iops        : min=  920, max= 2058, avg=1832.39, stdev=237.06, samples=1094
  lat (nsec)   : 1000=0.05%
  lat (usec)   : 2=31.79%, 4=10.74%, 10=1.25%, 20=0.02%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=42.43%, 500=13.50%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.09%, 10=0.08%, 20=0.02%, 50=0.02%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=371, max=81595, avg=427.83, stdev=456.73
    sync percentiles (usec):
     |  1.00th=[  379],  5.00th=[  383], 10.00th=[  383], 20.00th=[  383],
     | 30.00th=[  383], 40.00th=[  388], 50.00th=[  388], 60.00th=[  388],
     | 70.00th=[  388], 80.00th=[  392], 90.00th=[  400], 95.00th=[  709],
     | 99.00th=[  717], 99.50th=[  725], 99.90th=[ 4424], 99.95th=[ 4752],
     | 99.99th=[22676]
  cpu          : usr=0.39%, sys=3.02%, ctx=3572066, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1002985,0,0 short=1002985,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4116KiB/s (4215kB/s), 4116KiB/s-4116KiB/s (4215kB/s-4215kB/s), io=2200MiB (2307MB), run=547283-547283msec

Disk stats (read/write):
  nvme1n1: ios=563002/2006929, merge=0/889, ticks=111993/423137, in_queue=22116, util=100.00%
```
2. Is there anything we can do about that in the short term? Any settings we can change?
3. We will change our design, though, to use partition keys. It appears you cannot add them dynamically, so we would, say, set 100k partition keys when we create a collection, start using them for new users, and create a new collection after 100k users (roughly as sketched below). Does this make sense/sound efficient?
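For what it's worth, a rough sketch of the rotation scheme described in point 3; the collection names, the 100k threshold, and the helper functions are all hypothetical, not an established Milvus pattern:

```python
# Hypothetical sketch of the proposed rotation scheme: bucket users into a new
# collection every USERS_PER_COLLECTION users, and use the user id as the
# partition key within that collection. Names and thresholds are illustrative.
USERS_PER_COLLECTION = 100_000


def collection_for_user(user_index: int) -> str:
    """Map a monotonically assigned user index to its bucket collection."""
    bucket = user_index // USERS_PER_COLLECTION
    return f"chat_memory_{bucket}"


def search_expr_for_user(user_id: str) -> str:
    """Filter expression that scopes a search to one user's partition key."""
    return f'user_id == "{user_id}"'
```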
Is there any limit on the maximum number of partition keys?
Is there any limit on the maximum number of partition keys?
No, it can support up to millions of partition keys.
Is there an existing issue for this?
What is your question?
Milvus is working just fine for our team, with the one exception being that around 1 p.m. every day, queries get insanely slow (like 15 minutes per query), and that slowness persists for about an hour before returning to normal. Why does this happen, and is there some config we can set to fix this? Thanks!
Anything else?
FYI we've been running Milvus as a standalone service