Open shubh9194 opened 1 month ago
Do you use vespa-stop-services
when you restart? Read the doc about vespa-stop-services and the prepareRestart command, which might reduce startup time. If you do that and see that prepareRestart
times out, Vespa 8.406.26 and later have a higher timeout for this command, as we have seen that content nodes with a lot of data might need more time for this.
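For reference, a minimal sketch of that shutdown path, to be run on the content node (invoking prepareRestart explicitly via vespa-proton-cmd is an assumption on my part; vespa-stop-services already triggers it as part of its normal shutdown):

# Ask proton to flush in-memory structures so the next start needs little transaction log replay.
# vespa-stop-services does this itself; running it first only makes the step explicit and shows whether it times out.
vespa-proton-cmd --local prepareRestart
# Then stop all Vespa services and let the command run to completion.
vespa-stop-services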
@hmusum I am updating the Vespa version to the one you suggested. In the meantime, can you please help with this as well? Reindexing is also taking a lot of time, and sometimes it seems stuck with no visible progress. In the content node logs I don't see any errors, but I do see a lot of warnings like the one below.
Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)
Let me know if you need more logs.
Sounds like you have a setup that cannot handle the feed load.
Hardware, local or remote disk, number of documents, number and type of indexes, paged attributes, tensors/embeddings, and HNSW indexes are some of the things that might affect this. You might want to look into https://docs.vespa.ai/en/performance/sizing-feeding.html to find out what you can do about this, if anything. Hard to say without knowing more about the application, schema details, hardware etc.
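If it helps, one quick way to see whether the content node is under resource pressure while feeding is to pull metrics from the metrics proxy on that host (port 19092 is the default; the exact metric names vary by version, so treat the grep pattern below as an assumption):

# Dump a metrics snapshot from the metrics proxy on the content node host
curl -s http://localhost:19092/metrics/v2/values > metrics.json
# Look at proton resource usage (memory/disk), a common indicator of feed bottlenecks
grep -o '"content\.proton\.resource_usage\.[a-z_.]*"' metrics.json | sort -u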
@hmusum We have a single container node and a single content node on different hosts. The content node host has 62 GB RAM and around 400 GB disk. Right now we have deployed around 90 schemas in the cluster, of which only around 30 have documents so far. Total documents are around 1.5M, and one schema holds more than 70% of them.
Will reducing the number of schemas help?
Also, about a week back reindexing succeeded once after a node restart, and around that time I found some deadlock-related entries in the logs:
proton.deadlock.detector warning Thread Modified bucket checker 0x10308644200 has gone 296151 milliseconds without registering a tick. Global slack not expended yet. warning for now. Attempting to dump stack of thread... Backtrace:
/lib64/libc.so.6(syscall+0x1d) [0x7ff21aa919bd]
/lib64/libstdc++.so.6(std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+0xe1) [0x7ff21b4777c1]
/opt/vespa/sbin/vespa-proton-bin(proton::PersistenceEngine::getModifiedBuckets(document::BucketSpace) const+0x454) [0x5ffe24]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::requestModifiedBucketsFromProvider(document::BucketSpace)+0x32) [0x7ff229234932]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::tick()+0x1a3) [0x7ff229234d3]
/opt/vespa/lib64/libstorage.so(storage::ModifiedBucketChecker::run(storage::framework::ThreadHandle&)+0xdb) [0x7ff229234cb]
/opt/vespa/lib64/libstorage.so(storage::framework::defaultimplementation::ThreadImpl::run()+0x20) [0x7ff22937b390]
/lib64/libstdc++.so.6(+0xc2b23) [0x7ff21b479b23]
/opt/vespa/lib64/vespa/malloc/libvespamalloc.so(+0x94de) [0x7ff22b2f64de]
/lib64/libpthread.so.0(+0x81ca)
/lib64/libc.so.6(clone+0x43) [0x7ff21aa91e73]
Also, when the content node is restarted, the logs show that transaction log replay progress is very slow. We also see high CPU usage that lasts until the content node is up; CPU usage jumps from around 10% to 60% or even 80% at times, along with a high page-out rate. Let me know if you need any more logs or info.
@hmusum @kkraune We have reduced the number of schemas to less than 50 and updated the Vespa version to 8.406.26. Previously, to keep the cluster working, we had used the reindex API to stop reindexing of one of the schemas. After upgrading, I started reindexing again, but it always gets stuck and we don't see any progress. During that time, CPU usage and page-out rate are very high. And if we stop the Vespa process and the podman container at that point and try to restart them, the Vespa services take a long time to come back up, as transaction log replay is very slow for the schema that has the most documents.
Reindexing has not completed for this schema in the last 2 weeks, and the only way to keep ingestions working was to stop reindexing using the reindex API. We don't see any errors in the logs, only lots of warnings like: Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000), id:default:docid, timestamp 1726503690000164, size 19266). processing time: 58.0 s (>=5.0) (Repeated 107 times since 1726503752.512184)
RAM usage on the host where the content node runs is around 60%, so we do have significant RAM available as well.
How do we find where the actual bottleneck is during reindexing?
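(For reference, reindexing progress per document type can be read from the config server, and the slow-Put warnings can be counted over time to see whether proton is making headway. The tenant/application names below are the defaults for a self-hosted deployment and are an assumption about this setup:)

# Reindexing status per cluster/document type, from the config server (default port 19071)
curl -s http://localhost:19071/application/v2/tenant/default/application/default/environment/prod/region/default/instance/default/reindexing
# Count the slow Put warnings in the Vespa log to see whether the rate changes over time
grep -c 'Slow processing of message Put' /opt/vespa/logs/vespa/vespa.log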
To avoid tx replay on startup, let vespa-stop-services complete before stopping the container.
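A minimal sketch of that ordering, assuming the services run inside a podman container named vespa (the container name and install path are assumptions):

# 1. Stop Vespa services inside the container and wait for the command to return
podman exec vespa /opt/vespa/bin/vespa-stop-services
# 2. Only then stop the container; stopping it earlier interrupts the flush done by prepareRestart
#    and forces a long transaction log replay on the next start
podman stop vespa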
How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?
@bratseth We stop the Vespa services before stopping the container. Will confirm whether the Vespa services are stopped properly.
But can you help with why reindexing is getting stuck? All ingestions are timing out as well unless we stop reindexing using the reindex API. During reindexing the logs are filled with warnings like: Warning searchnode proton.persistence.util Slow processing of message Put (BucketId(0x300000000000000)
Is it reindexing that is the problem, or is it replaying the tx log at startup?
How many documents are in this schema, do you use HNSW indexes (how many), and are the vector attributes paged?
@bratseth We observed that this issue started with reindexing, and the tx log startup issue is only seen when the node is restarted during reindexing. At any other time, the node comes back within a few minutes after a restart.
So we think reindexing is the root cause of all this and want to understand what the bottleneck for reindexing is and why it gets stuck for one schema. That schema has around 1.36M docs. And yes, this schema has more than 2000 index fields and a good number of paged attributes.
Are the fields that are HNSW indexed paged?
@bratseth No, none of the HNSW-indexed fields are paged.
Describe the bug
We have a cluster with 1 container node and 1 content node and almost 85 schemas. We have observed that when the content node is restarted, for whatever reason, it takes a long time to come back up. The cluster also stores and reads its data from an encrypted path.
vespa version : 8.388.11
Expected behavior
The content node should be back up within a few minutes at most, not take hours.
Environment:
Vespa version: 8.388.11
Additional context
Will be sharing the logs separately.