vespa-engine / vespa

Fix doc: Content nodes going offline with "Cannot allocate memory" seen in logs #22415

Closed: nehajatav closed this issue 2 years ago

nehajatav commented 2 years ago

Describe the bug
We are running the feeding client against a healthy cluster, and after some time the feeding stops with a timeout error. Upon checking the cluster controller, we find that the nodes go from Up to Down state continuously, with a significantly reduced bucket count, until the cluster is down.

  1. Upon checking the logs every time this occurs, we see a log line "Cannot allocate memory"
  2. Upon checking the metrics, we see that the reported memory usage is <10%, even though the memory used (24g rss) is greater than 50% of the memory allocated to the pod (32g limits.memory)

To Reproduce
Steps to reproduce the behavior:

  1. Bring up a multi-node cluster on a k8s cluster with very high-capacity k8s nodes.
  2. Run the feeder client for a few hours.
  3. The feeder stops and the cluster starts going down, with logs suggesting 'cannot allocate memory'.
  4. Run the metrics API.

Expected behavior
Feeding continues and the content nodes stay up instead of cycling between Up and Down until the cluster is down.

Screenshots
(Image attached to the original issue; not reproduced here.)

Environment:
- OS: Docker image vespa:7.559.12
- Infrastructure: Kubernetes, version Major: "1", Minor: "21", GoVersion: "go1.16.6", Compiler: "gc", Platform: "linux/amd64"
- Content pod: Limits: cpu: 8, memory: 32G; Requests: cpu: 4, memory: 32G
- Container pod: Limits: cpu: 2, memory: 8G; Requests: cpu: 2, memory: 8G
- Config pod: Limits: cpu: 3, memory: 16G; Requests: cpu: 3, memory: 16G

hosts.xml and services.xml: same as in this comment: https://github.com/vespa-engine/vespa/issues/22315#issuecomment-1113074776

Vespa version 7.559.12 compiled with go1.16.13 on linux/amd64

Additional context

Metrics from curl http://localhost:19107/state/v1/metrics on vespa-content-0:
- Line 23: "name": content.proton.resource_usage.memory_usage.total = 0.05039729495321713
- Line 29: "name": content.proton.resource_usage.memory_usage.total_utilization = 0.05599699439246349
- Line 35: "name": content.proton.resource_usage.memory_usage.transient = 0.001558360468134989

From http://vespa-container.mynamespace.container.site/metrics/v2/values under hostname: "vespa-content-0.vespa-internal.mynamespace.svc.cluster.local", name: "vespa.searchnode", ...:
- Line 15: memory_virt: 28203995136
- Line 16: memory_rss: 27313659904

content.proton.resource_usage.memory.average: 0.0488389344851,

name: "vespa.distributor", ... memory_virt: 375705600, memory_rss: 166846464,

name: "vespa.metricsproxy-container", ... memory_virt: 974213120, memory_rss: 462393344,

name: "vespa.config-sentinel", ... memory_virt: 0, memory_rss: 0,

name: "vespa.logd", ... memory_virt: 129478656, memory_rss: 16900096,

Logs on content node:

1651036765.278889 vespa-content-14.vespa-internal.mynamespace.svc.cluster.local 27681/57666 searchnode proton.transactionlog.domainpart error Short Read. Got only 3797 of 54651 bytes at pos 46620971 trying to read packet blob. File 'tls/tls/pwmdocumentstorelexlexDocument/schemalexDocument-0000000002732460' of size 46624768 has last error of 'Cannot allocate memory'. Truncate to 46620966 and continue

1651547994.844632 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 33288/59024 distributor vds.distributor.operation.idealstate.remove info Remove operation on bucket BucketId(0x4000000000001b76) failed. This distributor has already removed the bucket from the bucket database, so it is not possible to retry this operation. Failure code: ReturnCode(CONNECTION_ERROR, A connection error occurred for 'storage/cluster.lexdoc/storage/13'; (RPC) Connection error)

1651550591.011482 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 507645/29963 searchnode proton.searchlib.docstore.logdatastore warning We detected an empty idx file for part '/opt/vespa/var/db/vespa/search/cluster.lexdoc/n0/documents/schemalexDocument/2.notready/summary/1651548790702237710'. Erasing it.

1651550591.013684 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 507645/62653 searchnode proton.searchlib.docstore.logdatastore warning We detected an empty idx file for part '/opt/vespa/var/db/vespa/search/cluster.lexdoc/n0/documents/schemalexDocument/1.removed/summary/1651548790712490860'. Erasing it.

1650974130.589212 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 9418/54982 searchnode proton.persistence.util warning Slow processing of message DeleteBucketCommand(BucketId(0x4800000000020cfe)) Reasons to start: . Processing time: 5.1 s (>=5.0 s)

1651578437.817338 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 33292/35 metricsproxy-container Container.ai.vespa.metricsproxy.service.HttpMetricFetcher warning Unable to get response from service 'searchnode:33287:RESTARTING:lexdoc/search/cluster.lexdoc/0': Connect to http://localhost:19107 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused

1651550591.017910 vespa-content-0.vespa-internal.mynamespace.svc.cluster.local 507645/29963 searchnode proton.searchlib.docstore.logdatastore warning Removing dangling file '/opt/vespa/var/db/vespa/search/cluster.lexdoc/n0/documents/schemalexDocument/2.notready/summary/1651548790702237710.dat'

Logs on cluster controller:

1651226664.450724 vespa-0.vespa-internal.mynamespace.svc.cluster.local 1346/47 container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.EventLog info Added node only event: Event: storage.13: Altered node state in cluster state from 'D: Connection error: Closed at other end. (Node or switch likely shut down)' to 'U, t 1651226660'

1651226664.459546 vespa-0.vespa-internal.mynamespace.svc.cluster.local 1346/47 container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.EventLog info Cluster event type SYSTEMSTATE @1651226664450: New cluster state version 105. Change from last: version: 104 => 105, cluster: Down => Up, storage: 13: [Down => Up, startTimestamp: 0 => 1651226660, description: Connection error: Closed at other end. (Node or switch likely shut down) => ]

kkraune commented 2 years ago

Thanks for reporting. I understand this issue as: the node is actually out of memory (given the 27G rss on a 32G pod), but this metric is only at 5%:

content.proton.resource_usage.memory_usage.total 0.05039729495321713

Maybe this metric is calculated using the host's memory in some cases, @geirst?

baldersheim commented 2 years ago

That is correct; unless it is configured, it will use `return sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE);`
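
For context, a minimal standalone sketch of what that expression samples is shown below. This is an illustration only, not the actual Vespa code (which lives in hw_info_sampler.cpp). Inside a Kubernetes pod, sysconf reports the k8s node's physical memory rather than the pod's limits.memory, which is consistent with the ~5% utilization metric seen while the pod itself is nearly full.

```cpp
// Illustrative sketch only: sample "physical memory" the way the quoted
// expression does. Inside a container this returns the host's memory,
// not the cgroup/pod limit.
#include <unistd.h>
#include <cstdio>
#include <cstdint>

int main() {
    uint64_t physical = static_cast<uint64_t>(sysconf(_SC_PHYS_PAGES)) *
                        static_cast<uint64_t>(sysconf(_SC_PAGESIZE));
    // Example (assumed numbers): on a 32G pod scheduled on a ~512G k8s node,
    // a 27G rss is only ~5% of the sampled value, matching the metric above.
    std::printf("sampled physical memory: %llu bytes\n",
                static_cast<unsigned long long>(physical));
    return 0;
}
```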

kkraune commented 2 years ago

Thanks @baldersheim ! I'll look into documenting this somewhere.

The content.proton.resource_usage.memory_usage.total metric cannot be used in your current environment, @nehajatav, to measure how loaded the content node is.

The log message "Cannot allocate memory" seems accurate though, so you now know approximately how many documents such a pod can take before it is full; you can use that in your sizing to make sure you do not overload nodes.

I will transform this into a documentation request, then.

nehajatav commented 2 years ago

I don't understand: does this mean that the feed block doesn't work with Kubernetes?

That is correct, unless it is configured it will use

How can we configure this to use the true (pod) memory, or a hardcoded memory size, instead of a percentage of the host's total memory?

kkraune commented 2 years ago

I don't understand: does this mean that the feed block doesn't work with Kubernetes?

Correct, we have not tested it with Kubernetes. Our recommendation is to manage this yourselves, as you now know the pod's capacity.

nehajatav commented 2 years ago

Our recommendation is to manage this yourselves, as you now know the pod's capacity

Is there a way to block feed based on absolute values instead of relative ones? We only know and can control the pod's capacity; the underlying k8s node capacity may vary.

nehajatav commented 2 years ago

Which cgroup property does Vespa honour for memory limitation? When we set limits.memory to 64g, I see the below output on the pod:

bash-4.2$ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
64000000000
bash-4.2$

baldersheim commented 2 years ago

In my comment further up you see the code we use for getting the memory size. That code probably predates cgroups/Docker. Feel free to suggest a better approach; you find the code at https://github.com/vespa-engine/vespa/blob/master/searchcore/src/vespa/searchcore/proton/common/hw_info_sampler.cpp#L48
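
As one possible answer to "suggest a better approach", the sketch below shows a cgroup-aware fallback: read the container's memory limit (cgroup v1 path, as on the pod above; cgroup v2 exposes /sys/fs/cgroup/memory.max instead) and take the smaller of that and the sysconf value. This is an editor's illustration of the idea under discussion, not code that Vespa ships.

```cpp
// Sketch only: prefer the cgroup memory limit over sysconf when it exists.
#include <unistd.h>
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <limits>

uint64_t sysconf_memory() {
    return static_cast<uint64_t>(sysconf(_SC_PHYS_PAGES)) *
           static_cast<uint64_t>(sysconf(_SC_PAGESIZE));
}

uint64_t cgroup_memory_limit() {
    // cgroup v1; a pod with limits.memory: 64G shows 64000000000 here.
    std::ifstream in("/sys/fs/cgroup/memory/memory.limit_in_bytes");
    uint64_t limit = 0;
    if (in >> limit && limit > 0) {
        return limit;
    }
    return std::numeric_limits<uint64_t>::max(); // no usable limit found
}

int main() {
    // Effective budget: what the host has, capped by what the cgroup allows.
    uint64_t effective = std::min(sysconf_memory(), cgroup_memory_limit());
    std::cout << "effective memory budget: " << effective << " bytes\n";
    return 0;
}
```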

nehajatav commented 2 years ago

Thanks @baldersheim! I see this mentioned in proton.def. Is there a way to update the property below via application deploy, or do we need to tweak it in the Dockerfile/ENTRYPOINT? If it is the latter, is /opt/vespa/share/vespa/configdefinitions/vespa.config.search.core.proton.def the correct place to fiddle with?

The size of physical memory (in bytes) available to proton.

If set to 0, the memory size is sampled as _SC_PHYS_PAGES * _SC_PAGESIZE by using sysconf to do the lookup.

The memory size is used when determining if feed should be blocked in writefilter.

hwinfo.memory.size long default = 0 restart

baldersheim commented 2 years ago

The best solution is if you can figure out how we can programmatically get the correct value, i.e. what to use instead of sysconf(). It is possible to do a config override, but that is not preferable.
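
For readers wondering what such a config override could look like, here is a hypothetical services.xml fragment. It assumes Vespa's generic config-override mechanism can target the proton definition quoted above and that the override belongs on the content cluster; the element placement and the value (64G here, chosen to match limits.memory) should be verified against the Vespa documentation before use.

```xml
<!-- Hypothetical override: make the write filter see the pod limit
     instead of the host's physical memory. Placement and value are
     assumptions, not a confirmed recipe from this thread. -->
<content id="lexdoc" version="1.0">
  <config name="vespa.config.search.core.proton">
    <hwinfo>
      <memory>
        <size>64000000000</size> <!-- bytes, matching limits.memory: 64G -->
      </memory>
    </hwinfo>
  </config>
  <!-- redundancy, documents, nodes, etc. as in the existing services.xml -->
</content>
```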

kkraune commented 2 years ago

We made https://github.com/vespa-engine/vespa/issues/22444 to improve resource sensing; I will close this ticket once the doc is a bit refined.

kkraune commented 2 years ago

With Vespa-7.586.113 the resource sensing is improved; please let us know of any issues.