opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Searchable Snapshots] CAT allocation API reports wrong disk usage #13502

Open rlevytskyi opened 2 weeks ago

rlevytskyi commented 2 weeks ago

Describe the bug

While implementing the searchable snapshot feature, we ran into disk exhaustion on the search node. While trying to narrow it down, I noticed that the _cat/allocation API reports incorrect disk usage:

    % curl logs:9200/_cat/allocation\?v\&s=host
    shards disk.indices disk.used disk.avail disk.total disk.percent host
    2296   2.3tb        2.3tb     1tb        3.4tb      68           v80.co.com
    2297   2.3tb        2.3tb     1.1tb      3.4tb      68           v81.co.com
    2296   2.3tb        2.3tb     1tb        3.4tb      68           v82.co.com
    2296   2.4tb        2.4tb     1012.1gb   3.4tb      71           v83.co.com
    445    639.4gb      31.4gb    96.4gb     127.9gb    24           v87.co.com

As you can see, it reports that the search node (v87) has a 128GB disk, of which 31GB is in use.

However, the disk usage at the OS is quite different:

    v87:[/opt/os-search]# df -h /mnt/search/
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc1      128G  104G   25G  81% /mnt/search

That is, 104GB of 128GB is in use, and most of it (103GB) is in the '/mnt/search/nodes' directory.

Related component

Other

To Reproduce

  1. Install a search node with a dedicated data disk;
  2. Restore some indices as "remote_snapshot";
  3. Compare the disk usage reported by '_cat/allocation' with the OS 'df' output.
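
The comparison in step 3 can be sketched in Python, using the figures quoted above for v87. This is only an illustration: the helper names `to_bytes` and `usage_gap` are hypothetical, and it assumes that both the '_cat/allocation' units (e.g. "gb") and the 'df -h' units (e.g. "G") are powers of 1024.

```python
import re

# Binary multipliers keyed by the first letter of the unit.
# Assumption: "gb" (from _cat/allocation) and "G" (from df -h) both mean 1024**3 bytes.
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4, "p": 1024**5}


def to_bytes(text: str) -> int:
    """Convert a human-readable size such as '31.4gb' or '104G' to bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]*)\s*", text)
    if not match:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    unit = unit.lower()
    # Reduce two-letter units like "gb" to "g"; leave "b" and "g" untouched.
    key = unit[:-1] if len(unit) == 2 and unit.endswith("b") else unit
    if key not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[key])


def usage_gap(reported_used: str, os_used: str) -> float:
    """Return, in GiB, how much more the OS reports in use than the API does."""
    return (to_bytes(os_used) - to_bytes(reported_used)) / 1024**3


# Values taken from the v87 rows above: disk.used vs. the 'Used' column of df.
gap = usage_gap("31.4gb", "104G")
print(f"{gap:.1f} GiB in use on disk but missing from disk.used")
```

Because 'df -h' rounds its output, the result is approximate; it is meant to show the scale of the mismatch rather than an exact figure.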

Expected behavior

The '_cat/allocation' API should report the correct disk usage.

Additional Details

Additional context

By the way, it seems that limiting the cache size has no effect (maybe due to the incorrect report?). Here is the current limit:

    v87:[/opt/os-search]# grep cache config/opensearch.yml
    node.search.cache.size: 102gb

andrross commented 2 weeks ago

Related issue: #11676

andrross commented 1 week ago

[Triage - attendees 1 2 3 4] @rlevytskyi Thanks for filing