opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Support fs health check monitor on Azure Blob Storage / S3 #16743

Open audunsolemdal opened 4 days ago

audunsolemdal commented 4 days ago

Is your feature request related to a problem? Please describe

I am currently running the OpenSearch Helm chart on Kubernetes with monitor.fs.health.enabled = true. My node uses Azure Blob Storage as the backend, which seems to work fine, but the fs health check fails:

[ERROR][o.o.m.f.FsHealthService ] [datalev-opensearch-master-0] health check of [/usr/share/opensearch/data/nodes/0] failed

Describe the solution you'd like

Ideally, a health check that works on Azure Blob Storage / Amazon S3 storage.

Related component

Other

Describe alternatives you've considered

The current workaround is to set monitor.fs.health.enabled = false.

Additional context

Error log

2024-11-29 10:15:54.713 [2024-11-29T09:15:54,712][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [opensearch-master-0] Finished housekeeping task for auto refresh streaming jobs.
2024-11-29 10:15:54.712 [2024-11-29T09:15:54,711][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [opensearch-master-0] Starting housekeeping task for auto refresh streaming jobs.
2024-11-29 10:15:53.436 [2024-11-29T09:15:53,436][INFO ][o.o.j.s.JobSweeper       ] [opensearch-master-0] Running full sweep
2024-11-29 10:15:53.336     at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-11-29 10:15:53.336     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2024-11-29 10:15:53.336     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2024-11-29 10:15:53.336     at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336     at org.opensearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:246) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336     at org.opensearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:195) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336     at org.opensearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:228) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336     at java.base/sun.nio.ch.ChannelOutputStream.close(ChannelOutputStream.java:111) ~[?:?]
2024-11-29 10:15:53.336     at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:113) ~[?:?]
2024-11-29 10:15:53.336     at java.base/sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:210) ~[?:?]
2024-11-29 10:15:53.336     at java.base/jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
2024-11-29 10:15:53.336     at java.base/jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:178) ~[?:?]
2024-11-29 10:15:53.336     at java.base/sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:116) ~[?:?]
2024-11-29 10:15:53.336     at java.base/java.io.FileDescriptor$1.close(FileDescriptor.java:89) ~[?:?]
2024-11-29 10:15:53.336     at java.base/java.io.FileDescriptor.close(FileDescriptor.java:304) ~[?:?]
2024-11-29 10:15:53.336     at java.base/java.io.FileDescriptor.close0(Native Method) ~[?:?]
2024-11-29 10:15:53.336 java.io.IOException: Input/output error
2024-11-29 10:15:53.336 [2024-11-29T09:15:53,335][ERROR][o.o.m.f.FsHealthService  ] [opensearch-master-0] health check of [/usr/share/opensearch/data/nodes/0] failed

andrross commented 1 day ago

Here is what the health check does: https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java#L223-L229

tl;dr: create a file, write a byte, fsync, close file, delete file

This is all pretty straightforward stuff using the Java NIO API. I would expect anything that is acting as a filesystem to need to work for these APIs.
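
For anyone who wants to reproduce this outside OpenSearch, the sequence above can be approximated with a small standalone probe. This is only a sketch under assumptions, not the actual FsHealthService code: the class name `FsProbe`, the method `probe`, and the temp file name are made up for illustration, and it uses a single `FileChannel` for write and fsync, whereas the stack trace in this issue shows an `OutputStream` being closed. It reports the first step that throws.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of OpenSearch.
class FsProbe {

    /**
     * Runs create -> write -> fsync -> close -> delete under dir.
     * Returns "ok", or "<step> failed: <message>" for the first step that throws.
     */
    static String probe(Path dir) {
        // Assumed file name; the real health check uses its own temp file name.
        Path tmp = dir.resolve("fs_probe_" + System.nanoTime() + ".tmp");
        String step = "create";
        try {
            try (FileChannel ch = FileChannel.open(
                    tmp, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
                step = "write";
                ch.write(ByteBuffer.wrap(new byte[] {1}));
                step = "fsync";
                ch.force(true); // flush file content and metadata to the storage device
                step = "close";
                ch.close(); // explicit, so a close error is attributed to this step
            }
            step = "delete";
            Files.delete(tmp);
            return "ok";
        } catch (IOException e) {
            // Best-effort cleanup so a failed run does not leave the temp file behind.
            try {
                Files.deleteIfExists(tmp);
            } catch (IOException ignored) {
                // Nothing more to do if cleanup fails as well.
            }
            return step + " failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Pass the data path to test, e.g. the mounted volume; defaults to the JVM temp dir.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println(probe(dir));
    }
}
```

Running it against the mounted data path (for example `java FsProbe.java /usr/share/opensearch/data/nodes/0`) should show whether the failure is specific to the close step or also affects fsync or delete on that mount.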

@audunsolemdal It looks like your stack trace is pointing to a failure when attempting to close the OutputStream that was used to write a byte. Any idea why that might fail?

audunsolemdal commented 13 hours ago

> This is all pretty straightforward stuff using the Java NIO API. I would expect anything that is acting as a filesystem to need to work for these APIs.
>
> @audunsolemdal It looks like your stack trace is pointing to a failure when attempting to close the OutputStream that was used to write a byte. Any idea why that might fail?

I am not sure why it fails at that step, but Azure Blob Storage has some limitations compared to a full-fledged file system. I am mounting a Kubernetes Persistent Volume via the Azure Blob Storage CSI driver, which is based on Blobfuse2:

https://github.com/Azure/azure-storage-fuse?tab=readme-ov-file#un-supported-file-system-operations

So far I have not noticed any issues using this apart from the health check.