samuel-oci opened 6 months ago
Storage Plugin extension
Can you be more specific about what you mean here? I know there are the EnginePlugin, IndexStorePlugin, and RepositoryPlugin plugins all somewhat related to "storage".
transitive dependencies such as Hadoop and Guava into the OpenSearch runtime
Plugins are loaded in a separate classloader partly to avoid these problems, as I understand it. For example, the repository-hdfs plugin currently brings in both Hadoop and Guava dependencies.
At first glance, it seems to me that the existing plugin framework should be able to work for this case (meaning the overall framework, not necessarily that the existing interfaces provide the exact extension points you need). Do you have a strong requirement to run this as a non-JVM runtime?
Can you be more specific about what you mean here? I know there are the EnginePlugin, IndexStorePlugin, and RepositoryPlugin plugins all somewhat related to "storage".
I think a good way to illustrate the issue is to look at the KNN plugin. It's trying to extend the knnVectorsFormat in the Lucene interface to a native codec implementation. For that it has to override the Lucene consumer/producer interfaces of the knnVectorsFormat. For reference: here. To replace a codec you have to create a new codec service in EnginePlugin. The approach taken by the KNN plugin works well for it because at the end of the day it's a native library and going through JNI is the natural choice.
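For illustration, this is roughly the Lucene-level pattern involved (a minimal sketch with illustrative names, not the actual k-NN plugin code): a codec delegates everything to the default codec and swaps in only the one format it wants to replace, and an EnginePlugin then has to expose that codec through a custom codec service.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;

// Sketch: delegate everything to the current default codec and replace only
// the vectors format (e.g. with a native/JNI-backed implementation, as the
// k-NN plugin does). Class and codec names here are illustrative.
public final class CustomVectorsCodec extends FilterCodec {

    private final KnnVectorsFormat customFormat;

    public CustomVectorsCodec(KnnVectorsFormat customFormat) {
        super("CustomVectorsCodec", Codec.getDefault());
        this.customFormat = customFormat;
    }

    @Override
    public KnnVectorsFormat knnVectorsFormat() {
        // Every other format (stored fields, doc values, postings, ...) still
        // comes from the delegate default codec.
        return customFormat;
    }
}
```

Replacing the StoredFields or DocValues formats with a Parquet/Avro-backed implementation would follow the same shape, just overriding storedFieldsFormat() / docValuesFormat() instead.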
Plugins are loaded in a separate classloader partly to avoid these problems, as I understand it. For example, the repository-hdfs plugin currently brings in both Hadoop and Guava dependencies.
The JarHell tool will scan the plugin dependencies and look for collisions. Sometimes the dependencies themselves have internal collisions. But there's not much you can do about it. At least that's what I noticed when I was trying to include ParquetWriter.
At first glance, it seems to me that the existing plugin framework should be able to work for this case (meaning the overall framework, not necessarily that the existing interfaces provide the exact extension points you need). Do you have a strong requirement to run this as a non-JVM runtime?
True, I commented in the proposal as well that there is no technical limitation at the moment that mandates an external runtime for this plugin. Given sufficient time and effort it is possible to fix most issues and get it to work. I did notice in my POC, however, that it was a lot faster and cleaner to leverage the external writer approach. I ran into many JarHell issues just by including the ParquetWriter and Avro dependencies, and had even more issues with the Hadoop FileSystem missing some esoteric configurations (when included in the OpenSearch runtime) that I didn't encounter when starting a clean project. Then it occurred to me to propose this here as a potential extension mechanism for low level codec extensions that would otherwise require something like JNI.
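To give a sense of why those dependencies leak in: even the simplest Parquet write through parquet-avro goes through Hadoop classes. A rough sketch (file path and schema made up for illustration):

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteExample {
    public static void main(String[] args) throws IOException {
        // Avro schema for the records being written (made up for this example).
        Schema schema = SchemaBuilder.record("Doc").fields()
                .requiredString("id")
                .requiredString("body")
                .endRecord();

        // Hadoop Path and Configuration are required even for a local file,
        // which is how Hadoop (and transitively Guava) ends up on the classpath.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/docs.parquet"))
                .withSchema(schema)
                .withConf(new Configuration())
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "1");
            record.put("body", "hello world");
            writer.write(record);
        }
    }
}
```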
Sometimes the dependencies themselves have internal collisions. But there's not much you can do about it.
I think JarHell is telling you that there are two implementations of the same fully qualified class name on the classpath. You would want to fix this regardless of how you were running that JVM since classloading is non-deterministic in this case.
Then it occurred to me to propose this here as a potential extension mechanism for low level codec extensions that would otherwise require something like JNI.
I would need more details on the specific implementation of this approach to really comment on it (please do share some version of your POC, even if only the OpenSearch changes, if you can). But I would expect that you'd have to introduce some kind of extension point into the core here, and then provide a no-op implementation for it by default, and optionally the IPC version of it that talks to an external process. It seems this might look a lot like the existing plugin framework? And once you have the plugin extension point you could provide an IPC-based implementation or an in-JVM implementation of it.
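To make that shape a bit more concrete, here is a minimal sketch of what such an extension point could look like — all names here are hypothetical, not an existing OpenSearch interface — with a no-op default binding and an IPC binding that forwards documents to an external process over a Unix domain socket:

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical extension point the core would call when documents should also
// be written in an external format. Names are illustrative only.
public interface ExternalFieldsWriter {
    void writeDocument(String json) throws IOException;

    // Default binding: do nothing, so core behavior is unchanged when no
    // plugin provides an implementation.
    ExternalFieldsWriter NO_OP = json -> {};
}

// IPC binding: forwards each document to an external writer process over a
// Unix domain socket (JDK 16+).
class SocketExternalFieldsWriter implements ExternalFieldsWriter {

    private final SocketChannel channel;

    SocketExternalFieldsWriter(Path socketFile) throws IOException {
        channel = SocketChannel.open(StandardProtocolFamily.UNIX);
        channel.connect(UnixDomainSocketAddress.of(socketFile));
    }

    @Override
    public void writeDocument(String json) throws IOException {
        byte[] payload = (json + "\n").getBytes(StandardCharsets.UTF_8);
        channel.write(ByteBuffer.wrap(payload));
    }
}
```

An in-JVM binding would just be another implementation of the same interface, which is why this ends up looking a lot like the existing plugin framework.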
I think JarHell is telling you that there are two implementations of the same fully qualified class name on the classpath. You would want to fix this regardless of how you were running that JVM since classloading is non-deterministic in this case.
While I agree this is the right practice, unfortunately when two implementations of the same class exist in an underlying dependency, it's not always easy to fix without modifying its source code.
I would need more details on the specific implementation of this approach to really comment on it (please do share some version of your POC, even if only the OpenSearch changes, if you can). But I would expect that you'd have to introduce some kind of extension point into the core here, and then provide a no-op implementation for it by default, and optionally the IPC version of it that talks to an external process. It seems this might look a lot like the existing plugin framework? And once you have the plugin extension point you could provide an IPC-based implementation or an in-JVM implementation of it.
I understand it could be hard to visualize. I will try to provide a draft of the POC pretty soon to show the example.
I am interested in extending DocValues and StoredFields codecs to use a format such as Parquet or Avro. The main reasoning behind it is that those are highly popular formats that can be easily read by other popular projects such as Apache Spark.
Can you also expand a bit on how the overall feature would be used? Namely, how does this approach compare to writing Parquet files in tandem with indexing into OpenSearch at ingestion time by using a tool like Data Prepper?
Thanks @samuel-oci. I definitely like the idea as it is almost in line with #12948. However, like @andrross I am curious too as to what the extension point looks like, given the issues you describe running into while setting up the writer.
Also, if we can make one open format work for search queries requiring source fields, I see that as a win since we might end up saving some redundant storage cost.
@andrross @Bukhtawar Perhaps I didn't make it too clear earlier, but there is one comment I forgot to put in the description regarding the motivation. If in the future we have Lucene extensions in Python/Rust etc., I believe integration at the storage encoding level can use a similar solution. So even if we can eventually get the EnginePlugin to work with Parquet/Avro, I think this also achieves greater flexibility in incorporating new implementations of Lucene codecs.
I have now edited the description and added it to the advantages section.
Is your feature request related to a problem? Please describe
I am interested in extending DocValues and StoredFields codecs to use a format such as Parquet or Avro. The main reasoning behind it is that those are highly popular formats that can be easily read by other popular projects such as Apache Spark.
There are currently a number of ways to do so, but they bring transitive dependencies such as Hadoop and Guava into the OpenSearch runtime, which leads to JarHell and other runtime issues that are making this process very difficult and could interfere with other plugins as well.

Describe the solution you'd like
I have created a POC that seems to work well in extending to both Avro and Parquet by leveraging the approach of an external writer that is spawned by the main OpenSearch engine. The engine communicates with the external writer via IPC based on system sockets. This presents a number of advantages:
Related component
Storage
Describe alternatives you've considered
I have considered extending the core engine interface itself to be format agnostic (not dependent on Lucene APIs). However, the engine interfaces are tightly bound to the Lucene spec. For example, it relies on segmentInfos, etc. Since those APIs are generic enough, I didn't see a need to replicate them into a non-Lucene API spec.

Additional context
No response