
[RFC] Parquet/Avro Storage Extension With External Writer #13668

Open samuel-oci opened 6 months ago

samuel-oci commented 6 months ago

Is your feature request related to a problem? Please describe

I am interested in extending the DocValues and StoredFields codecs to use a format such as Parquet or Avro. The main reasoning behind this is that these are highly popular formats that can easily be read by other popular projects such as Apache Spark.
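For context, here is a minimal sketch of the Lucene codec extension point involved. `FilterCodec` and `Codec.forName` are real Lucene APIs; `ParquetStoredFieldsFormat` and `AvroDocValuesFormat` are hypothetical placeholders for what this RFC proposes:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;

// Sketch: a codec that keeps the default Lucene behavior but swaps the
// stored-fields and doc-values encodings for Parquet/Avro-backed ones.
public class ParquetCodec extends FilterCodec {
    // Hypothetical format implementations; they would write Parquet/Avro
    // files instead of the default Lucene file formats.
    private final StoredFieldsFormat storedFields = new ParquetStoredFieldsFormat();
    private final DocValuesFormat docValues = new AvroDocValuesFormat();

    public ParquetCodec() {
        // Delegate everything else to the current default codec
        // (the delegate name is version-specific).
        super("ParquetCodec", Codec.forName("Lucene99"));
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return storedFields;
    }

    @Override
    public DocValuesFormat docValuesFormat() {
        return docValues;
    }
}
```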

There are currently a number of ways to do so:

  1. Storage Plugin extension - This is the most intuitive immediate solution. However, it ends up pulling many of Parquet's transitive dependencies, such as Hadoop and Guava, into the OpenSearch runtime. This results in JarHell and other runtime issues that make the process very difficult and could interfere with other plugins as well.
  2. JNI - It's possible to extend the Lucene codec via JNI calls and therefore perform the writes in a different runtime. However, it doesn't seem intuitive and adds an additional layer of complexity. For example, ParquetWriter is a Java library, so going Java->JNI->C++->Java seems like an unnecessary detour.
  3. Extensions - The extensions framework currently does not work for storage/engine extension. By design it seems mainly geared towards offloading compute to other nodes, and doesn't look like an appropriate solution for collocated, low-level, shared-memory-style operations like disk writes. Moreover, the extensions SDK still contains many runtime dependencies that can collide with those of storage plugins.

Describe the solution you'd like

I have created a POC that seems to work well in extending to both Avro and Parquet, by leveraging an external writer that is spawned by the main OpenSearch engine. The engine communicates with the external writer via IPC over system sockets (see the sketch after the list below). This presents a number of advantages:

  1. Clean runtime - No runtime collisions.
  2. Collocation and management - By spawning the external writer from the main process and wrapping it as a resource whose lifecycle is bound to the engine, it is easy to manage and requires no special knowledge elsewhere in the system.
  3. Flexibility to integrate non-Java implementations - In the future we can integrate, in a similar way, codec implementations written in non-Java languages such as Python or Rust.
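To make the collocation and management point concrete, here is a minimal sketch of such a lifecycle wrapper, assuming a Unix domain socket (Java 16+) as the IPC channel; `ExternalWriterClient` and its wire format are hypothetical illustrations, not the actual POC code:

```java
import java.io.Closeable;
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;

// Sketch: spawn the external writer from the engine process and bind its
// lifetime to this resource, so closing the engine-side handle also tears
// down the socket and the child process.
public final class ExternalWriterClient implements Closeable {
    private final Process writerProcess;
    private final SocketChannel channel;

    public ExternalWriterClient(Path writerBinary, Path socketPath) throws IOException {
        // Spawn the writer; it is expected to listen on socketPath. A real
        // implementation would wait for the socket to appear before connecting.
        writerProcess = new ProcessBuilder(writerBinary.toString(), socketPath.toString())
            .inheritIO()
            .start();
        channel = SocketChannel.open(StandardProtocolFamily.UNIX);
        channel.connect(UnixDomainSocketAddress.of(socketPath));
    }

    // Send one encoded batch (e.g. a Parquet row group) to the writer.
    public void write(ByteBuffer encodedBatch) throws IOException {
        while (encodedBatch.hasRemaining()) {
            channel.write(encodedBatch);
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
        writerProcess.destroy(); // engine shutdown also stops the writer
    }
}
```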

Related component

Storage

Describe alternatives you've considered

I have considered extending the core engine interface itself to be format agnostic (not dependent on Lucene APIs). However, the engine interfaces are tightly bound to the Lucene spec; for example, they rely on SegmentInfos. Since those APIs are generic enough, I didn't see a need to replicate them in a non-Lucene API spec.

Additional context

No response

andrross commented 6 months ago

Storage Plugin extension

Can you be more specific about what you mean here? I know there are the EnginePlugin, IndexStorePlugin, and RepositoryPlugin plugins all somewhat related to "storage".

transitive dependencies such as Hadoop and Guava into the OpenSearch runtime

Plugins are loaded in a separate classloader partly to avoid these problems, as I understand it. For example, the repository-hdfs plugin currently brings in both Hadoop and Guava dependencies.

At first glance, it seems to me that the existing plugin framework should be able to work for this case (meaning the overall framework, not necessarily that the existing interfaces provide the exact extension points you need). Do you have a strong requirement to run this as a non-JVM runtime?

samuel-oci commented 6 months ago

Can you be more specific about what you mean here? I know there are the EnginePlugin, IndexStorePlugin, and RepositoryPlugin plugins all somewhat related to "storage".

I think a good way to illustrate the issue is to look at the KNN plugin. It's trying to extend the knnVectorsFormat in the Lucene interface with a native codec implementation, so it has to override Lucene's consumer/producer interfaces for the knnVectorsFormat (for reference: here). To replace a codec you have to create a new codec service in an EnginePlugin. The approach taken by the KNN plugin works well for it because at the end of the day it wraps a native library, and going through JNI is the natural choice.
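For illustration, that EnginePlugin extension point looks roughly like the sketch below. The exact factory method name and the `CodecServiceFactory` shape are assumptions based on my reading of the interface and may differ across OpenSearch versions; `ParquetCodecService` is hypothetical:

```java
import java.util.Optional;

import org.opensearch.index.IndexSettings;
import org.opensearch.index.codec.CodecServiceFactory;
import org.opensearch.plugins.EnginePlugin;
import org.opensearch.plugins.Plugin;

// Sketch: a plugin that registers a custom codec service, which in turn
// returns a Codec whose StoredFieldsFormat/DocValuesFormat write
// Parquet/Avro instead of the default Lucene formats.
public class ParquetStoragePlugin extends Plugin implements EnginePlugin {
    @Override
    public Optional<CodecServiceFactory> getCustomCodecServiceFactory(IndexSettings indexSettings) {
        // ParquetCodecService is hypothetical.
        return Optional.of(config -> new ParquetCodecService(config));
    }
}
```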

Plugins are loaded in a separate classloader partly to avoid these problems, as I understand it. For example, the repository-hdfs plugin currently brings in both Hadoop and Guava dependencies.

The JarHell tool will scan the plugin dependencies and look for collisions. Sometimes the dependencies themselves have internal collisions, and there isn't much you can do about that. At least that's what I noticed when I was trying to include ParquetWriter.

At first glance, it seems to me that the existing plugin framework should be able to work for this case (meaning the overall framework, not necessarily that the existing interfaces provide the exact extension points you need). Do you have a strong requirement to run this as a non-JVM runtime?

True, I noted in the proposal as well that there is no technical limitation at the moment that mandates an external runtime for this plugin. Given sufficient time and effort it is possible to fix most issues and get it to work. I did notice in my POC, however, that it was a lot faster and cleaner to leverage the external writer approach. I ran into many JarHell issues just by including the ParquetWriter and Avro dependencies, and had even more issues with the Hadoop FileSystem missing some esoteric configurations (when included in the OpenSearch runtime) that I didn't encounter when starting a clean project. Then it occurred to me to propose this here as a potential extension mechanism for low level codec extensions that would otherwise require something like JNI.

andrross commented 6 months ago

Sometimes the dependencies themselves have internal collisions, and there isn't much you can do about that.

I think JarHell is telling you that there are two implementations of the same fully qualified class name on the classpath. You would want to fix this regardless of how you were running that JVM since classloading is non-deterministic in this case.

Then it occurred to me to propose this here as a potential extension mechanism for low level codec extensions that would otherwise require something like JNI.

I would need more details on the specific implementation of this approach to really comment on it (please do share some version of your POC, even if only the OpenSearch changes, if you can). But I would expect that you'd have to introduce some kind of extension point into the core here, and then provide a no-op implementation for it by default, and optionally the IPC version of it that talks to an external process. It seems this might look a lot like the existing plugin framework? And once you have the plugin extension point you could provide an IPC-based implementation or an in-JVM implementation of it.
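As a rough sketch of that shape (all names hypothetical, not an existing OpenSearch API): the core would call a small interface that defaults to a no-op, and a plugin could supply either an in-JVM implementation or one that forwards to an external process over IPC:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical core-side extension point. With the no-op default, core
// behavior is unchanged; a plugin can register an in-JVM implementation
// or one that proxies each call to an external writer process.
public interface ExternalCodecWriter {
    void writeSegment(String segmentName, ByteBuffer payload) throws IOException;

    // Default used when no plugin overrides the extension point.
    ExternalCodecWriter NO_OP = (segmentName, payload) -> {};
}
```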

samuel-oci commented 6 months ago

I think JarHell is telling you that there are two implementations of the same fully qualified class name on the classpath. You would want to fix this regardless of how you were running that JVM since classloading is non-deterministic in this case.

While I agree this is the right practice, unfortunately when two implementations of the same class come in through the underlying dependencies, it's not always easy to fix without modifying their source code.

I would need more details on the specific implementation of this approach to really comment on it (please do share some version of your POC, even if only the OpenSearch changes, if you can). But I would expect that you'd have to introduce some kind of extension point into the core here, and then provide a no-op implementation for it by default, and optionally the IPC version of it that talks to an external process. It seems this might look a lot like the existing plugin framework? And once you have the plugin extension point you could provide an IPC-based implementation or an in-JVM implementation of it.

I understand it could be hard to visualize. I will try to provide a draft of the POC pretty soon to show the example.

andrross commented 6 months ago

I am interested in extending the DocValues and StoredFields codecs to use a format such as Parquet or Avro. The main reasoning behind this is that these are highly popular formats that can easily be read by other popular projects such as Apache Spark.

Can you also expand a bit on how the overall feature would be used? Namely, how does this approach compare to writing Parquet files in tandem with indexing into OpenSearch at ingestion time by using a tool like Data Prepper?

Bukhtawar commented 6 months ago

Thanks @samuel-oci. I definitely like the idea, as it is almost in line with #12948. However, like @andrross I am curious as to what the extension point looks like, given the issues you describe running into while setting up the writer.

Also, I think if we can make one open format work for search queries requiring source fields, that's a win, since we might end up saving some redundant storage cost.

peternied commented 6 months ago

[Triage] @samuel-oci Thanks for creating this RFC

samuel-oci commented 6 months ago

@andrross @Bukhtawar Perhaps I didn't make it clear earlier, but there is one point about the motivation that I forgot to put in the description. If in the future we have Lucene codec implementations in Python/Rust etc., I believe integration at the storage-encoding level can use a similar solution. So even if we can eventually get an EnginePlugin to work with Parquet/Avro, I think this approach also achieves greater flexibility in incorporating new implementations of Lucene codecs.

I have now edited the description and added this to the advantages section.