Open Bukhtawar opened 7 months ago
@Bukhtawar - Thanks for the proposal. Format could mean two things here: i) the format of the data represented as part of the document, ii) the format of the data at rest (compressed and stored). Currently the index codec defines both. Are you suggesting changing both, or just the first one?
Thanks @backslasht. Here I intend to keep the data stored at rest in a format that makes it easier for diverse query engines to be plugged in and helps data break free from the Lucene version compatibility constraints as much as possible.
Looping @reta @andrross @msfroh @tharejas @sachinpkale @gbbafna for thoughts
Nice proposal!
I am trying to understand the scope of this feature request with the following questions:
For my understanding, is the _source field part of the Lucene segment today? If yes, even if we change its type from a special field to a neutral type, say JSON, we still need Lucene to read the field first, right? Or are we proposing to store the _source independent of segments?
> Use a query engine of their choice to query original data, even if the Lucene data formats changed
Does querying the original data from another query engine bypass OpenSearch, or does this also mean OpenSearch would support pluggable query engines?
As far as I remember, the _source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?
+1. I like the overall idea of decoupling the source from the engine. A couple of questions/thoughts:
> it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?
@reta @Bukhtawar There indeed seems to be some overlap here with an ingestion tool like Data Prepper, where you can configure another sink alongside OpenSearch and store the data in a neutral, analytics-friendly format. The two use cases listed in this issue ("use any query engine" and "reindex seamlessly") could be solved by ingesting the original data into an additional sink. However, in that case OpenSearch has no knowledge of the other data and cannot use it the way that it uses the _source field today. It's an interesting thought to consider whether we can replace the existing _source field that OpenSearch knows about and uses with a neutral, more future-proof format and kind of get the best of both worlds.
Thanks @Bukhtawar for the proposal.
I definitely see the value of storing the _source field in a data format (considering it is just a document blob) that is not bound to the Lucene engine version, especially for re-indexing.
> As far as I remember, the _source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?
That's true. I don't think you can rely on the _source field, since it can be disabled: https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field
Obviously we are talking about a new data format that would apply from newer versions onwards. Based on how the proposal goes, we can always decide to change that if we see good benefits, especially as OpenSearch has good support for durability but gets constrained on data compatibility.
A few thoughts:
Question: @Bukhtawar Do we think that storing the original docs outside of Lucene would enable us to compress them better, reducing the burden of storage?
Hi @Bukhtawar, that's a very interesting suggestion! Some clarification questions to make sure I get it right:
_source is a codec that extends StoredFieldsFormat in Lucene. Are you suggesting moving entirely from the Lucene StoredFieldsFormat interface to a new interface?
Context: I currently have a working POC in which I extended the _source field to work with the Parquet format. I did so by extending StoredFieldsFormat in the Lucene interfaces. I would love to share any pros/cons I have seen.
Is your feature request related to a problem? Please describe
While writing data in Lucene format enables faster queries, it also limits queries to a compatible Lucene query engine. As data grows over time, the need to keep the engine compatible with the older data format imposes another constraint, forcing users to choose between getting the benefits of newer versions and keeping older-format data readable. To upgrade the engine, the data indexed in older formats then needs to be re-indexed, which requires the data to be read from the _source field with a compatible Lucene engine before individual documents can be re-indexed into a target version.
Describe the solution you'd like
The _source field stores the raw doc as a special field; however, this field can only be read by a compatible Lucene version. It would be good if we could store this field in an open/neutral format. This would enable users to use a query engine of their choice to query the original data, even if the Lucene data formats changed, and to reindex seamlessly.
There could be caveats, though, with query performance where the actual doc needs to be returned, depending on the data format; this needs to be evaluated further.
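To make the idea concrete, here is a minimal sketch of what "raw docs in a neutral format" could look like. It is not part of any OpenSearch or Lucene API. It assumes gzip-compressed newline-delimited JSON as the neutral format, and the function names (write_raw_docs, read_raw_docs, to_bulk_actions), the record layout, and the file name are all hypothetical. The point is only that the documents can be read back and turned into bulk-style index actions using nothing but a standard library, with no Lucene version involved.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, Tuple

Doc = Dict[str, object]


def write_raw_docs(path: str, docs: Iterable[Tuple[str, Doc]]) -> int:
    """Store (doc_id, source) pairs as gzip-compressed NDJSON; return the count."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for doc_id, source in docs:
            # Hypothetical record layout: one JSON object per line.
            out.write(json.dumps({"_id": doc_id, "_source": source}))
            out.write("\n")
            count += 1
    return count


def read_raw_docs(path: str) -> Iterator[Tuple[str, Doc]]:
    """Read the raw docs back using only the standard library."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            record = json.loads(line)
            yield record["_id"], record["_source"]


def to_bulk_actions(path: str, target_index: str) -> Iterator[Doc]:
    """Yield action/source pairs in the shape used by bulk indexing requests."""
    for doc_id, source in read_raw_docs(path):
        yield {"index": {"_index": target_index, "_id": doc_id}}
        yield source


if __name__ == "__main__":
    # Hypothetical file name and index name, for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "raw_docs.ndjson.gz")
    write_raw_docs(path, [("1", {"title": "hello"}), ("2", {"title": "world"})])
    for line in to_bulk_actions(path, "my-index-v2"):
        print(json.dumps(line))
```

A columnar format such as Parquet would likely serve analytics engines better than NDJSON. The sketch uses NDJSON only because it needs no third-party dependency. The caveat above still applies: fetching a single document by ID from a file like this requires a scan unless a separate offset index is kept.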
Related component
Storage
Describe alternatives you've considered
No response
Additional context
No response