jprante / elasticsearch-knapsack

Knapsack plugin is an import/export tool for Elasticsearch
Apache License 2.0

Export index fields not included in _source #82

Closed nistvan86 closed 9 years ago

nistvan86 commented 9 years ago

In the mapping of Elasticsearch documents you can exclude fields from the _source field. This still allows searching in those fields but the original content is never represented in the search result.

For example:

{
  "paragraph": {
    "_source": {
      "excludes": ["fulltext"]
    },
    "properties": {
      "fulltext": {
        "type": "string",
        "term_vector": "yes"
      },
      "title": {
        "type": "string"
      }
    }
  }
}

I'm experimenting with Knapsack on an index like this and noticed that it doesn't export any binary index data, only the settings, mappings and the indexed documents' _source field contents, which leads to data loss in this case.

I understand that exporting binary data requires exact version compatibility of Elasticsearch nodes, but if this is not an issue, can it be done somehow?

jprante commented 9 years ago

See https://github.com/jprante/elasticsearch-knapsack#export-search-results where I show how additional fields can be specified, like _parent. It's exactly like a search request.

Note that the fields must have the mapping option store set to true; otherwise, Elasticsearch cannot retrieve them.
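As a rough illustration of the store option, here is a minimal sketch of the mapping from the original post with store enabled on the excluded field. It reuses the field names from that example and is an assumption about how such a mapping could look, not a configuration taken from the Knapsack documentation.

```json
{
  "paragraph": {
    "_source": {
      "excludes": ["fulltext"]
    },
    "properties": {
      "fulltext": {
        "type": "string",
        "term_vector": "yes",
        "store": true
      },
      "title": {
        "type": "string"
      }
    }
  }
}
```

With a mapping like this, the field would be requested by name in the search-style request body, alongside _source. Keep in mind that a stored field can be returned in search results even though it is excluded from _source.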

If you want a low-level index backup tool that does not care about fields and their mappings, the built-in snapshot/restore feature of Elasticsearch is the best choice.

nistvan86 commented 9 years ago

Thanks for describing the available options.

Sadly, none of these fits my needs. Knapsack's selective exporting, combined with its hot-standby-like data synchronization between nodes, is what I'm looking for. But it's a requirement that I cannot store the original values of some fields in my index in a retrievable form. Still, the indexed data must not be lost during export.

nistvan86 commented 9 years ago

Gave up on Elasticsearch in the end and wrote my own JSON exporter for Lucene based on the CodecReader API which is capable of doing this. Closing.