High memory usage with langdetect

We're trying to use langdetect with ES 2.4.4 for our project, but we're running into very high memory usage. Without langdetect we usually have a memory footprint of only 1 to 2 GB. With langdetect we are already at 5 to 7 GB in our dev environment, which doesn't even have any real load on it.

Honestly, I have no idea whether this is normal, whether it is a bug, or whether we are doing something wrong with our schema. So is this normal? We don't have billions of documents; in fact we don't have much data, but we need to be able to search it well and get quality results. That is the main reason we use ES for this project, and we use langdetect because we depend heavily on multilingual content.
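To narrow down where the extra memory goes, it may help to compare JVM heap and non-heap usage per node, with and without the plugin installed. The following is only a minimal sketch: it assumes the node stats endpoint `/_nodes/stats/jvm` is reachable over HTTP, the base URL is a placeholder, and the helper names `fetch_jvm_stats` and `summarize_jvm_memory` are made up for illustration.

```python
import json
from urllib.request import urlopen

MB = 1024 * 1024


def summarize_jvm_memory(stats):
    """Map node name -> (heap_used_mb, heap_max_mb, non_heap_used_mb).

    `stats` is the parsed JSON body of GET /_nodes/stats/jvm.
    """
    summary = {}
    for node in stats.get("nodes", {}).values():
        mem = node["jvm"]["mem"]
        summary[node["name"]] = (
            mem["heap_used_in_bytes"] // MB,
            mem["heap_max_in_bytes"] // MB,
            mem["non_heap_used_in_bytes"] // MB,
        )
    return summary


def fetch_jvm_stats(base_url="http://localhost:9200"):
    """Fetch JVM stats for all nodes; base_url is a placeholder, not our real host."""
    with urlopen(base_url + "/_nodes/stats/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a running cluster (not executed here):
#   for name, (used, limit, non_heap) in sorted(summarize_jvm_memory(fetch_jvm_stats()).items()):
#       print(name, "heap", used, "/", limit, "MB, non-heap", non_heap, "MB")
```

Running this once with the plugin and once without would show whether the difference sits in the heap or elsewhere in the process.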

Our dev cluster is a mirror of the production system:

elasticsearch.yml:

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please see the documentation for further information on configuration options:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration.html>
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
# cluster.name: my-application
#
cluster.name: search.wa-network.ch

# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
# node.name: node-1

node.name: {{with node}}{{.Node.Node}}{{end}}

#
# Add custom attributes to the node:
#
# node.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
# path.data: /path/to/data
#
# Path to log files:
#
# path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
# bootstrap.memory_lock: true
#
# Make sure that the `ES_HEAP_SIZE` environment variable is set to about half the memory
# available on the system and that the owner of the process is allowed to use this limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
# network.host: 192.168.0.1
network.host: {{with node}}{{.Node.Address}}{{end}}

#
# Set a custom port for HTTP:
#
# http.port: 9200
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html>
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["172.16.1.101", "172.16.1.102", "172.16.1.103", "172.16.1.160", "172.16.1.161"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of nodes / 2 + 1):
#
# discovery.zen.minimum_master_nodes: 3
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery.html>
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
# gateway.recover_after_nodes: 3
#
# For more information, see the documentation at:
# <http://www.elastic.co/guide/en/elasticsearch/reference/current/modules-gateway.html>
#
# ---------------------------------- Various -----------------------------------
#
# Disable starting multiple nodes on a single system:
#
# node.max_local_storage_nodes: 1
#
# Require explicit names when deleting indices:
#
# action.destructive_requires_name: true
bootstrap.mlockall: true

discovery.zen.fd.ping_timeout: 30s

discovery.zen.minimum_master_nodes: 2

Mapping:

This is just one of our types; I can post the mappings for the others as well if needed.

  "curricula_vitaes": {
    "properties": {
      "ja_seeker_pic_id": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "city": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_city",
          "ja": "ja_city",
          "en": "en_city"
        },
        "type": "langdetect"
      },
      "de_description": {
        "analyzer": "german",
        "type": "string"
      },
      "en_seeker_portfolio_id": {
        "analyzer": "english",
        "type": "string"
      },
      "description": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_description",
          "ja": "ja_description",
          "en": "en_description"
        },
        "type": "langdetect"
      },
      "en_description": {
        "analyzer": "english",
        "type": "string"
      },
      "title": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_title",
          "ja": "ja_title",
          "en": "en_title"
        },
        "type": "langdetect"
      },
      "de_seeker_pic_id": {
        "analyzer": "german",
        "type": "string"
      },
      "ja_seeker_portfolio_id": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "ja_seeker_cv_id": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "de_city": {
        "analyzer": "german",
        "type": "string"
      },
      "de_seeker_portfolio_id": {
        "analyzer": "german",
        "type": "string"
      },
      "de_seeker_cv_id": {
        "analyzer": "german",
        "type": "string"
      },
      "en_seeker_cv_id": {
        "analyzer": "english",
        "type": "string"
      },
      "de_title": {
        "analyzer": "german",
        "type": "string"
      },
      "seeker_pic_id": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_seeker_pic_id",
          "ja": "ja_seeker_pic_id",
          "en": "en_seeker_pic_id"
        },
        "type": "langdetect"
      },
      "created": {
        "format": "yyyy-MM-dd HH:mm:ss",
        "type": "date"
      },
      "ja_description": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "ja_city": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "en_city": {
        "analyzer": "english",
        "type": "string"
      },
      "seeker_cv_id": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_seeker_cv_id",
          "ja": "ja_seeker_cv_id",
          "en": "en_seeker_cv_id"
        },
        "type": "langdetect"
      },
      "en_title": {
        "analyzer": "english",
        "type": "string"
      },
      "ja_title": {
        "analyzer": "ja_analyzer",
        "type": "string"
      },
      "en_seeker_pic_id": {
        "analyzer": "english",
        "type": "string"
      },
      "seeker_portfolio_id": {
        "languages": [
          "de",
          "en",
          "ja"
        ],
        "analyzer": "_keyword",
        "position_increment_gap": 100,
        "language_to": {
          "de": "de_seeker_portfolio_id",
          "ja": "ja_seeker_portfolio_id",
          "en": "en_seeker_portfolio_id"
        },
        "type": "langdetect"
      }
    }
  }
}
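The type above declares six `langdetect` fields (city, description, title, seeker_pic_id, seeker_cv_id, seeker_portfolio_id). Whether the number of such fields has anything to do with the memory footprint is not established here, but counting them across all types gives a concrete number to report. This is a rough sketch that only looks at top-level properties; the helper name `langdetect_fields` is hypothetical and the example mapping is abbreviated from the one above.

```python
def langdetect_fields(type_mapping):
    """Return the sorted names of top-level properties whose type is "langdetect"."""
    props = type_mapping.get("properties", {})
    return sorted(
        name for name, spec in props.items() if spec.get("type") == "langdetect"
    )


# Abbreviated excerpt of the curricula_vitaes mapping, for illustration only.
example = {
    "properties": {
        "city": {"type": "langdetect", "languages": ["de", "en", "ja"]},
        "de_city": {"type": "string", "analyzer": "german"},
        "title": {"type": "langdetect", "languages": ["de", "en", "ja"]},
        "created": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
    }
}

found = langdetect_fields(example)
print(len(found), found)
```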

@netstyler
