victordibia / neuralqa

NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT
https://victordibia.github.io/neuralqa/
MIT License

Enable stemmer filter in elasticsearch index #61

Open jvence opened 3 years ago

jvence commented 3 years ago

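I think it would be a good idea to update data_utils.py to include a stemming filter by default when creating Elasticsearch indices. With stemming, a query term such as "running" also matches documents that contain "run" or "runs", which should noticeably improve the recall of the results returned by ES.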

jvence commented 3 years ago

Here's my take on it:

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                }
            },
            "analyzer": {
                "rebuilt_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "rebuilt_english"
                }
            }
        }
    }
}
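
To make the settings above easier to reuse from data_utils.py, here is a minimal sketch of a helper that builds the same index body. The function name `build_index_settings` and its parameters `text_field` and `doc_type` are hypothetical and not part of NeuralQA. The typeless default is based on my understanding that Elasticsearch 7+ expects mappings without a type name by default; pass `doc_type="_doc"` to get the typed layout shown above.

```python
import json


def build_index_settings(text_field="content", doc_type=None):
    """Build an index body with a lowercase + stop + stemmer analyzer.

    Hypothetical helper (not part of NeuralQA).
    doc_type=None  -> typeless mappings (assumed target: Elasticsearch 7+).
    doc_type="_doc" -> typed mappings, as in the dict posted above.
    """
    field_mapping = {
        "properties": {
            # Use the custom analyzer for the searchable text field.
            text_field: {"type": "text", "analyzer": "rebuilt_english"}
        }
    }
    # Wrap the properties in a type name only when one is requested.
    mappings = {doc_type: field_mapping} if doc_type else field_mapping

    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Drop common English stop words.
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    # Reduce words to their stem ("running" -> "run").
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                },
                "analyzer": {
                    "rebuilt_english": {
                        "tokenizer": "standard",
                        # Filters run in this order on every token.
                        "filter": ["lowercase", "english_stop", "english_stemmer"],
                    }
                },
            }
        },
        "mappings": mappings,
    }


if __name__ == "__main__":
    # Print the body so it can be inspected or pasted into a request.
    print(json.dumps(build_index_settings(), indent=2))
```

With the 7.x Python client, the result could then be passed along the lines of `es.indices.create(index=index_name, body=build_index_settings())`, where `es` and `index_name` are placeholders. Because the analyzer is applied at index time, existing indices would need to be recreated and their documents reindexed for the stemmer to take effect.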